p-search: a local search engine in Emacs
Zac Romero - zacromero@posteo.com
Format: 23-min talk; Q&A: BigBlueButton conference room https://media.emacsconf.org/2024/current/bbb-p-search.html
Etherpad: https://pad.emacsconf.org/2024-p-search
Discuss on IRC: #emacsconf-dev
Status: Q&A open for participation
Saturday, Dec 7 2024, ~9:20 AM - 9:45 AM MST (US/Mountain)
Saturday, Dec 7 2024, ~8:20 AM - 8:45 AM PST (US/Pacific)
Saturday, Dec 7 2024, ~4:20 PM - 4:45 PM UTC
Saturday, Dec 7 2024, ~5:20 PM - 5:45 PM CET (Europe/Paris)
Saturday, Dec 7 2024, ~6:20 PM - 6:45 PM EET (Europe/Athens)
Saturday, Dec 7 2024, ~9:50 PM - 10:15 PM IST (Asia/Kolkata)
Sunday, Dec 8 2024, ~12:20 AM - 12:45 AM +08 (Asia/Singapore)
Sunday, Dec 8 2024, ~1:20 AM - 1:45 AM JST (Asia/Tokyo)
Duration: 22:42 minutes
00:00.000 Search in daily workflows
01:24.200 Problems with editor search tools
03:58.233 Information retrieval
04:34.296 Search engine in Emacs: the index
06:21.757 Search engine in Emacs: Ranking
06:43.553 tf-idf: term-frequency x inverse-document-frequency
07:41.160 BM25
08:41.200 Searching with p-search
10:41.457 Flight AF 447
16:06.771 Modifying priors
20:40.405 Importance
21:38.560 Complement or inverse
Description
Search is an essential part of any digital work. Despite this importance, most tools don't go beyond simple string/regex matching. Oftentimes, a user knows more about what they're looking for: who authored the file, how often it's modified, as well as search terms that the user is only slightly confident exist.
p-search is a search engine designed to combine various pieces of prior knowledge about the search target and present them to the user in a systematic way. In this talk, I will present this package as well as go over the fundamentals of information retrieval.
Details:
In this talk, I will go over p-search. p-search is a search engine that assists users in finding things, with a focus on flexibility and customizability.
The talk will begin by going over concepts from the field of information retrieval such as indexing, querying, ranking, and evaluating. This will provide the necessary background to describe the workings of p-search.
Next, an overview of the p-search package and its features will be given. p-search utilizes a probabilistic framework to rank documents according to prior beliefs as to what the file is. So, for example, a user might know for sure that the file contains a particular string, might have a strong feeling that it should contain another word, and might only suspect that some other words appear in it. The user may know the file extension and the subdirectory, and may know that a particular person works on this file a lot. p-search allows the user to express all of these predicates at once, and ranks documents accordingly.
The talk will then progress to discuss assorted topics concerning the project, such as design considerations and future directions.
The aim of the talk is to expand the listeners' understanding of search as well as inspire creativity concerning the possibilities of search tools.
Code: https://github.com/zkry/p-search
Transcript
[00:00:00.000] Search in daily workflows
Hello, my name is Zachary Romero, and today I'll be going over p-search, a local search engine in Emacs. Search these days is everywhere in software, from text editors, to IDEs, to most online websites. These tools tend to fall into one of two categories. One is tools that run locally and work by matching strings against text. The most common example of this is grep. In Emacs, there are a lot of extensions which provide functionality on top of these tools, such as projectile-grep, deadgrep, or consult-ripgrep. Most editors have some sort of search-current-project feature. Most of the time, some of these tools have features like regular expressions, or you can specify a file extension, or a directory you want to search in, but the features are pretty limited. The other kind of search we use is usually hosted online, and it usually searches a vast corpus of data. These are usually proprietary online services such as Google, GitHub, or Sourcegraph for code.
[00:01:24.200] Problems with editor search tools
The kind of search feature that editors usually have has a lot of downsides. For one, a lot of times you don't know the exact search string you're searching for. Some complicated term like this high volume demand partner, you know, do you know if... Are some words abbreviated, is it capitalized, is it in kebab case, camel case, snake case? You often have to search all these variations. Another downside is that the search results returned contain a lot of noise. For example, you may get a lot of test files. If the tool hits your vendor directory, it may get a bunch of results from libraries you're using, most of which are not helpful. Another downside is that the order given is, well, there's no meaning to the order. It's usually just the order the tool happens to look in first. Another thing is, so when you're searching, you oftentimes have to keep the state of the searches in your head. For example, you try one search, you see the results, find the results you think are relevant, keep them in your head, run search number two, look through the results, kind of combine these different search results in your head until you get an idea of which ones might be relevant. Another thing is that the search primitives are fairly limited. So yeah, you can search with regular expressions, but you can't really define complex things like, I want to search files in this directory, and this directory, and this directory, except these subdirectories, and except test files, and I only want files with this file extension. Criteria like that are really hard to... I'm sure they're possible in tools like grep, but they're pretty hard to construct. And lastly, there's no notion of any relevance. All the results you get back, I mean, you don't know, is this result more relevant? Is it twice as relevant? Is it 100 times more relevant? These tools usually don't provide such information.
[00:03:58.233] Information retrieval
There's a field called information retrieval, and this deals with this exact problem. You have lots of data you're searching for. How do you construct a search query? How do you get results back fast? How do you rank which ones are most relevant? How do you evaluate your search system to see if it's getting better or worse? There's a lot of work, a lot of books written on the topic of information retrieval. If one wants to improve searching in Emacs, then drawing inspiration from this field is necessary.
[00:04:34.296] Search engine in Emacs: the index
The first aspect of information retrieval is the index. The reverse index is what search engines use to find results really fast. Essentially, it's a map of search term to locations where that term is located. You'll have all the terms or maybe even parts of the terms, and then you'll have all the locations where they're located. Any query could easily look up where things are located, join results together, and that's how they get the results to be really fast. For this project, I decided to forgo creating an index altogether. An index is pretty complicated to maintain because it always has to be in sync. Any time you open a file and save it, you would have to re-index, you would have to make sure that file is re-indexed properly. Then you have the whole issue of, well, if you're searching in Emacs, you have all these projects, this directory, that directory, how do you know which? Do you always have to keep them in sync? It's quite a hard task to handle that. Then on the other end, tools like ripgrep can search very fast. Even though they can't search maybe on the order of tens of thousands of repositories, for a local setting, they should be plenty fast enough. I benchmarked. Ripgrep, for example, is on the order of gigabytes per second. Definitely, it can search a few pretty big size repositories.
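To make the idea concrete, here is a toy sketch in Emacs Lisp of what such an index looks like: a table mapping each term to the documents and positions where it occurs. This is purely illustrative; as described above, p-search skips the index entirely and shells out to ripgrep at query time.

;; Toy inverted index: term -> list of (DOC-ID . WORD-POSITION).
(defun toy-index (docs)
  "Build a hash table mapping each term to where it occurs.
DOCS is an alist of (DOC-ID . TEXT)."
  (let ((index (make-hash-table :test #'equal)))
    (dolist (doc docs)
      (let ((pos 0))
        (dolist (term (split-string (downcase (cdr doc)) "[^[:alnum:]]+" t))
          (push (cons (car doc) pos) (gethash term index))
          (setq pos (1+ pos)))))
    index))

;; Example lookup:
;; (gethash "display"
;;          (toy-index '((a . "Redisplay of the frame")
;;                       (b . "display code lives here"))))
;; => ((b . 0))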
[00:06:21.757] Search engine in Emacs: Ranking
Next main task. We decided not to use an index, so the next task is: how do we rank search results? There are two main algorithms that are used these days. The first one is tf-idf, which stands for term frequency-inverse document frequency. Then there's BM25, which is sort of a modified tf-idf algorithm.
[00:06:43.553] tf-idf: term-frequency x inverse-document-frequency
tf-idf, without going into too much detail, essentially multiplies two terms. One is the term frequency, and then you multiply it by the inverse document frequency. The term frequency is a measure of how often that search term occurs. The inverse document frequency is a measure of how much information that term provides. If the term occurs a lot, then it gets a higher score in the term frequency section. But if it's a common word that exists in a lot of documents, then its inverse document frequency goes down. It kind of scores it less. You'll find that words like the, in, is, these really common words, since they occur everywhere, their inverse document frequency is essentially zero. They don't really count towards a score. But when you have rare words that only occur in a few documents, they're weighted a lot more. So the more those rare words occur, they boost the score higher.
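In its textbook form, the score is tf-idf(term, doc) = tf(term, doc) x log(N / df(term)), where N is the total number of documents and df(term) is how many documents contain the term. A minimal Emacs Lisp sketch of that formula, not necessarily p-search's exact implementation:

;; score = tf * log(N / df): frequent-but-common words contribute
;; almost nothing, rare words dominate the score.
(defun toy-tf-idf (tf df n-docs)
  "TF: occurrences of the term in this document.
DF: number of documents containing the term.
N-DOCS: total number of documents in the collection."
  (* tf (log (/ (float n-docs) df))))

;; A very common word like "the" contributes almost nothing:
;; (toy-tf-idf 12 9800 10000) ;; => ~0.24
;; A rare term appearing a few times dominates:
;; (toy-tf-idf 3 15 10000)    ;; => ~19.5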
[00:07:41.160] BM25
BM25 is a modification of this. It's essentially the previous formula, except it dampens out terms that occur very often. Imagine you have a bunch of documents. One has a term 10 times, one has that same term a hundred times, another a thousand times. You'll see the score tapers off as the number of occurrences increases. That prevents any one term from overpowering the score. This is the algorithm I ended up choosing for my implementation. So with a plan of using a command-line tool like ripgrep to get term occurrences, and then using a scoring algorithm like BM25 to rank the results, we can combine these together and create a simple search mechanism.
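For reference, here is a sketch of the standard BM25 scoring function in Emacs Lisp. The constants k1 and b are the usual tuning parameters, and dl/avgdl normalize for document length; p-search's actual implementation may differ in its details.

;; BM25: like tf-idf, but the term-frequency contribution saturates,
;; so no single term can overpower the score.
(defun toy-bm25 (tf df n-docs dl avgdl &optional k1 b)
  "TF, DF, N-DOCS as in tf-idf; DL is this document's length,
AVGDL the average document length in the collection."
  (let* ((k1 (or k1 1.2))
         (b  (or b 0.75))
         (idf (log (+ 1.0 (/ (+ (- n-docs df) 0.5) (+ df 0.5)))))
         (norm (* k1 (+ (- 1 b) (* b (/ (float dl) avgdl))))))
    (* idf (/ (* tf (+ k1 1)) (+ tf norm)))))

;; The score flattens out as occurrences grow:
;; (toy-bm25 10   20 10000 500 400) ;; => ~11.9
;; (toy-bm25 100  20 10000 500 400) ;; => ~13.4
;; (toy-bm25 1000 20 10000 500 400) ;; => ~13.6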
[00:08:41.200] Searching with p-search
Here we're in the directory for the Emacs source code. Let's say we want to search for the display code. We run the p-search command, starting the search engine. It opens up. We notice it has three sections: the candidate generators, the priors, and the search results. The candidate generators generate the search space we're looking in. These are all composable and you can add as many as you want. So with this, it specifies that here we're searching on the file system and we're searching in this directory. We're using the ripgrep tool to search with, and we want to make sure that we're searching only on files committed to Git. Here we see the search results. Notice that each has a final probability. Here, notice that they're all the same, and they're the same because we don't have any search criteria specified here. Suppose we want to search for display-related code. We add a query: display. So then it spins off the processes, gets the search term counts, and calculates the new scores. Notice here that the results that come up on top, at first glance, appear to be relevant to display. Remember, if we compare that to just running a raw ripgrep, notice here we're getting 53,000 results, and it's pretty hard to go through these results and make sense of them. So that's p-search in a nutshell.
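If you want to try the same flow locally, a minimal sketch, assuming Emacs 29+ and the repository linked at the end of this page; the exact keybindings inside the p-search buffer may differ between versions:

;; Install straight from the repository, then launch the command
;; used in the demo above:
;; (package-vc-install "https://github.com/zkry/p-search")
;; M-x p-search
;; Then, as in the demo: check the candidate generator that points at
;; your project directory, add a text query such as "display", and
;; watch the probabilities in the results section update as the
;; ripgrep processes finish.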
[00:10:41.457] Flight AF 447
Next, I wanted to talk about the story of Flight 447. Flight 447, going from Rio de Janeiro to Paris, crashed somewhere in the Atlantic Ocean on June 1st, 2009, killing everyone on board. Four search attempts were made to find the wreckage. None of them were successful, except for the finding of some debris and a dead body. It was decided that they really wanted to find the wreckage to retrieve data as to why the crash occurred. This occurred two years after the initial crash. With this next search attempt, they wanted to create a probability distribution of where the crash could be. The only piece of concrete data they had was a signal from the plane at 2:10 giving its GPS location as 2.98 degrees north, 30.59 degrees west. That was the only data they had to go off of. So they drew a circle around that point with a radius of 40 nautical miles. They assumed that anything outside the circle would have been impossible for the plane to reach. This was the starting point for creating the probability distribution of where the wreckage occurred. The only other pieces of data were the four failed search attempts and then some of the debris found. One thing they did decide was to look at similar crashes where control was lost, to analyze where those crashes landed compared to where the loss of control started. From this, a circular normal distribution was decided upon. Here you can see that the center has a much higher chance of containing the wreckage. As you go away from the center, the probability of finding the wreckage decreases a lot. The next thing they looked at was, well, they noticed they had retrieved some dead bodies from the wreckage. So they thought that they could calculate the backward drift on that particular day to find where the crash might've occurred. If they found bodies at a particular location, they can work backwards from that in order to find where the initial crash occurred. So here you can see the probability distribution based off of the backward drift model. Here you see the darker colors have a higher probability of being the location. So with all these pieces of data, with that circular 40-nautical-mile uniform distribution, with that circular normal distribution from comparing similar crashes, as well as with the backward drift, they were able to combine all three of these pieces in order to come up with a final prior distribution of where the wreckage occurred. So this is the final model they came upon. Here you can see it has that 40-nautical-mile-radius circle. It has that darker center, which indicates a higher probability because of the crash similarity. Then here you also see that along this line there is a slightly higher probability due to the backward drift distribution. So the next thing is, since they had performed searches, they decided to incorporate the data from those searches into their new distribution. Here you can see the places where they searched initially. If you think about it, you can assume that, well, if you search for something, there's a good chance you'll find it, but not necessarily. Anywhere they searched, the probability of finding it there is greatly reduced. It's not zero, because obviously you can look for something and miss it, but it reduces the probability that we would expect to find it in those already-searched locations. This is the posterior distribution, or the distribution after taking the observations made into account.
Here we can see kind of these cutouts of where the previous searches occurred. This is the final distribution they went off of to perform the subsequent search. In the end, the wreckage was found at a point close to the center here, thus validating this methodology.
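The update described here is the standard Bayesian search rule: if a cell had prior probability P of containing the wreckage, and a search of that cell would have detected it with probability D, then after a failed search the cell's probability drops to P(1 - D) / (1 - P*D). A small Emacs Lisp illustration of that rule, not the exact model used in the AF 447 effort:

;; Posterior probability for a cell after searching it and finding
;; nothing.  P is the prior for the cell, D the detection probability.
(defun toy-posterior-after-miss (p d)
  (/ (* p (- 1.0 d)) (- 1.0 (* p d))))

;; Searching a likely cell and missing lowers it substantially:
;; (toy-posterior-after-miss 0.30 0.80) ;; => ~0.079
;; Unsearched cells keep their prior, and gain relative weight once
;; all the cells are renormalized.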
[00:16:06.771] Modifying priors
We can see the power of this Bayesian search methodology in the way that we could take information from all the sources we had. We could draw analogies to similar situations. We can quantify these, combine them into a model, and then also update our model according to each observation we make. I think there's a lot of similarities to be drawn with searching on a computer, in the sense that when we search for something, there's oftentimes a story we kind of have as to what search terms exist and where we expect to find the file. For example, if you're implementing a new feature, you'll often have some search terms in mind that you think will be relevant. Some search terms, you might think they have a possibility of being relevant, but maybe you're not sure. There's some directories where you know that they're not relevant. There's other criteria like, well, you know that maybe somebody in particular worked on this code. What if you could incorporate that information? Like, I know this author, he's always working on this feature. What if I just give the files that this person works on a higher probability than ones he doesn't work on? Or maybe you think about how often a file is committed to. You think that maybe the number of commits it receives should change your probability of this file being relevant. That's where p-search comes in. Its aim is to be a framework for incorporating all these different sorts of prior information into your searching process. You're able to say things like, I want files authored by this user to be given higher probability. I want this author to be given a lower priority: I know this author never works on this code, so if he has a commit, then lower its probability. Or you can specify specific paths, or you can specify multiple search terms, weighting different ones according to how relevant you think those terms should be. So with p-search, we're able to incorporate information from multiple sources. Here, for example, we have a prior of type git author, and we're looking for all of the files that are committed to by Lars. So the more commits he has, the higher the probability given to that file. Suppose there's a feature I know he worked on, but I don't know the file or necessarily even key terms of it. Well, with this, I can incorporate that information. So let's search again. Let's add display. Let's see what responses we get back here. We can add as many of these criteria as we want. We can even specify that the file name should be of a certain type. Let's say we're only concerned with C files. We add that the file name should contain .c. With this, now we notice that all of the C files containing display authored by Lars should be given higher probability. We can continue to add these priors as we see fit. The workflow that I found helps when searching is that you'll add criteria, you'll see some good results come up and some bad results come up. So you'll often find a pattern in those bad results, like, oh, I don't want test files, or this directory isn't relevant, or something like that. Then you can update your prior distribution, adding those criteria, and then rerun it, and it will compute different probabilities for the files. So in the end, you'll have a list of results that's tailor-made to the thing you're searching for.
[00:20:40.405] Importance
There's a couple of other features I want to go through. One thing is that for each of these priors, you can specify the importance. In other words, how important is this particular piece of information to your search? So here, everything is of importance medium. But let's say I really care about something having the word display in it. I'm going to change its importance. Instead of medium, I'll change its importance to high. What that does, essentially, is that files that don't have display in them are given a much bigger penalty, and files with the word display in them are rated much higher. With this, we're able to fine-tune the results that we get.
[00:21:38.560] Complement or inverse
Another thing you can do is add the complement, or the inverse, of certain queries. Let's say you want to search for display, but you don't want it to contain the word frame. With the complement option on, when we create this search prior, now it's going to be searching for frame, but instead of increasing the search score, it's going to decrease it if it contains the word frame. So here, things related to frame are kind of deprioritized. We can also say that we really don't want the search to contain the word frame by increasing its importance. So with all these composable pieces, we can create a search that's tailor-made to our needs. That concludes this talk. There's a lot more I could talk about with regards to this research, so definitely follow the project if you're interested. Thanks for watching, and I hope you enjoy the rest of the conference.
Captioner: sachac
Questions or comments? Please e-mail zacromero@posteo.com