Emacs regex compilation and future directions for expressive pattern matching

Danny McClanahan (they/them) - Pronunciation: məˈklænəˌhæn / mk-CLAN-uh-han, han like "hand", IRC: cosmicexplorer, @hipsterelectron@circumstances.run on mastodon - @hipsterelectron on twitter - @cosmicexplorer on github - @cosmicexplorer in #emacsconf on irc.libera.chat, dmcC2@hypnicjerk.ai

Format: 25-min talk ; Q&A: IRC
Status: TO_FOLLOW_UP

Duration: 24:56 minutes

Description

Emacs is a delightful case study for the capabilities of regular expressions, because Emacs forms user interfaces via text, which retains the expressivity of a GUI with the user-level interactivity of written language. Because we use text for both input and output, regexps in Emacs form part of a user-level grammar for human thought. As a result, Emacs and Emacs users have a rich intuitive grasp of regular expressions, which provides a unique vantage point to consider how they may be improved in general.

When I began my investigation, I assumed that Emacs would be able to use an existing off-the-shelf regex engine, that this would be more performant than regex-emacs.c, and that the greatest challenge would be providing a sufficiently robust build process (see emacs-devel: https://lists.gnu.org/archive/html/emacs-devel/2024-04/msg00142.html). However, I quickly found that Emacs (as usual) is far more configurable than alternatives (see rust regex discussion: https://github.com/rust-lang/regex/discussions/1167#discussioncomment-8585027). Now don't get this twisted: emacs-devel was open to deprecating functionality that hampered optimization! But the biggest challenge by far is that regex-emacs.c is categorically more powerful than alternatives: it can match against non-contiguous input (across both halves of the gap buffer), as well as non-UTF8 text with its fantastic multibyte encoding (see https://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html).

So a more complex picture begins to emerge: Emacs actually uses regexps far more widely and deeply than anywhere else, and its regex engine requirements aren't "legacy", but the result of caring more deeply about language than anywhere else. While regex engines have historically been known to introduce functionality not backed by formal theory that's later found to be hard to optimize, Emacs instead charts a path for other engines to follow. Formalizing backrefs is state-of-the-art, but possible (see https://jamiejennings.com/posts/2021-09-23-dont-look-back-2/), and I believe the same can be achieved for the other affordances Emacs users have come to expect. Subsequently, I have focused on identifying where we can constrain the problem space to improve performance without losing those affordances, such as explicit precompilation in lisp code (see https://lists.gnu.org/archive/html/emacs-devel/2024-08/msg00108.html).

There are many branching paths here. With the libgccjit native compiler, we can now implement regex matching in lisp itself. While `rx' can compose patterns to an extent, we could provide a more powerful primitive than regular expressions alone for complex parsing tasks. And while many regex engines employ complex optimization heuristics, we can instead introduce specific functionality for e.g. SIMD literal search into lisp code, allowing lisp users to intelligently select for themselves how and when to employ less-powerful but more-performant search routines.

We don't need to backtrack! We can try all these paths at once.

About the speaker:

Me: Typing free software to break the shoulders of giants from golden handcuffs. Work: Composeable build tools, parsing frameworks, and cryptographic messaging. All of these are misused to subjugate and oppress, but I build infrastructure for positive liberties.

This talk will cover my train of thought over the course of this year on how regex engines in general may be improved, and the discussions with emacs-devel that have helped me along. I hope this talk will convince people of the boundless future directions in text search. My PhD research will be inspired by the expressivity and power of Emacs.

Discussion

Questions and answers

  • Q: A bit off topic, but how did you get the emoji into your slides? I\'m assuming you exported via Beamer to PDF. Thank you very much for the swift answer. Great presentation, too🙏🏻

Notes

  • i have a 50-minute version of this talk which i will be posting somewhere on my page https://hypnicjerk.ai after the conference!
    • oh good! I wish the last talk I attended with this many slides could have done that (Florian Weimer's traditional future directions for glibc talk at the GNU Tools Cauldron: every year he gets through a third of it and puts the rest on the schedule for next year!)
  • great, the slides are now available at https://media.emacsconf.org/2024/emacsconf-2024-regex--emacs-regex-compilation-and-future-directions-for-expressive-pattern-matching--danny-mcclanahan--slides.pdf and from the talk page
  • i was not able to add subtitles in time for the conference, so please please ask questions here or on irc during the talk (even just asking for what i just said) and i will do my best to answer all of them!
  • Something you might be interested in Rak a lesser known grep alternative dosent seem to have a emacs frontend though
  • followup on emacs-devel with NullNix's suggestion to make the cache buffer-local: https://lists.gnu.org/archive/html/emacs-devel/2024-12/msg00299.html

  • I think having an LLM do this is just perfect! all the people asking for it want is comforting lies anyway, and LLMs are really good at those!

    • LLM's can be run locally. for example using localai
    • cosmicexplorer: yes! but the weights come from somewhere! they come from training in cloud services!
    • Running locally is not the same as reproduce it localy... I guess...
    • It is like having a proprietary binary blob running on the linux kernel.
    • cosmicexplorer: that is true. it does get a bit iffy when running open source models trained on remote services when using localai.io
    • cosmicexplorer: inflicting a hidden dependency on my users :( :(
  • on other things you should never do, that AI adjustment of the speaker image is really annoying

    • I have literally blanked off that part of my screen with a piece of paper so I don't have to see it, sorry
    • excellent talk though!! wish it was twice as long
    • cosmicexplorer: yes :( thanks for feedback. next time i won't be so embarrassed with my bed
    • cosmicexplorer: i captured this live in obs with the filtering so i don't even have the video stream without it
    • I recently told somebody about Nvidia brodcast studio for the good green screen removel which annoyed me becouse it is 1 not open source. 2 I use amd and can't use it nor is it multiplatform 3 I use linux and don't know if you can run if from linux :( anybody know of a better solution?
  • ohhh I never realised the reason the match data isn't reified was so tied up with the implementation. not too surprising in hindsight, thats the emacs way :)

  • I would recommend having the regex cache be in a buffer-local variable. most of the speedups, works everywhere maybe?
    • cosmicexplorer: SMART!!!!!!
    • cosmicexplorer: could also then explicitly have a cache busting API
  • Q: What about tree-sitter? Is it better? Does it uses regexps?
    • cosmicexplorer: basically: yes tree-sitter solves this, but no it does not use regexps
    • cosmicexplorer: so it's really much more applicable for well-specified programming language definitions
    • cosmicexplorer: but it means we can let tree-sitter solve problems we don't want to ourselves
    • cosmicexplorer: they depend on the current syntax table which is buffer-local, and on case-folding from the current buffer
      • so only buffer-local, so a buffer-local cache should be the right level then
      • just making sure it wasn't anything finer-grained than that :)
      • cosmicexplorer: yes! and also they very very rarely change
      • hm, you could probably share many buffer-local caches with identical values for syntax tables, case folding etc even :)
      • more complex though, and likely marginal gains
  • worth making sure regexes can't depend on things like overlays, but I don't see how they could or a match over a buffer might require recompilation in the middle of a match, which is overkill even for emacs :P
  • using orderless and consult with consult ripgrep means you have the performant part outside of emacs and powerful emacs facilites don't need to be as performant while still being fast
    • cosmicexplorer: looking up orderless and consult now :)
  • I was meaning with the consult-ripgrep command from the consult package
  • wonders about combining this with p-search from yesterday... :)

  • This is brilliant

  • Great talk!
  • That was great, thank you, your enthusiasm is infectious!
  • That was a great talk, thanks!

Transcript

Hello, I'm Danny McClanahan. This is EmacsConf 2024. And this presentation is ostensibly about Emacs Regex compilation. But it'll lead a lot more in future directions. Thanks for coming on this journey with me. This presentation is 50 slides, 50 footnotes, and that's intended for it to be a resource later on for your perusal. We are unfortunately not going to be able to go into all of it, but I will try to be within 20 minutes so we can make it throughout Q&A. This is the structure of the talk. But enough about me. Who are you? And why are you here? I'm Danny McClanahan. My experience is a lot in build tools, especially in the package managers. That started because I realized I was wasting a lot of time. Then I didn't like that. I started wasting a lot of time, trying to avoid wasting time. Then I ended up... going so far around that I ended up stopping other people from wasting their own time, in this case, regarding failing builds. But this is a kind of pattern that you'll see. I'm talking a lot about patterns in this presentation. Parsing in text is another one of those tendencies that I have. Why am I here? I've got a lot of feelings about text. For the next 20 minutes, I'm making it your problem. First off, a huge shout out to Emacs Devel and the Emacs community in general. I spent a lot of time learning about what I'm about to talk about. I was definitely super confused at first. Then when I became less confused and I decided I was going to look at the regular expressions of the Regex engine, I was like, oh, it's old C code. It's Emacs. We can just use modern techniques. Turns out that's wrong for kind of two reasons. One, because using modern techniques or other engines don't necessarily do what Emacs regex engine currently does. Then secondarily, that's not actually as interesting as the other kind of larger goals that emacs-devel discussed. Thank you, Eli Zaretskii, so, so much, especially Pip Cet and everyone else as well--I believe--Pip Cet, I hope I'm pronouncing that correctly. Thank you so much. I'll be shouting you out later as well. Then these larger goals ended up overlapping a lot with my own research interests. And that's very exciting. I'm hoping it's exciting for you too. What is a regular expression? And when and how does implementation match formal theory? So what does formal theory mean? And we'll talk about that. What is a regular expression? So I might ask you this question, and you might give an answer. Then I might ask someone else, and they might have an answer. Then I might ask myself, and I might try to think of an answer. Our answers would, you know, see, the thing is, they'd all be correct, but they'd probably be slightly different, and they'd be different in kind of important ways. I'm using formal theory to kind of describe what unifies these interpretations and what causes this sort of divergence, both over time and then across code bases. I'm kind of putting a flag in the ground here and saying formal theory is actually a really, really negative influence, I think, but it can be better. That's what I'm going to talk about in this talk, in this presentation. We might ask, how did this happen? and we might try to find a start state. We might put that place at the theories of formal languages that kind of arose, especially post Turing and post Chomsky. Especially there was this really, really interesting and powerful relationship with formal languages between representation and computation. And then on top of that, we have regex as this really powerful union of theory and practice And then, like I mentioned, this is kind of divergence that kind of occurs. This divergence happens for a good reason. This happens because people were adding implementations and people adding features to implementations. While the people adding these features were often academics, they were industries, people that were hobbyists, they were interested in building practical tools. This is a good thing. This is still a good thing, even though it moves a little bit away from formal theory. But we start seeing some cracks developing, and we'll go into that in a second. We're just going to kind of electric slide into the 1980s here, and we're going to be confronted with two occurrences very similarly. We might call it simultaneous discovery. In 1983, you have Michael Jackson demonstrating the moonwalk. Three years later, we have backtracking developed to stimulate EGREP-style regular expressions. These would both be incredibly influential in their own kind of branching paths. Here's where the gloves come off. Formal theory, I claim, remains largely concerned with incremental improvements to artificial benchmarks, and much less with expanding models to cover actual user needs. This isn't just about, oh, if you listened to users, that you'd be a nicer person, you'd be a better engineer. What I'm actually saying is that they're missing out. When you don't listen to applications, you miss out on a lot of fantastic opportunities for novel theory. So this is, again, my complaint with formal theory as it stands. But we're gonna do better. Before we get better, we're gonna get, a little bit worse for a bit. We're going to actually get a little bit worse is better. What I mean by that is, by the 1990s, we start looking into these non-backtracking engines. This is a bit of a reaction to backtracking. The current ones include RE2, hyperscan, and the rust regex library. These are all great. I'll talk about them later as well. They make use of these. They kind of call back to the earlier formal theory. They have linear runtimes for well-specified search tasks. What happens if that doesn't fit your needs? We're going to talk about that. We're going to table that for a second, and we're going to focus more on Emacs, the subject of this conference. What are regex used for? And in this particular case, they're used for lots of things, with practically, and I think they should be. But more specifically, how do Emacs users use them? And I'm going to focus in on this text as input and output. I'll be kind of elaborating on this analogy as we continue. Why is text powerful? Text as I/O. The reason text programming languages and not just programming languages, but languages themselves, the reason why they're successful and why they propagate, I claim, is because text is both input readable and output writable. What this means is that if you receive something in text, you can read it, And then you can also write it, you can modify it, and you can produce a new version of it. You're on a kind of level playing field. That's not always the case, though. You recall that I've worked a lot with build systems and package managers. There's a discussion that goes by the name of software supply chain security. I think it's a massive joke. The reason why is because people largely raise it to explain why their for-profit company with their for-profit product is going to solve the problem for you, as opposed to the commons of open source. If you are unable to modify or deploy your code without employing an opaque external system, I think, then you have a hidden dependency. you don't remove a dependency, you just, by, for example, paying into a for-profit product or using a closed-off supply chain, you end up just having a hidden dependency, you end up just displacing that. This can actually exert arbitrary control over your programming output and potentially even your thoughts. This is really important. I'm going to dive in a little bit deeper and I'm going to overload the term locality here. I'm going to say, if you cannot reproduce a system locally, it becomes an opaque external system. I'm going to give examples here, and these are going to be a bit of a hot take. First off, GUI IDEs. I think we might, well, some of us might agree with that here. I say development environments that only allow you to use a graphical interface, do not expose interaction with text, are explicitly trying to kind of place you on a separate kind of plane where you're not an equal contributor to the people who make the development environment, make the development kind of frameworks here. We'll go one further. Cloud services are precisely, you know, they're useful for things that, you know, that require large domain computation, but, you know, Twitter, for example, didn't actually ever use any cloud services, external ones, because it was really important for them to actually own their own hardware, their own computation, their own thinking. Cloud services are a way to ensure that you're unable to reproduce a system without paying an amount per month, an amount per day, an amount per second, an amount per cycle to an external entity. I'm just going to conclude this with, I'd say, the argumentum ad absurdum, here, where large language models are all of these at once. They are a cloud service, specifically, and this is what makes them very evil, to make it so that, similar to GUI IDEs, so that text itself loses that ability to be both readable and writable. Instead, text is both unreadable, because it's produced by a machine, and then also unwritable, because you're subservient and subjugated to the machine, to the large language model to produce the code in the first place. You lose this input, output, readable, writable behavior that I claim text has specifically. To underline this, what is text? Text is local. Finally, we're at the subject of this conference. Emacs, I have double hearts with text. I start off the slide saying Emacs is a text editor. I think that's a good start. Which implements much of its own logic and user interface via text. What this means is that, you know, I say without trying, Emacs tries very hard, but without trying so hard, Emacs, is imbued with all of the capabilities that text has specifically. When you use text like Emacs does, and particularly you then start offering mechanisms to query, to transform, and to generally metaprogram text itself, you don't just have the ability to edit code in new ways. And this is something that I think is often lost, maybe not by participants of this conference, you particularly start being able to not only just edit code differently, but to change the way that you think about code and actually to expand your range of thought, the range of actions that you can perform. You can actually start then editing at the speed of thought. This is where especially Regex kind of comes into play. Finally, we get to the subject of the title of this talk. I'm about to disappoint a lot of people. I claim for good reason. Unfortunately, it's a very brief walkthrough, but I'm going to go over what the current Emacs Redix engine is. This is going to give us enough context for the next section on future directions. Quickly, it's a backtracking engine over a multi-byte code point. I'll define what that means. It's in regex-emacs.c. It's invoked in two ways, which you'll see is actually the same way, over a single contiguous string input. This is a Lisp string that you pass in. or over the two halves of the gap buffer. This is when you match against a buffer text. We'll go into that a little bit more, but this is one of the really actually interesting and specific things about Emacs Regex Engine as it stands. So very, very quickly, this is the data layout. This is just, if you're interested, this is where the code lies. So regex-emacs.h has re-pattern buffer, which is a struct Actually, you know, I love, by the way, I love the Emacs C source code. It's so nice to read. It made all this so, so easy. I really appreciated it. In this particular case, I'm just going to focus on re-pattern buffer actually has the compiler. It's a C struct. It has every single thing that is needed to execute the regular expression against a string input or against a buffer input. This buffer, It's not an Emacs buffer. It refers to just the instruction table and the match loop. Again, this is very, very brief, but I want to specifically focus on the first part. So this is this inner matching loop, and there's a prologue, and then there's a loop body, and there's an epilogue. And the prologue is the really, really interesting part. I say extract current and next char. What Emacs does here, it doesn't just reach for the next byte. It actually will perform lazily in some sense, this variable integer size VAR decoding for multi-byte, and it'll actually then decode the next one to four bytes. Up to 32 bits at once, and then it'll actually go into the loop. We'll talk about the implications of that later. Next, in the body of the loop, we read the instruction from the instruction pointer, which is, again, in that buffer field. Then we have this big switch statement, which is actually, love a big switch statement, super easy to read, super easy to understand kind of what's occurring. Then that's the loop body. And then at the end of it, we either increment the instruction pointer if it was matching a single character or something along those lines, or if it was a jump, we don't do that. A jump, however, it's not referring to a jump in the sense of a go-to, but a jump that's elsewhere within that table, that buffer field. If you've included a capture, we write that end position there. Of course, well, as you may recall, the zeroth capture is, of course, the entire match string. If the capture is zero, then we know we've actually completed that match. That's really great. I would love to receive Q&A about this as well. I've spent a lot of time kind of learning and understanding it. But it's really interesting that this can be described in a single slide because it's really simple. That simplicity is actually a really powerful thing. I'll mention that in the next section. I say, is that all? And I apologize for not doing so. But please, please ask questions in Q&A or message me about this, because I think it's really, really, again, interesting. Again, I find the code relatively easy to read. Now, here's, I think this is actually the point of the talk. The rest of it was, you know, I think just me posturing. This is the really, really interesting part. This is the ways that we can improve, well, not just we can improve stuff in Emacs, but why those are the right things to improve. Then also how that can be a model for even things outside of Emacs. This is gonna be a lot of text. I'm not gonna go through all of it. This is the one thing that I tried. This is the thing that I thought would be a slam dunk, easy solution. My initial thought process was, well, We tried very hard to do an LRU cache here. It works. It's actually very effective. However, though, we don't actually give the user, the list programmer, the ability to then say, I know that this regex is something that is going to be used again. I made an artificial benchmark. I made an artificial benchmark because I wanted to show there is one very specific case that it does solve, but it's the same issue with the artificial benchmarks. mentioned earlier. It's very specifically crafted in order to show that this particular solution would produce some speedup. What this means is it just creates more than 20 regexps in a row. It compiles them. Then, of course, because we just don't pay the compile costs, because we don't go through that cache eviction process, it ends up being faster. But this isn't really mean very much, particularly the goal here, you know, the goal would have been to show that the compile cache is actually causing the performance issue in comparison to pre-compiling it. That's not something I've been able to show. Match over bytes, not cars. So this is when I said at the beginning, oh, I came in and I think, oh, we can just use modern regex engine techniques. This is really what I meant. In particular, I mentioned in this match loop here that there's this, prolog that does this varring decoding. What this means is that every single iteration of that loop is going to be interspersed with this not being able to read a fixed number of bytes, but a variable number of bytes just depending upon the Unicode character or the Unicode code point or the multibyte code point. So this ends up, again, being relatively difficult to optimize because processors operate over bytes and not over code points. Yes, we might consider a multi-byte CPU at some point. But this is a really, really simple thing. It's just generating automata that operate over bytes as opposed to code points. This kind of goes into the much more abstract one. There's a lot of text here, and we're not going to go into it. But the really, really important point that I'm specifically mentioning here is this explicit control over linguistic complexity. That's the abstract kind of point. I want to introduce the inputs and the outputs. Basically, when you perform a search, or a match, or a parse, those are different tasks. They'll have different expected inputs and different desired outputs. Right now, Emacs, the API for the regular expression engine and for matching, It doesn't allow specialization on this. Or rather, if we do specialize on particular inputs, if we have a heuristic to check if a regex is actually a literal string, that's not something that the user actually has control over. For example, you can make a mistake escaping something, and then you don't have a literal, and then you accidentally have behavior that you totally didn't expect. Not just correctness issues, but also performance issues. I really like this one. I like this a lot, because I didn't think of it at all. I think it's better than in all of my ideas. This was proposed, at least to me, by Pip Cet, and I really hope that I'm pronouncing your name correctly. I'm sorry I didn't ask you beforehand, emacs-devel. In particular, this was after a couple of responses where I was trying to say, oh, I want to give the list programmer, way back in here, I want to give the list programmer the ability to control compilation in some sense. you know, he mentioned, I think he is correct, you know, there's no real introspection. That happens because it's written in C. I was thinking, oh, if I turn this into a list object that gives the list programmer the power and the ability to do more with that, but it doesn't actually because it's still in C. At first, I was thinking, oh, we can make the C part more flexible. But actually, especially if we want to do almost any of the things we previously mentioned, I think basically that this is... I think that if I'm not going to do it, somebody else really should do it, and I think we should maybe even do it together, because I think this is really, I think, how we can start experimenting, and not just experimenting, but because, as mentioned here, we have libgccjit, we have the native compiler, we have the ability to opt, like, specifically to generate specific code for this, so why not implement the or a Redix engine itself in list, And this gives us the ability to introspect it. That's one of the things I mentioned at the beginning. But it actually gives us the ability to then actually look at all the previous implementations, to explicitly compile beforehand, to match against bytes, to specialize and dispatch based upon input and output. This is something that, you know, it's super simple. It's really smart. I'm really, really glad that Pip mentioned this because it is, I think, the right way to solve the rest of it. We're at the final section. I talked a lot about, you know, kind of abstract, you know, thoughts. I talked a little about, you know, specific solutions. But I especially talked about, you know, what is Regex and Emacs? And I don't know if I had a lot of specific examples of it. I'm going to just describe kind of my, I guess, motivation, my impetus. Then I think something that's really something to chew on for the future. Do I have any concrete examples? Yes. Well, you can decide if they're concrete. Or am I just posturing? Also, yes. helm, rg. Helm, Erg, it's literally just M-x grep, it uses ripgrep, which is written by the same author of the Rust regex [??]. It happens to be very, very fast. In particular, I use this tool with ripgrep on the Twitter monorepo, and I was able to search very, very large amounts of code that was on my local machine using regular expressions. I think this is one thing that I think is really, really important, because when you want to scale, People say the word scaling and they assume there's a specific kind of answer for that. I've just found that text is not only flexible, it's actually something that can be more performant than the alternative and not only more performant, but more productive. It's again, it's just M-x grep using ripgrep. There's a tool deadgrep by Wilfred Hughes, which is also fantastic. I think it's actually better than this, but this one's mine so I can mess around with it. But this tool is kind of why, especially I started looking into Emacs and looking into changing the way that, or at least diving into how the regular expression matching actually kind of works, both in Emacs and then in ripgrep. We'll go to the next one. This is something that does exist and continues to exist. This is something that doesn't quite exist yet. I'm calling it telepathy grams. It's, you know, it's the name, and it's very, you know, it doesn't work, but it's a code search tool that, in this case, precompiles the database to execute NFAs against. I was thinking, how can I beat And the first thing I thought is, well, as I have worked on build tools, especially in monorepos, one of the things that the pants build tool from Twitter does is it uses a file watcher to ensure that instead of having to constantly read in the entire contents of a file, which may be very, very large, it only does so when the file has been changed. Finally, I want to conclude on this note, which is just that the stuff I didn't learn from emacs devel, I learned from Paul Wankadia, Jr., who is the RE2 maintainer, and he taught me quite a lot from 2023 to 2024. I'm thankful for the time that I learned from you, so thank you, Paul. With that, we're at point-max. Call me, beat me, if you want to reach me and or hire me. These are places that you can reach me at. There are probably others. Feel free to suggest other ways to contact me. But for now, this is the end. Thank you so much for your time. I really appreciate it.

Captioner: sachac

Questions or comments? Please e-mail dmcC2@hypnicjerk.ai