Emacs regex compilation and future directions for expressive pattern matching

Danny McClanahan (they/them) - Pronunciation: məˈklænəˌhæn / mk-CLAN-uh-han, han like "hand", IRC: cosmicexplorer, @hipsterelectron@circumstances.run on mastodon - @hipsterelectron on twitter - @cosmicexplorer on github - @cosmicexplorer in #emacsconf on irc.libera.chat, dmcC2@hypnicjerk.ai

The following image shows where the talk is in the schedule for Sun 2024-12-08. Solid lines show talks with Q&A via BigBlueButton. Dashed lines show talks with Q&A via IRC or Etherpad.

Format: 20-min talk ; Q&A: IRC https://chat.emacsconf.org/?join=emacsconf,emacsconf-gen Etherpad: https://pad.emacsconf.org/2024-regex
Etherpad: https://pad.emacsconf.org/2024-regex
Discuss on IRC: #emacsconf-gen
Status: Waiting for video from speaker

Times in different time zones:
Sunday, Dec 8 2024, ~9:30 AM - 9:50 AM EST (US/Eastern)
which is the same as:
Sunday, Dec 8 2024, ~8:30 AM - 8:50 AM CST (US/Central)
Sunday, Dec 8 2024, ~7:30 AM - 7:50 AM MST (US/Mountain)
Sunday, Dec 8 2024, ~6:30 AM - 6:50 AM PST (US/Pacific)
Sunday, Dec 8 2024, ~2:30 PM - 2:50 PM UTC
Sunday, Dec 8 2024, ~3:30 PM - 3:50 PM CET (Europe/Paris)
Sunday, Dec 8 2024, ~4:30 PM - 4:50 PM EET (Europe/Athens)
Sunday, Dec 8 2024, ~8:00 PM - 8:20 PM IST (Asia/Kolkata)
Sunday, Dec 8 2024, ~10:30 PM - 10:50 PM +08 (Asia/Singapore)
Sunday, Dec 8 2024, ~11:30 PM - 11:50 PM JST (Asia/Tokyo)
Find out how to watch and participate

Description

Emacs is a delightful case study for the capabilities of regular expressions, because Emacs forms user interfaces via text, which retains the expressivity of a GUI with the user-level interactivity of written language. Because we use text for both input and output, regexps in Emacs form part of a user-level grammar for human thought. As a result, Emacs and Emacs users have a rich intuitive grasp of regular expressions, which provides a unique vantage point to consider how they may be improved in general.

When I began my investigation, I assumed that Emacs would be able to use an existing off-the-shelf regex engine, that this would be more performant than regex-emacs.c, and that the greatest challenge would be providing a sufficiently robust build process (see emacs-devel: https://lists.gnu.org/archive/html/emacs-devel/2024-04/msg00142.html). However, I quickly found that Emacs (as usual) is far more configurable than alternatives (see rust regex discussion: https://github.com/rust-lang/regex/discussions/1167#discussioncomment-8585027). Now don't get this twisted: emacs-devel was open to deprecating functionality that hampered optimization! But the biggest challenge by far is that regex-emacs.c is categorically more powerful than alternatives: it can match against non-contiguous input (across both halves of the gap buffer), as well as non-UTF8 text with its fantastic multibyte encoding (see https://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html).

So a more complex picture begins to emerge: Emacs actually uses regexps far more widely and deeply than anywhere else, and its regex engine requirements aren't "legacy", but the result of caring more deeply about language than anywhere else. While regex engines have historically been known to introduce functionality not backed by formal theory that's later found to be hard to optimize, Emacs instead charts a path for other engines to follow. Formalizing backrefs is state-of-the-art, but possible (see https://jamiejennings.com/posts/2021-09-23-dont-look-back-2/), and I believe the same can be achieved for the other affordances Emacs users have come to expect. Subsequently, I have focused on identifying where we can constrain the problem space to improve performance without losing those affordances, such as explicit precompilation in lisp code (see https://lists.gnu.org/archive/html/emacs-devel/2024-08/msg00108.html).

There are many branching paths here. With the libgccjit native compiler, we can now implement regex matching in lisp itself. While `rx' can compose patterns to an extent, we could provide a more powerful primitive than regular expressions alone for complex parsing tasks. And while many regex engines employ complex optimization heuristics, we can instead introduce specific functionality for e.g. SIMD literal search into lisp code, allowing lisp users to intelligently select for themselves how and when to employ less-powerful but more-performant search routines.

We don't need to backtrack! We can try all these paths at once.

About the speaker:

Me: Typing free software to break the shoulders of giants from golden handcuffs. Work: Composeable build tools, parsing frameworks, and cryptographic messaging. All of these are misused to subjugate and oppress, but I build infrastructure for positive liberties.

This talk will cover my train of thought over the course of this year on how regex engines in general may be improved, and the discussions with emacs-devel that have helped me along. I hope this talk will convince people of the boundless future directions in text search. My PhD research will be inspired by the expressivity and power of Emacs.

Questions or comments? Please e-mail dmcC2@hypnicjerk.ai