Back to the schedule
Previous: Opening remarks day 2
Next: Tree-edit: Structural editing for Java, Python, C, and beyond!

Optimizing Emacs Lisp Code

Dmitry Gutov

Q&A: live
Duration: 35:35

If you have questions and the speaker has not indicated public contact information on this page, please feel free to e-mail us at emacsconf-submit@gnu.org and we'll forward your question to the speaker.

Talk

  • 00:01 Introduction
  • 02:36 Emacs Lisp is a little old
  • 04:19 Benchmark then optimize, not vice versa
  • 05:03 profiler-start
  • 09:31 elp - Emacs Lisp Profiler
  • 13:01 benchmark
  • 19:13 Write less code
  • 20:00 Reduce allocations
  • 22:52 Recent optimizations in Xref
  • 30:52 cl-lib, dash, and seq

Q&A

  • 00:00 Why are overlays slow compared to text properties?
  • 03:21 Would these optimizations be helpful in a personal init.el?
  • 04:28 What's a good approach for benchmarking destructive operations?
  • 06:06 Do you recommend avoiding cl-defstruct in favour of "pure" lists/vectors?
  • 08:20 Possible to optimize Emacs packages with code compiled from other languages?
  • 10:26 (Q&A logistics)
  • 12:25 What about text properties vs. buffer-local variables to store position cache?
  • 18:40 Followup on earlier cl-defstruct benchmark discussion RE: AVL trees
  • 30:48 Does cl-defstruct have memory overhead/general memory usage discussion

Description

Name pronunciation: d-MEET-ri GOO-tov
Pronouns: he/his
Homepage: https://github.com/dgutov/
Preferred contact info: dgutov@yandex.ru
  • Before optimizing, benchmark first.
  • Different benchmarking approaches.
  • Live evaluation, step-debugging, measuring from a debugger breakpoint.
  • How to determine if a function is expensive. How to pick one from competing alternatives (cl-lib, seq, dash, lean core).
  • Print-benchmarking.
  • Byte-compiled code can give a very different picture, changing where the bottleneck is. How to quickly load a byte-compiled version.
  • Steps taken to speed up the Xref package recently.

Discussion

IRC nick: dgutov

Pad:

  • Q1: Why are overlays slow compared to text-properties? I (believe to) understand that it is (in part only?) due to "get element n in vector vs list". If so, then why don't we change that? There could be a text-property called "overlays", so that lookup would also be like in a vector. What benefits does the data structure currently used for overlays have that make that undesirable? Would a mixed approach make sense, i.e. allow non-modifying lookups to use the "cached" overlays that are stored in the "overlay" text-property, and make text-inserting and overlay-moving actions store in the currently used data structure as well as in the indirect text-property=>overlay cache?
  • Q2: As a non-programmer, would these sorts of optimizations be helpful to do on a personal init.el file?
    • A: Probably not
    • Though too much mode-line customisation may slow things down.
  • Q3: What's a good approach for benchmarking destructive operations?  If you delete elements from a list in-place, all subsequent runs will be artificially fast.
    • A: There is an example of a comparison between operations from different libraries in the example file provided by the talk. Particularly check the benchmarks for delete and remove operations (destructive and non-destructive, respectively). A sketch of that pattern follows after this list.
  • Q4: cl-lib constructors, getters, and setters usually expand into multiple levels of let-bindings. AFAIU, every let-binding is an extra memory allocation. Do you recommend avoiding cl-defstruct in favour of "pure" lists/vectors?
    • A: Basically, no. If cl-defstruct allows you to organise your code better, go ahead and use it.
  • Q5: Is it possible to optimize some Emacs packages by making use of code compiled from other languages (like C or Common Lisp), i.e. in the same way Python is able to import C code?
    • A: Yes, Emacs modules allow you to run C or Rust code. Transitioning between Emacs proper and the module (passing the data) might slow things down, though, because of the copying that's necessary to avoid issues with the garbage collector.
  • Q6: You mentioned that overlays are much slower compared to text properties. What about text properties vs. buffer-local variables to store a position cache?
    • A: I haven't measured it, but offhand I'm going to guess that buffer-local variables will be faster.
    • Also depends on the structure you're going to use for the cache - is it a single cons, or a list, or a tree, etc.
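
A minimal sketch of the pattern mentioned in the Q3 answer (my own construction, not the talk's example file): pay for a fresh copy inside every timed iteration, and time the copy alone as a baseline to subtract.

    (require 'benchmark)
    (let ((data (number-sequence 1 100000)))
      (list
       ;; Baseline: just the cost of copying the sample data.
       (benchmark-run 10 (copy-sequence data))
       ;; Destructive operation applied to a fresh copy each run,
       ;; so later iterations aren't artificially fast.
       (benchmark-run 10 (delete 50000 (copy-sequence data)))))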

BBB:

  • AVL tree
  • defstruct accessors should expand with compiler macros to aref calls, which are very fast
  • They have an extra if, though
  • Oh, you mean for testing whether the value is such a struct?
  • Yes, there is that test, but I wouldn't expect that to make it 3x slower, AFAIK

IRC:

  • If somebody wants to do a remote session with me: I do have processes such as updating column view dynamic blocks that take maybe 40 minutes. So far, I avoid executing those functions when I'm around the computer myself. However, there may be room for improvement, and I really can't tell whether it is in my personal setup or not, because it's not always that easy to re-create a use-case with a plain Emacs config.
  • Thanks for doing this talk. FYI, you might find this bench-multi-lexical macro useful: https://alphapapa.github.io/emacs-package-dev-handbook/#outline-container-Optimization
    • dgutov: I can't seem to find the exact macro you are referring to. But if it covers a use case benchmark-progn does not, consider contributing it to benchmark.el in the core.
    • Sorry, try this link directly to that macro: https://github.com/alphapapa/emacs-package-dev-handbook#bench-multi-lexical The purpose of the macro is to compare different forms and show how they perform relative to each other
    • dgutov: Ah yeah, that looks pretty cool. Less sure about the org format, but it must be nice for presentations.
    • The Org format is good for documentation too. But it just uses the output of benchmark-run, so it could easily be left in Lisp form. :)
    • dgutov: These things are really handy to have available in 'emacs -Q', though. When you're working on shaving some extra handful of percents.
    • Yes, a few lines of code could be added to run the compiled code in a separate Emacs process.
  • https://github.com/alphapapa/emacs-package-dev-handbook compares some common ways to do common things in Elisp so you can see which is generally faster, e.g. https://github.com/alphapapa/emacs-package-dev-handbook#inserting-strings
  • PSA: buffer-local-value is generally much faster than with-current-buffer if all you need to do is get the value of a variable in a buffer (see the sketch after this list)
  • For more info about the performance of overlays vs text properties data structure, there's an Emacs TODO about it. C-h C-t and search for "Move overlays to intervals.c".
  • cl-defstruct getters/setters have compiler macros that expand into simple aref calls on vectors, they are very efficient
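
A quick way to check the buffer-local-value point above (a minimal sketch of my own; exact numbers depend on the machine):

    (require 'benchmark)
    (with-temp-buffer
      (setq fill-column 120)   ; fill-column is automatically buffer-local
      (let ((buf (current-buffer)))
        (list
         ;; Switches buffers (save-current-buffer + set-buffer) on every read.
         (benchmark-run 100000 (with-current-buffer buf fill-column))
         ;; Reads the buffer-local value directly, with no buffer switch.
         (benchmark-run 100000 (buffer-local-value 'fill-column buf)))))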

Links:

Transcript

[00:00:01.120] Hi. Greetings from the cloudy St. Petersburg. My name is Dmitry Gutov. I write Ruby by day, Emacs Lisp by night when I don't do anything else. You might know my work from a number of third-party packages as well as built-in ones. The idea for this talk came out from an improvement request for the performance of grep-like commands using the xref interface and its storage types, container types for the search results, complaining that it's not as fast as it potentially could have been when there are lots of search results coming for a given search. I have noticed myself that when working on it, and probably before. My approach to optimizing Lisp code has changed recently-ish, and I'd like to talk about it here. This talk is for people who already know how to write some Emacs Lisp to solve their problems and who have possibly encountered some performance issues doing that. Also, if you want to contribute some improvements to the code that is already in Emacs or to other packages which are owned by somebody else, it should also be helpful. Let's start. First of all, about Emacs and Emacs Lisp. It's not the fastest language. Let's switch to the notes. I hope the font is big enough to be readable on this video, really hope.

[00:02:36.480] Emacs Lisp is not the fastest of the bunch. The garbage collector creates pauses whenever it needs to sweep data. The interpreter is not the fastest one, even though the native compilation branch is improving on that by twice in certain scenarios, which is good. The standard library or functions for working with collections, for example, are not so uniformly great. So there is some work to be done there. Maybe you can contribute the next improvement in there. And also, if your package displays stuff, it has a visual component, then you might have to deal with some drawbacks of the display engine which might slow to a crawl when your code has... when you're trying to display lines which are a little too long-- trying to print a long line, for example-- or you are using a lot of overlays, if you know what it is. With lots of overlays on the same visual line, in particular.

[00:04:19.840] But okay. The first thing to understand, and I hope everybody who's done some programming in the past knows it: first you write the code correctly, and then you try to benchmark and see where the problems are, if there are any. So first do it right, then do it fast. How do we find the hotspots, the bottlenecks, on the second step? We try to do some profiling, or measure how long the code takes. Emacs has two different profilers.

[00:05:03.039] One is the native profiler, which as I recall was contributed by Tomohiro Matsuyama, the author of the autocomplete package, back in Emacs 24.3, apparently. It's a low-overhead profiler. It's sampling, which is good on one hand but might be less good on the other hand. Let's try using it. So we start with profiler-start. Let's just ask for CPU profiling here, but it also can profile memory allocations. Let's try a search for some string in the current project. Let's do it a few times, maybe three times, so the profiler has more data to work with. Let's call the report. Okay. So here we have the tree, the execution tree with percentages of the time spent in each function. You can unwrap it by going up and down and pressing TAB to unwrap every element. This is weird. Okay, here we see that the actual command that was called only takes 8% of the whole runtime, meaning the input was... The command didn't take enough time for us to really dig into it. Let's try a shorter input with more matches. So profiler-start again, CPU. Let's search for list. Don't mind the minibuffer just yet. Okay. So let's look at the report. We can unwrap it here, and we see 52% of the time was spent doing this, at least according to the profile. We can unwrap it, see the description of any function that was called by hitting d, or even jump to it by tapping f. By going down the stream, we unwrap it with TAB, you can see where time was spent. One of the bigger drawbacks of this profiler is that the arithmetic doesn't always work out, like you might have... This is not a good example, but okay, you have 14% spent in here, but when we expand this entry, we only see like 6%, 3%, and 0%. Different sum, sometimes even bigger than that. So the native profiler can give an accurate picture and it has little overhead, but the specific numbers are not very precise because the principle is probabilistic. Let's stop here.
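
A minimal sketch of that workflow, using the standard profiler.el commands referred to in the talk:

    ;; Sample CPU time; use 'mem or 'cpu+mem to also sample allocations.
    (profiler-start 'cpu)
    ;; ... run the command you want to measure, ideally several times ...
    (profiler-report)   ; tree of time per function: TAB unfolds, d describes, f jumps
    (profiler-stop)     ; stop sampling when done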

[00:09:31.200] There is another package called elp, Emacs Lisp Profiler, which is much older than that. It allows us to instrument just specific functions or a package. We're instrumenting the xref package here. It works through advice. You can see the description of one of the functions in the package, and we see that it has had :around advice added. If we run the same search, we can see -- when it finishes -- the table with the number of times every function in the package has been called, and the time the runtime has spent inside them, sorted by impact, as it understands that. The main problem with this profiler is that it is slower, because it adds overhead to every function call, so it, in the end, might give an impression that functions with lots of calls, which have been called a lot of times, are more important, hotter than they actually are, because it slows down every such function call, including the subsequent calls that these functions do inside the same package. So it's a good way to analyze and drill down to see which functions here take a lot of time, but just keep in mind that sometimes you might end up trying to optimize one of these smaller functions, called a lot or usually smaller, and that the result might not actually affect the production runtime at all. But it's still a good tool, especially when you already know which set of actions you are interested in, and which functions might be slower. elp allows you to instrument a package or a set of functions. Just don't forget to un-instrument them all afterwards, or else your subsequent optimization efforts in that session might not work as well as you might want them to.
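
The corresponding steps with elp look roughly like this (standard elp.el functions):

    (elp-instrument-package "xref")  ; advise every function whose name starts with "xref"
    ;; ... run the slow command ...
    (elp-results)       ; table of call counts and time spent per function
    (elp-restore-all)   ; remove the instrumentation afterwards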

[00:13:01.360] And there's also this very nice, very handy package called benchmark, which unfortunately not everybody knows about. It turns out that a lot of older Emacs Lisp developers, users, package developers usually have some benchmarking macros of their own. But there is this package with perfectly usable interface which doesn't require you to define something else, especially when you are in an emacs -Q session trying to benchmark your code in a bare Emacs. So it has two main endpoints, I would say. First one is the benchmark macro, and the second one is benchmark-progn. benchmark is a function, and benchmark-progn is a macro. The first one you can use by specifying the number of iterations and the form. I hope the minibuffer is easy to read here. For instance, we can take this long list and try nreverse-ing it, and we see how long that takes. Then you can do it, adjust the inner code, and then basically compare to figure out how much and how long nreverse takes in this scenario. Or, we can take a function which we have probably found using one of the previous methods which we anticipate that, as we understand, it takes a while, and annotate it with a benchmark-progn form, which... just execute the body and report, each and every one of them, how long that body execution took. So for instance, here we added a few message calls inside those benchmark-progns, so we can see how long each part of the function xref-matches-in-files takes when it is run. Let's try it. Let's first call emacs-lisp-byte-compile-and-load, so that we're sure that we are running the fastest possible version of this code, so when we do find the bottlenecks, we are sure that they are real and not because of the uncompiled code, macro expansion, or the lack of some substitutions, other substitutions that our code is relying on but which might not be available in the interpreter mode just in the compiled code. Let's run. So we have this list, search for list, and the remaining time is spent during printing of the results, which we didn't annotate with the benchmarking macro, but it's still there. So we can see by switching to the Messages buffer, C-h e, that unquoting was very fast. So the search took 400 milliseconds. It's slower than it would be without the video being recorded, as well as the rest of the measurements here. So the parsing of the results took more than that, and the object allocation took even more, which is unfortunate but it's the reality when we're dealing with a lot of search results, because, well, Emacs Lisp is slower than grep. That's just a given. What can be done and what had been done to improve on this? Well, first of all, let's change the question, because this is more of a retrospective sort of talk rather than talking about future plans.
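
A minimal sketch of the two entry points (both from benchmark.el; the bodies here are throwaway stand-ins):

    ;; `benchmark' is a function: run FORM that many times and message
    ;; the elapsed time and the number of garbage collections.
    (benchmark 10 '(nreverse (number-sequence 1 100000)))

    ;; `benchmark-progn' is a macro: run the body once and message how
    ;; long it took; several of these inside a slow function show which
    ;; part dominates.
    (benchmark-progn
      (dotimes (i 1000)
        (format "%05d" i)))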

[00:19:13.440] What can one do to improve performance? Well, basically, two things: first, you try to make sure that you're writing less code, so that your code does fewer things, fewer iterations of your operations. Basically, it's the realm of choosing your algorithm well. Make sure it has as little complexity as you can reasonably manage.

[00:20:00.240] Another is to try to reduce memory allocations. It's creating conses, the links of the list, of the lists. If you are mapping through a list, creating a new one, you are creating conses. New objects in memory, anyway. And also, when you call string operations like substring, when you create new strings, that's also what you do. When you allocate new memory, that triggers garbage collections later, which affect the performance of your code quite severely. I have found that actually, contrary to my personal intuition, when you try to fine-tune the code working on long lists of elements, long, long collection, large collections, it's even more important to try to avoid or reduce memory allocations rather than try to ensure that less code is running, less operations are performed, because the garbage collector can hit you in the posterior quite suddenly that you will not always... When you measure the input impact, you might not always measure the whole of it. And sometimes even when you have a few operations, you can measure all three of them, but at some point, if you just reduce a lot, remove most of the allocations, all three of them, the improvement might be even bigger than the total of three improvements which you might have measured previously, like separately. So it's something to be on the lookout for, but of course, when you pick the algorithm, it's important not to do more operations than you have to.
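
To make that concrete, here is a small illustration (my own, not from the talk): benchmark-run returns the elapsed time, the number of garbage collections it triggered, and the time spent in GC, so the allocation cost is visible directly.

    (let ((data (number-sequence 1 100000)))
      ;; `reverse' copies the list: ~100k new conses per run, so GCs happen.
      (benchmark-run 100 (reverse data)))

    (let ((data (number-sequence 1 100000)))
      ;; `nreverse' reuses the existing conses (and leaves DATA reversed),
      ;; so it allocates nothing and the GC count stays at zero.
      (benchmark-run 100 (nreverse data)))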

[00:22:52.159] We have examples of both in the recent changes to the xref and project packages, which we can examine here. Let's take a look at this one. This commit message lies a little bit because it's referring to the use of assoc instead of cl-assoc, and the actual change incorporates that, but it also incorporates the second part of the sentence, which actually replaced the use of assoc in there. Curiously, cl-assoc was pretty slow because not only did it create quadratic complexity in the operation, it was also calling the Lisp function equal for every iteration, whereas if we just use assoc there, which was the first version of this change, of the improvement, it became already much faster, but then switching to a hash table turned this lookup from O(n) complexity into, well, an amortized constant one, even better. So, use hash tables, kids. Another commit here is about using inhibit-modification-hooks. So, turns out when you're printing into a buffer, even if you have already disabled the undo history, binding this variable to a non-nil value is pretty good because you are able to avoid running a number of hooks, which improves performance. Next. This one was about moving the file-remote-p call from inside the loop. This function is actually surprisingly slow-ish for the goal, for its purpose, so you don't really want to call it on every file name in the list when you have a lot of them, and especially if your code is running in a buffer visiting a remote file, like through TRAMP. You might end up trying to devise different approaches to avoid checking whether every file name is remote if you're dealing with a list of file names, so this one, take a look, be careful with it. A similar, slower function, with-current-buffer, but not so much. So it all depends on really how often you call it. Sometimes you might want to rewrite your code so that it's called less often. And expand-file-name, which hits the disk. So if you're dealing with file names, you might want to replace it with concatenation at certain points. Okay, back to these changes later. This one just removed an allocation of a cons per match hit, and still it brought a 5% or something like that improvement to the whole command. So that's a pretty significant improvement for a small change like that. Similarly, here we just made sure to avoid a splits... no, a substring call, and probably an allocation of the whole buffer string, but in my testing, that doesn't actually matter much, at least so much, but a substring call per result... If we see, since we changed to manual parsing of the buffer, with the strings delimited with zero bytes, it gave an overall improvement of 20%, again on my machine with a pretty fast SSD, and with a warm disk cache, of course. But still... Going back to this revision, it was actually quite surprising that the migration to a cl-defstruct from eieio, the Common Lisp-inspired object system first introduced in the CEDET tools, that was a bit of a surprise, because not much of my benchmarking actually pointed at it being the problem. Probably because the accessors were not the actual problem, like the oref macros and the code accessing the slots, but the construction, the object construction code, that was where most of the time was spent unnecessarily, maybe doing type-checking, maybe some other stuff. So if you have lots of values you need to treat like objects and do virtual dispatch on in your package, you might want to look into cl-defstruct for them.
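
A sketch in the spirit of the assoc-to-hash-table change (the function names below are made up for illustration, not the actual xref code): looking up each of N keys in an alist is O(n) per lookup, while a hash table with an equal test is amortized constant time.

    ;; Alist version: `assoc' walks the list on every lookup, O(n) each,
    ;; so O(n^2) over the whole input.
    (defun my/count-unique-alist (names)
      (let (seen)
        (dolist (name names (length seen))
          (unless (assoc name seen)
            (push (cons name t) seen)))))

    ;; Hash-table version: one `puthash' per name, amortized constant time.
    (defun my/count-unique-hash (names)
      (let ((seen (make-hash-table :test #'equal)))
        (dolist (name names (hash-table-count seen))
          (puthash name t seen))))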

[00:30:52.240] Going on to the next section, I have prepared the sort of comparison between cl-lib, dash, and seq, the collection libraries we have available for us in Emacs. That is the popular ones, but since I'm running behind on time, I'll probably just summarize the findings. First of all, seq is nice. Its generic approach is probably quite decent for most of the situations, but there are places where it could be optimized better, so instead of having quadratic performance, it could use a hash table, like for instance, dash does here-- in dash, union or delete-dups does in its implementation. The dash itself is curiously fast, at least faster than I might have expected, possibly because of the implementation approach where it uses code generation to avoid function calls, at least some of them, which is interesting. But since both seq and dash avoid mutations--they don't really have mutating counterparts to common functions like you have with cl-remove-if, cl-delete-if, or just cl-remove, cl-delete, it still can be valuable to look into destructive versions of those functions, something from the core library like delete-dups or nreverse, for your code when you're really trying to get as close to the metal or whatever as you can, because avoiding extra allocations, it can really be useful. You can really improve their performance if you don't do a lot of other stuff. delete-consecutive-dups is blazing faster. It only requires pre-sorted strings. What else to say... If you are going to read these measurements, make sure to keep in mind that reverse is not free, so for instance, if we're looking at this comparison between remove and delete, for instance, they're using reverse to avoid modifying the data, the sample data, so we don't have to create it every time. But to compare how much faster delete is than remove, we need to subtract 787 milliseconds from here and from here so it comes out to like 230 milliseconds in this example, the last example, and 100 to 1 second, 250 milliseconds here, so the difference is 5-fold here. Not 2-fold. All right. With this, I'm going to thank you for listening, for watching and I'll be taking questions. Thank you. captions by sachac
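
For reference, a rough sketch of how such a library comparison can be set up (my own construction, not the talk's example file; dash is a separate package from GNU ELPA/MELPA, and the numbers vary by machine and Emacs version):

    (require 'cl-lib)
    (require 'seq)
    (require 'dash)   ; not built in; install it first

    (let ((data (number-sequence 1 100000)))
      (list
       (benchmark-run 10 (cl-remove-if #'cl-evenp data))   ; cl-lib
       (benchmark-run 10 (seq-remove #'cl-evenp data))     ; seq
       (benchmark-run 10 (-remove #'cl-evenp data))))      ; dash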
