Back to the schedule
Previous: Powering-up Special Blocks
Next: Analyze code quality through Emacs: a smart forensics approach and the story of a hack

Incremental Parsing with emacs-tree-sitter

Tuấn-Anh Nguyễn

Download compressed .webm video (26.2M)
Download compressed .webm video (21.8M, highly compressed)
View transcript

Download compressed Q&A .webm video (35.8M)
Download compressed Q&A .webm video (16.4M, highly compressed)

Tree-sitter is a parser generator and an incremental parsing library. emacs-tree-sitter is its most popular Emacs binding, which aims to be the foundation of Emacs packages that understand source code's structure. Examples include better code highlighting, folding, indexing, structural navigation.

In this talk, I will describe the current state of emacs-tree-sitter's APIs and functionalities. I will also discuss areas that need improvements and contribution from the community.

  • Actual start and end time (EST): Start: 2020-11-29T09.49.24; Q&A: 2020-11-29T10.13.56; End: 2020-11-29T10.31.44

Questions

Q20: can we integrate it with Spacemacs Python layer

Q19: The Python mode example was pretty good. Is that something that one can use already?

Yes, already using it at work right now.

Q18: Regarding Emacs integration, will it always need to be a foreign library or can it be included / linked directly in compilation?

Building a parser from source needs Node.js https://tree-sitter.github.io/tree-sitter/creating-parsers#dependencies so I don't know if it'll be in-tree and included at compile time.

Core library dynamic module, would be better to be included in core Emacs eventually. Language definitions might be better distributed separately.

Q17: Is there a link to the slides?

Yes, will post in IRC later.

Slides: https://ubolonton.org/slides/emacs-tree-sitter-emacsconf2020.pdf

Q16: Are there any language major modes that have integrated already?

Not yet (answered during talk).

Typescript: discussing integration, not integrated yet.

Q15: Is it possible to use tree-sitter for structural editing?

Covered by Q4 / Q8 / Q11.

Q14: Is there a folding mode for tree-sitter?

Not yet. There are multiple code folding frameworks inside Emacs, and it's better to integrate with these modes rather than writing something new entirely.

+1 Would be nice if it worked with outshine mode or similar.

Q13: MaxCity on IRC asks: "That pop up M-x window. How do you get that?"

ivy-posframe most likely https://github.com/tumashu/ivy-posframe/. Or not. Cool!

Custom helm code.

Q12: I'm new to the tree-sitter world. Is it easy to install/use it also on Windows? (I have to use winbloat at work)

The usual approach is hoping someone else made a precompiled version for you and download it. Otherwise you'll have to set up a development environment with mingw-msys or whatever.

  • No, both tree-sitter and tree-sitter-langs provide pre-compiled binaries for macOS, Linux, and Windows.

Yes, it should work out-of-the-box on Windows, provided that Emacs was compiled with module support turned on.

Q11: Is it possible to use this for refactoring too?

For the kind of refactoring inside a buffer, it's very doable right now with some glue code. For more extensive refactoring where you want to touch all files in a project, there needs to be some kind of understanding of the language model system, how they are laid out in the filesystem… even files that are not yet loaded into Emacs. That sounds like something a lot more extensive. Sounds like an IDE in Emacs.

Q10: Can language major-mode authors start taking advantage of this now? Or is it intended to be used as a minor-mode?

Minor mode depended on by the major modes.

Q9: I'm completely new to tree-sitter, how do I use it as an end user? Is there an easy example config out there by the organizer or otherwise that shows standard usage with whatever programming language? Or are we not there yet?

Answering own question: Sounds like major mode maintainers need to integrate.

Syntax highlighting is pretty easy to activate https://ubolonton.github.io/emacs-tree-sitter/getting-started/ - nice, tree-sitter-hl-mode looks easy

Need to add more examples to the documentation.

Q8: (Following on from Q4) Could there be a standardised approach to coding automatic refactorings in the future? e.g. so that whichever language mode you are using, you could see a menu of available refactoring operations?

Not sure about this. Most refactoring operations are highly specific to a class of languages. Not one single approach for all the languages, but maybe one for object-oriented languages, one for Lisp-type languages, one for Javascript and Typescript…

I meant the lisp and user interfaces being unified, not the implementations of the refactorings. But maybe it belongs in a separate mode on top. So you could have a defrefactor macro or similar.

Q7: How extensive will the compatibility be between highlighting grammars for Emacs and those for Vim/Neovim with Tree-sitter?

For the time being it looks like nvim-treesitter also uses the S-exp syntax for queries so it shouldn't be too hard. See https://github.com/nvim-treesitter/nvim-treesitter/blob/master/queries/rust/highlights.scm.

  • No effort has been spent on compatibility yet. Each editor has its own existing conventions for highlighting. Having a common set of basic "capture names" is possible, and will require efforts from multiple editor communities. (Emacs and NeoVim for now. The editor that introduced Tree-sitter, Atom, hasn't used these queries for highlighting.)

Q6: Will it ever be possible to write Tree-sitter grammars in a Lisp, or will JS be required?

The grammar part is written in JSON, you don't need to actually understand JS to write it. Using Lisp would merely give you a s-expression version, that wouldn't buy you much.

  • Ah, so all that is needed is (json-encode '(grammar …))? Great!

Q5: Could you show the source that was matched by the parser in the debug view in addition to the grammar part matched?

Q4: Could this be used with packages like smartparens that aim to bring structrual editing to non-s-expression based languages? AST-based refactoring?

It is one of the goals, but not yet achieved.

Q3: Do you think Tree-sitter would be useful for Org buffers? I can imagine it being used to keep a parsed AST of an Org buffer (e.g. like org-element's output) updated in real time.

An obstacle here is Org not having anything anywhere close to a formal grammar, so that would need to be corrected first.

FIXME: Add link to a emacs-tree-sitter project/snippet for org-mode.

Q2: Will Elisp performance be more competitive with GCCEmacs enough to make Tree-sitter in Elisp more attractive?

The point of this project is to reuse other people's efforts, not rewriting them.

It's a possibility. In terms of probability, probably not. It's a huge amount of work. The GC latency is also a fundamental issue.

Q1: Do you think that his package can be included into Emacs/GNU ELPA?

Yes, it is just matter of paperwork.

Notes

Related talks

CategoryTreeSitter

Transcript

Hello, everyone! My name is Tuấn-Anh. I've been using Emacs for about 10 years. Today, I'm going to talk about tree-sitter, a new Emacs package that allows Emacs to parse multiple programming languages in real-time.

[00:00:17.840] So what is the problem statement? In order to support programming functionalities for a particular language, a text editor needs to have some degree of language understanding. Traditionally, text editors have relied very heavily on regular expressions for this. Emacs is no different. Most language major modes use regular expressions for syntax-highlighting, code navigation, folding, indexing, and so on. Regular expressions are problematic for a couple of reasons. They're slow and inaccurate. They also make the code hard to read and write. Sometimes it's because the regular expressions themselves are very hairy, and sometimes because they are just not powerful enough. Some helper code is usually needed to parse more intricate language features. That also illustrates the core problem with regular expressions, in that they are not powerful enough to parse programming languages. An example feature that regular expressions cannot handle very well is string interpolation, which is a very common feature in many modern programming languages.

[00:01:31.680] It would be much nicer if Emacs somehow had structural understanding of source code, like IDEs do. There have been multiple efforts to bring this kind of programming language understanding into Emacs. There are language-specific parsers written in Elisp that can be thought of as the next logical step of the glue code on top of regular expressions, moving from partial local pattern recognition into a full-fledged parser. The most prominent example of this approach is probably the famous js2-mode.

[00:02:06.479] However, this approach has several issues. Parsing is computationally expensive, and Emacs Lisp is not good at that kind of stuff.

[00:02:16.800] Furthermore, maintenance is very troublesome. In order to work on these parsers, first, you have to know Elisp well enough, and then you have to be comfortable with writing a recursive descending parser, while constantly keeping up with changes to the language itself, which can be evolving very quickly, like Javascript, for example.

[00:02:39.360] Together, these constraints significantly reduce the pool of potential maintainers. The biggest issue, though, in my opinion, is lack of the set of generic and reusable APIs. This makes them very hard to use for minor modes that want to deal with cross-cutting concerns across multiple languages.

[00:02:59.920] The other approach which has been gaining a lot of momentum in recent years is externalizing language understanding to another process, also known as language server protocol.

[00:03:12.239] This second approach is actually a very interesting one. By decoupling language understanding from the editing facility itself, the LSP servers can attract a lot more contributors, which makes maintenance easier.

[00:03:27.189] However, they also have several issues of their own. Being a separate process, they are usually more resource-intensive, and depending on the language, the LSP server itself can bring with it a host of additional dependencies external to Emacs, which may be messy to install and manage.

[00:03:50.640] Furthermore, JSON over RPC has pretty high latency. For one-off tasks like jumping to source or on-demand completion, it's great. But for things like code highlighting, the latency is just too much.

[00:04:06.000] I was using Rust and I was following the community effort to improve its IDE support, hoping to integrate some of that into Emacs itself. Then I heard someone from the community mention tree-sitter, and I decided to check it out. Basically, tree-sitter is an incremental parsing library and a parser generator. It was introduced by the Atom editor in 2018. Besides Atom, it is also being integrated into the NeoVim editor, and Github is using it to power their source code analysis and navigation features. It is written in C and can be compiled for all major platforms. It can even be compiled to web assembly to run on the web. That's how Github is using it on their website.

[00:05:00.800] So why is tree-sitter an interesting solution to this problem? There are multiple features that make it an attractive option. It is designed to be fast. By being incremental, the initial parse of a typical big file can take tens of milliseconds, while subsequent incremental processes are sub-millisecond. It achieves this by using structural sharing, meaning replacing only affected nodes in the old tree when it needs to. Also, unlike LSP, being in the same process, it has much lower latency.

[00:05:40.639] Secondly, it provides a uniform programming interface. The same data structures and functions work on parse trees of different languages. Syntax nodes of different languages differ only by their types and their possible child nodes. This is a big advantage over language-specific parsers. Thirdly, it's written in self-contained embeddable C. As I mentioned previously, it can even be compiled to webassembly. This makes integrating it into various editors quite easy without having to install any external dependencies.

[00:06:22.880] One thing that is not mentioned here is that being a parser generator, its grammars are declarative. Together with being editor-independent, this makes the pool of potential contributors much larger.

[00:06:39.139] So I was convinced that tree-sitter is a good fit for Emacs. Last year, I started writing the bindings using dynamic module support introduced in Emacs 25. Dynamic module means there is platform-specific native code involved, but since there are pre-compiled binaries for the three major platforms, it should work in most places. Currently, the core functionalities are in a pretty good shape. Syntax highlighting is working nicely.

[00:07:12.560] The whole thing is split into three packages. tree-sitter is the main package that other packages should depend on. tree-sitter-langs is the language bundle that includes support for most common languages. And finally, the core APIs are in the package tsc, which stands for tree-sitter-core. It is the implicit dependency of the tree-sitter package. The main package includes the minor mode tree-sitter-mode. This provides the base for other major or minor modes to build on. Using Emacs's change tracking hooks, it enables incremental parsing and provides a syntax tree that is always up to date after any edits in a buffer. There is also a basic debug mode that shows the parse tree in another buffer.

[00:08:10.080] Here is a quick demo. Here I'm in an empty Python buffer with tree-sitter enabled. I'm going to turn on the debug mode to see the parse tree. Since the buffer is empty, there is only one node in the syntax tree: the top-level module node. Let's try typing some code. As you can see, as I type into the Python buffer, the syntax tree updates in real time.

[00:09:19.120] The other minor mode included in the main package is tree-sitter-hl-mode. It overrides font-lock mode and provides its own set of phases and customization options It is query-driven. That means instead of regular expressions, it uses a Lisp-like query language to map syntax nodes to highlighting phrases. I'm going to open a python file with small snippets that showcase syntax highlighting. So this is the default highlighting provided by python-mode. This is the highlighting enabled by tree-sitter. As you can see, string interpolation and decorators are highlighted correctly. Function calls are also highlighted. You can also note that property accessors and property assignments are highlighted differently. What I like the most about this is that new bindings are consistently highlighted. This included local variables, function parameters, and property mutations.

[00:10:45.760] Before going through the tree queries and the syntax highlighting customization options, let's take a brief look at the core data structures and functions that tree-sitter provides. So parsing is done with the help of a generic parser object. A single parser object can be used to parse different languages by sending different language objects to it. The language objects themselves are loaded from shared libraries. Since tree-sitter-mmode already handles the parsing part, we will instead focus on the functions that inspect nodes, and in the resulting path tree, we can ask tree-sitter what is the syntax node at point. This is an opaque object, so this is not very useful. We can instead ask what is its type. So its type is the symbol comparison operator.

[00:12:08.959] In tree-sitter, there are two kinds of nodes, anonymous nodes and named nodes. Anonymous nodes correspond to simple grammar elements like keywords, operators, punctuations, and so on. Name nodes, on the other hand, are grammar elements that are interesting enough on their own to have a name, like an identifier, an expression, or a function definition. Name node types are symbols, while anonymous node types are strings. For example, if we are on this comparison operator, the node type should be a string. We can also get other information about the node. For example: what is this text, or where it is in the buffer, or what is its parent.

[00:13:43.199] There are many other APIs to query our node's properties. tree-sitter allows searching for structural patterns within a parse tree. It does so through a Lisp-like language. This language supports matching by node types, field names, and predicates. It also allows capturing nodes for further processing. Let's try to see some examples. So in this very simple query, we just try to highlight all the identifiers in the buffer. This s side tells tree-sitter to capture a node. In the context of the query builder, it's not very important, but in normal highlighting query, this will determine the face used to highlight the note. Suppose we want to capture all the function names, instead of just any identifier. You can improve the query like this. This will highlight the whole definition. But we only want to capture the function name, which means the identifier here. So we move the capture to after the identifier node. If we want to capture the class names as well, we just add another pattern.

[00:16:10.079] Let's look at a more practical example. Here we can see that single-quoted strings and double-quoted strings are highlighted the same. But in some places, because of some coding conventions, it may be desirable to highlight them differently. For example, if the string is single-quoted, we may want to highlight it as a constant. Let's try to see whether we can distinguish these two cases. So here we get all the strings. If we want to see if it's single quotes or double quote strings, we can try looking at the first character of the string-- I mean the first character of the node-- to check whether it's a single quote or a double quote. So for that, we use tree-sitter's support for predicates. In this case, we use a match predicate to check whether the string-- whether the node starts with a single quote. And with this pattern, we only capture the single-quotes strings. Let's try to give it a different face. So we copy the pattern, and we add this pattern for Python only. But we also want to give the capture a different name. Let's say we want to highlight it as a keyword. And now, if we refresh the buffer, we see that single quote strings are highlighted as keywords.

[00:19:14.400] The highlighting patterns can also be set for a single project using directory-local variables. For example, let's take a look at Emacs's source code. So in Emacs's C source, there are a lot of uses of these different macros to define functions, and you can see this is actually the function name, but it's highlighted as the string. So what we want is to somehow recognize this pattern and highlight it. Highlight this part with the function face instead. In order to do that, we put a pattern in this project's directory-local settings file. So we can put this button in the C mode section. And now, if we enable tree-sitter, you can see that this is highlighted as a normal function definition. So this is the function face like we wanted. The pattern for this is actually pretty simple. It's only this part. So if it's a function call where the name of the function is defun, then we highlight the defun as a keyword, and then the first string element, we highlight it as a function name.

[00:21:35.360] Since the language objects are actually native code, they have to be compiled for each platform that we want to support. This will become a big obstacle for tree-sitter adoption. Therefore, I've created a language bundle package, tree-sitter-langs, that takes care of pre-compiling the grammars, the most common grammars for all three major platforms. It also takes care of distributing these binaries and provides some highlighting queries for some of the languages. It should be noted that this package should be treated as a temporary distribution mechanism only, to help with bootstrapping tree-sitter adoption. The plan is that eventually these files should be provided by the language major modes themselves. But in order to do that, we need better tooling, so we're not there yet.

[00:22:40.240] Since the core already works reasonably well, there are several areas that would benefit from the community's contribution. So tree-sitter's upstream language repositories already contain highlighting queries on their own. However, they are pretty basic, and they may not fit well with existing Emacs conventions. Therefore, the language bundle has its own set of highlighting queries. This requires maintenance until language major modes adopt tree-sitter and maintain the queries on their own. The queries are actually quite easy to write, as you've already seen. You just need to be familiar with the language, familiar enough to come up with sensible highlighting patterns. And if you are a maintainer of a language major mode, you may want to consider integrating tree-sitter into your mode, initially maybe as an optional feature. The integration is actually pretty straightforward, especially for syntax highlighting. Or alternatively, you can also try writing a new major mode from scratch that relies on tree-sitter from the very beginning. The code for such a major mode is quite simple. For example, this is the proposed wat-mode for web assembly. The code is just one page of code, not a lot. You can also try writing new minor modes or writing integration packages. For example, a lot of packages may benefit from tree-sitter integration, but no one has written the integration yet.

[00:25:02.960] If you are interested in tree-sitter, you can use these links to learn more about it. I think that's it for me today. I'm happy to answer any questions.

Sunday, Nov 29 2020, ~ 9:56 AM - 10:46 AM EST
Sunday, Nov 29 2020, ~ 6:56 AM - 7:46 AM PST
Sunday, Nov 29 2020, ~ 2:56 PM - 3:46 PM UTC
Sunday, Nov 29 2020, ~ 3:56 PM - 4:46 PM CET
Sunday, Nov 29 2020, ~10:56 PM - 11:46 PM +08

Back to the schedule
Previous: Powering-up Special Blocks
Next: Analyze code quality through Emacs: a smart forensics approach and the story of a hack