Writing a Language Server in OCaml for Emacs, fun, and profit

Austin Theriault (he/they) - last name pronounced tare -e -o, austin@cutedogs.org

Format: 17-min talk; Q&A: BigBlueButton conference room
Status: Q&A to be extracted from the room recordings

Talk

00:00.000 Introduction
00:16.540 What is Semgrep?
00:40.720 How do we show security bugs early?
01:37.880 What is the Language Server Protocol?
02:29.040 Case study: Rust Analyzer
03:42.760 Rust Analyzer in action
04:09.960 Why is this useful?
05:36.220 So what about Emacs?
06:40.700 Technical part - Brief communication overview
07:58.760 Example request
08:03.380 LSP capabilities
09:23.380 Tips on writing a LS
11:03.480 Supporting a LS through LSP mode in Emacs
12:06.000 Create a client
13:07.300 Add to list of client packages
14:11.680 Add documentation!
14:17.880 Adding commands and custom capabilities
15:01.360 Thanks for listening

Duration: 16:04 minutes

Q&A

Duration: 14:24 minutes

Description

Recently, while working at Semgrep, Inc., I wrote a language server for our SAST tool in OCaml: https://github.com/returntocorp/semgrep/tree/develop/src/language_server. I then added support for it to Emacs: https://github.com/emacs-lsp/lsp-mode/blob/master/clients/lsp-semgrep.el. In this talk I plan to go over what LSP is, why it's important, how to get started writing a language server, and how to support a language server in Emacs.

About the speaker:

Austin Theriault is a software engineer at Semgrep, Inc. working on their SAST tool Semgrep. In this talk he will cover the Language Server Protocol, a way to provide language features to an editor, why it's important to the future of editors, how someone might go about writing a server, and how to integrate it with Emacs.

Discussion

Questions and answers

  • Q: Why not write the LSP server in OCaml? I missed the reasoning to switch to Rust/etc - performance?
    • A: The OCaml "stack" (cross-compilation, libraries, etc.) is less developed than what is available for writing LSP servers in, e.g., TypeScript.
  • Q: What are the corner cases, limitations, and other issues you encountered in implementing an LSP server with client in Emacs, that were surprising?
    • A: Multiple, but performance was the big one, which comes down to implementing caching. The other was delivery/distribution (doing so cross-platform given the OCaml tooling, etc.).

Transcript

[00:00:00.000] Introduction

Hi, I'm Austin Theriault, and this is writing a language server in OCaml for Emacs, fun, and profit. Real quick, who am I? Well, I'm a software engineer at Semgrep. I work on our editor integrations, and I love working on programming languages, editors, and cryptography.

[00:00:16.540] What is Semgrep?

What is Semgrep? We're a small cybersecurity startup whose core product is a SAST tool, which is static application security testing. You can think of it as like a security linter. Normal linters will say, hey, you wrote ugly code, fix it. We'll say, hey, you wrote a SQL injection, fix that. We support 30+ languages, and we have lots of customers all using different IDEs. Why does that matter?

[00:00:40.720] How do we show security bugs early?

Well, our goal is to show security bugs as early as possible in the development cycle. In the industry, we call this shifting left. And so how far left can we shift? The editor. So that's why it matters that our customers have different editors. Our goal is to have Semgrep in the editor show up like other language tooling. And what I mean by that is I wrote some bad OCaml up here, and the editor gave me that red squiggly and said, fix your OCaml, and we want Semgrep to do something similar. And so our goal then is to provide a similar experience to normal language checking. And then since we're a small startup, and there's a ton of different IDEs that our customers use, ideally, we don't want to have to rewrite a plugin for every single type of editor out there. Our other goal is to abstract away editing and language features for editors to one code base. Ideally, we write it once and then plug it into all of them. So how can we do that, though?

[00:01:37.880] What is the Language Server Protocol?

Well, in the process of working on this stuff, I found out about the Language Server Protocol. And what's great about the Language Server Protocol is it's a specification that defines all the ways that these language tools might interact with a development tool. And by development tool, I mean like VS Code, Sublime, Emacs, any of those. And by language tool, I mean something like Pyright, mypy. So what's cool about LSP is that you can separate out those tools into language servers and the development tools into language clients. And because they share this common specification, they can now interact without knowing each other. So it's this great abstraction that means all you have to do is go write one language server and you can hook it up to a bunch of language clients and it'll just work.

[00:02:29.040] Case study: Rust Analyzer

So let's do a quick case study on language servers in LSP, just so you get an idea of why this is super cool. So there's this language server called Rust Analyzer. It's a language server for the Rust language. If you've ever developed in Rust, you'll know that it takes a really long time to compile, but the compiler gives you fantastic feedback. Rust has a lot of advanced language features, so that feedback is super important for developing. And so Rust Analyzer will give you that feedback instantly. Here's a ton of things that it gives you. Code completion, fixes, compiler errors, warnings, type signatures. Rust has a pretty strong type system. It also has this thing called lifetimes. A bunch of advanced language features, and Rust Analyzer helps you manage all that and gives you all that info without having to wait for it to compile. Developing with the Rust Analyzer is just orders of magnitude easier than just trying to write Rust straight. Rust Analyzer, fantastic. They went and they developed it, and now you can go use that in Emacs, NeoVim, VS Code, wherever. So you can develop Rust in a way that's relatively efficient without having to give up your favorite editor.

[00:03:42.760] Rust Analyzer in action

So here's a quick little demo of all the cool things it can do. So you can see I typed an error. It tells me that I wrote an error. I used the incorrect lifetime, which is some advanced language feature, and it'll let me know that. I expanded a Rust macro just there, which is similar to Lisp macros, and then I ran a single unit test, and that's really cool because I ran a single unit test from my editor. I didn't have to go and type any commands or anything. It just worked.

[00:04:09.960] Why is this useful?

So why is this just useful in general for a user? Well, you get the same experience across editors. Like I was saying, you don't have to give up one editor for another so you get some sort of cool language feature. You can easily set up and use language servers made for other editors if developers don't support your editor of choice. Performance is not dependent on the editor. That's fantastic because to do all that Rust stuff, it takes a lot of CPU power, and so that's going to be slow if your editor language is not great, not fast. And then bug fixes, updates, all that, it all comes out at the same time. And then from the developer perspective, well, adding new editors is quick and easy. For reference, when I wrote the Semgrep language server, it took me maybe two or three weeks, but then actually going and setting it up for VS Code, that took an hour. For Emacs, 30 minutes. IntelliJ, maybe another hour. So it took me a day to add support for three different editors, which was I think something like 75% of the market share or something crazy like that. So very quick. You only need one mental model. You don't have to figure out all these different extension mental models, how those editors work, anything like that. And another thing that's cool is you only have to write tests for the language server, not necessarily for the editor. It's great to have just one set of tests that you have to pass.

[00:05:36.220] So what about Emacs?

So why does the Language Server Protocol matter with Emacs? Well, like I was saying before, Emacs gets the benefit from work put into other editors. So we get all this language support, and no one actually has to go and write the Lisp for it or write those tools specific to Emacs. You get the language tooling, the CPU-intensive part of the editors. It can be written in something else. Lisp is fast. It's not that fast. Having that speed is fantastic. It's all asynchronous. It won't slow down Emacs. And then there's this package called lsp-mode, which is an LSP client commonly included in popular Emacs distributions. So a lot of people already have that. If you're using Emacs 29 or greater, you have eglot-mode, which is a lighter weight version of lsp-mode. It's just another LSP client. When I wrote the Semgrep language server, Emacs 29 hadn't come out yet. I'm not going to talk too much about eglot-mode because I did everything in lsp-mode, but I would imagine a lot of this stuff is very similar. Here's a list of some supported languages.

[00:06:40.700] Technical part - Brief communication overview

Now let's get into the technical part. How does LSP actually work? So let's go over how it communicates first. It uses JSON-RPC, which is just kind of like HTTP, but instead of sending plain text, you're sending JSON. So it's just sending JSON back and forth. It's great because it's a way for two programs to communicate without sharing a common programming language. Transport platform agnostic, so it could be stdin, stdout, sockets, whatever. It's just JSON. You can send it over whatever. There's two different types of messages, a request, which requires a response from the other party, and a notification, which does not expect a response. So just a quick little example, a user might open a document, and then it'll send like a textDocument/didOpen and what document it was to the language server, and then they'll change it. Maybe they edit some code and introduce a syntax error. The changes will be sent to the language server, and then the language server will publish diagnostics, which is those red squigglies I was talking about earlier, and say, hey, syntax error or whatever here, or maybe the user says, I want to go to the definition of this function, and then the language server will spit back, hey, this is where that function lives. All very useful, and the communication is relatively simple, which is great.
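To make the "JSON back and forth" part concrete, here is a minimal Python sketch of how a message could be framed on a byte stream such as stdin/stdout. It assumes the LSP base-protocol framing of a Content-Length header, a blank line, then a UTF-8 JSON body; the function names encode_message and read_message are made up for this illustration and are not from the talk or from Semgrep's server.

```python
import io
import json
from typing import Optional


def encode_message(payload: dict) -> bytes:
    """Serialize one JSON-RPC message with a Content-Length header in front."""
    body = json.dumps(payload).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def read_message(stream) -> Optional[dict]:
    """Read one framed message from a binary stream; None at end of stream."""
    content_length = None
    while True:
        line = stream.readline()
        if not line:
            return None  # the stream ended before a full header arrived
        line = line.strip()
        if not line:
            break  # a blank line terminates the header section
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            content_length = int(value.strip())
    if content_length is None:
        raise ValueError("missing Content-Length header")
    body = stream.read(content_length)
    return json.loads(body.decode("utf-8"))


# Simulate a client announcing that a document was opened.
did_open = {
    "jsonrpc": "2.0",
    "method": "textDocument/didOpen",
    "params": {"textDocument": {"uri": "file:///tmp/example.py"}},
}
wire = io.BytesIO(encode_message(did_open))
print(read_message(wire)["method"])
```

Because the framing only needs a readable byte stream, the same two functions would work over a socket file object as well as over standard input and output.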

[00:07:58.760] Example request

This is what it looks like, what a request looks like. Notifications look somewhat similar.
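The slide itself is not part of this transcript, so the following Python sketch only approximates what such messages could look like. The id, method, and params fields follow JSON-RPC 2.0, and textDocument/definition and textDocument/didOpen are method names from the LSP specification; the helper names and the example file URI are assumptions for illustration.

```python
import itertools

# Each request carries a fresh id so the response can be matched to it.
_request_ids = itertools.count(1)


def make_request(method: str, params: dict) -> dict:
    """A request has an id; the other side must answer with the same id."""
    return {"jsonrpc": "2.0", "id": next(_request_ids), "method": method, "params": params}


def make_notification(method: str, params: dict) -> dict:
    """A notification has no id, so no response is expected."""
    return {"jsonrpc": "2.0", "method": method, "params": params}


def make_response(request: dict, result) -> dict:
    """A response repeats the request id and carries the result."""
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}


def expects_response(message: dict) -> bool:
    """True for requests, False for notifications."""
    return "method" in message and "id" in message


definition_request = make_request(
    "textDocument/definition",
    {
        "textDocument": {"uri": "file:///tmp/example.py"},
        "position": {"line": 3, "character": 10},
    },
)
did_open = make_notification(
    "textDocument/didOpen",
    {"textDocument": {"uri": "file:///tmp/example.py"}},
)
print(expects_response(definition_request), expects_response(did_open))
```

The only structural difference between the two message kinds is the presence of the id field, which is why notifications "look somewhat similar".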

[00:08:03.380] LSP capabilities

So now we know how LSP communication works, but how does the actual protocol work? Well, almost all of the protocol is opt-in, meaning you don't have to support the entire specification, you can just pick and choose. Servers and clients will then communicate what part of the protocol they both support, so they'll both say, hey, we support being notified when a user opens a document, or if they're looking for documentation. And so then once they agree upon what they'll both support, then they'll send that stuff, those notifications and requests back and forth. Things like opening and closing files, diagnostics, code completion, hovering over stuff, type signatures, all of that. And what's cool is even though the specification is huge and probably has everything you need, you can go ahead and add custom capabilities if you really want to. So you can just define a custom method, and then now that works for you, and now you can have that in all your editors. For example, Rust Analyzer has structural search and replace, which is like find and replace, but with respect to the structure of the code. And if you choose to go down this route with the custom capabilities, you do have to remember you're going to have to implement it in every client. And that's a little bit more work, but it's better than where we were without LSP.
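As a rough illustration of that opt-in negotiation, the sketch below shows a Python handler for the initialize request that only advertises hover support when the client says it can use it. The field names (capabilities, textDocumentSync, hoverProvider, serverInfo) come from the LSP specification; the handler name, the server name "example-ls", and the choice to gate hover on the client's capabilities are assumptions made for this example, not a description of Semgrep's server.

```python
def handle_initialize(params: dict) -> dict:
    """Answer `initialize` with the subset of the protocol this server supports."""
    client_capabilities = params.get("capabilities", {})
    text_document = client_capabilities.get("textDocument", {})

    server_capabilities = {
        # 1 means the client sends the full document text on every change.
        "textDocumentSync": 1,
    }
    # Only advertise hover if the client declared support for it.
    if "hover" in text_document:
        server_capabilities["hoverProvider"] = True

    return {
        "capabilities": server_capabilities,
        "serverInfo": {"name": "example-ls"},  # hypothetical server name
    }


result = handle_initialize({"capabilities": {"textDocument": {"hover": {}}}})
print(result["capabilities"])
```

After this exchange, both sides know which requests and notifications are worth sending, and anything not advertised is simply never used.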

[00:09:23.380] Tips on writing a LS

So some quick tips on writing a language server. I'm not going to get too into this because it's very application-specific. I wrote Semgrep's in OCaml since our code base was almost all OCaml already, and I wanted to leverage that. Would not recommend unless you also have a code base all in OCaml. Structure is similar to a REST server, so a bunch of independent endpoints. I would do everything functionally if I were you. This is EmacsConf. We're all hopefully used to writing functional Lisp. I would recommend TypeScript or Rust, though, depending on your level of performance that you really need or whatever language you're trying to support ideally. Most languages have some sort of language server protocol already. But if they don't, then it might be easier to do it in that language. TypeScript has a lot of support, a lot of documentation, a lot of examples out there because it was what Microsoft originally intended the language server protocol to be for, for VS Code, which is written in TypeScript. Rust is fast, it's going to take more effort, but it's very fast, and Rust Analyzer has a great library that they use and that they support. So support there, examples there are great. The hard part is not really the language server protocol, but the actual logic. So, like, if you're doing, like, language tooling, you're going to have to do analysis on the code, so you need to do parsing, possibly compiling, all these different advanced features, all these advanced different things. For example, Rust Analyzer will do incremental compilation, which is really, really cool, but that's, like, a whole separate talk. If you're adapting an existing language tool, this stuff is really easy. You're basically just wiring stuff up.
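The "bunch of independent endpoints" structure could be sketched in Python roughly like this: a table from method name to a small handler function, and one dispatch function that routes each incoming message. The endpoint decorator, HANDLERS table, and handler bodies are invented for illustration; the error code -32601 is the standard JSON-RPC "method not found" code.

```python
from typing import Callable, Dict, Optional

# Maps an LSP method name to the function that handles it.
HANDLERS: Dict[str, Callable[[dict], object]] = {}


def endpoint(method: str):
    """Register a handler for one method, much like a route in a web server."""
    def register(func):
        HANDLERS[method] = func
        return func
    return register


@endpoint("textDocument/hover")
def hover(params: dict) -> dict:
    line = params["position"]["line"]
    return {"contents": f"hover text for line {line}"}


@endpoint("textDocument/didOpen")
def did_open(params: dict) -> None:
    return None  # a real server would parse and cache the document here


def dispatch(message: dict) -> Optional[dict]:
    """Route one message; return a response for requests, None for notifications."""
    is_request = "id" in message
    handler = HANDLERS.get(message["method"])
    if handler is None:
        if not is_request:
            return None  # unknown notifications are ignored in this sketch
        return {
            "jsonrpc": "2.0",
            "id": message["id"],
            "error": {"code": -32601, "message": "Method not found: " + message["method"]},
        }
    result = handler(message.get("params", {}))
    if not is_request:
        return None
    return {"jsonrpc": "2.0", "id": message["id"], "result": result}


print(dispatch({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/hover",
    "params": {"position": {"line": 4, "character": 0}},
}))
```

Keeping each handler a plain function of its params makes it easy to test the server without any editor attached, which matches the point made earlier about only needing one set of tests.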

[00:11:03.480] Supporting a LS through LSP mode in Emacs

But, yeah. So, now we know all about LSP and language servers. Say you want to actually add support for a language server in Emacs. How do you do that? Well, let's look at lsp-mode, because, like I said, this is what I'm most familiar with. I'm sure eglot-mode is pretty similar. So, lsp-mode's repository is on GitHub, like everything, and it has a ton of different clients for a ton of different languages and frameworks and tools, like Semgrep, and these are available to anyone who installs lsp-mode. Alternatively, you can make a separate package and just use lsp-mode as a library, but I'm not going to focus on this, because there's already a ton of resources out there on packaging in Emacs. So, our steps, very quickly, are going to look like adding an Emacs Lisp file that contains some logic, add an entry somewhere, so we added a new client to the list of clients, and then do some documentation, because documentation's great.

[00:12:06.000] Create a client

First, creating a client. In the clients/ folder in lsp-mode/, literally just add, like, lsp- whatever it is, require the library, and register a client. Registering a client just means, like, saying what kind of connection it is. It's most likely going to be standard I/O, because that's pretty easy to implement, and then you just pass it the executable that you actually want to run. Say what the activation function is, so this is when the client should start, so you can specify the language or the major mode or whatever, and now your client will start whenever that's triggered, and then finally provide just a server ID, so that way it's easy to keep track of, and then run this lsp-consistency-check function. This just makes sure everything up there is good. You can do more advanced stuff with making an LSP client that I'm not going to get into, but just know that these aren't your only options, and then finally provide your client.

[00:13:07.300] Add to list of client packages

Next, you just have to add your client to the list of clients that lsp-mode supports, and now you've added support for a whole new language, whole new framework, whole new tool to Emacs, and it's taken you, what, like, what is that, 20 lines of Lisp? No, not even, like, 15. 15 lines of Lisp, whole new language for Emacs. It's really exciting. Now that you have your client, let's do some documentation. Go fill out this, like, name, where the repository, the source code is, because free software is great, and you should open source your stuff. Specify the installation command. What's cool about this is this can be run automatically from Emacs, so if it's, like, pip install pyright, right, you can put that there, and Emacs will ask you, do you want to install the language server, and you can hit yes and users will just have it installed for them, and then you can say whether or not it's a debugger. This is completely separate, so there's this thing called DAP, which is the Debug Adapter Protocol, and it's similar to LSP but for debuggers, which is very cool,

[00:14:11.680] Add documentation!

and then finally link to your documentation. Please, please document your stuff.

[00:14:17.880] Adding commands and custom capabilities

If you want to add, like, a custom Emacs function or custom capabilities, it's super easy. It's literally just, like, calling a normal Emacs function. For example, Semgrep normally only scans files when you open them, but we added an Emacs function that will scan your entire project, right, and so that was just a client notification. It was just lsp-notify and then a custom method, and it's great because now you can just scan your project from a simple Emacs function. Requests, very similar to notifications. You send it and then pass it a lambda and do something with the result, and so that's adding custom capabilities.
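On the server side, a custom notification like that could be handled along these lines. This Python sketch is an assumption-heavy illustration: the method name example/scanWorkspace, the toy check for eval( calls, and passing documents in as a dictionary are all invented here and are not Semgrep's actual method or logic. Only textDocument/publishDiagnostics and the diagnostic fields (range, severity, message) come from the LSP specification.

```python
from typing import Dict, List

# Hypothetical custom method name; a real server defines its own.
CUSTOM_SCAN_METHOD = "example/scanWorkspace"


def check_text(text: str) -> List[dict]:
    """Toy stand-in for real analysis: flag every line that calls eval()."""
    diagnostics = []
    for line_number, line in enumerate(text.splitlines()):
        if "eval(" in line:
            diagnostics.append({
                "range": {
                    "start": {"line": line_number, "character": 0},
                    "end": {"line": line_number, "character": len(line)},
                },
                "severity": 2,  # 2 is the LSP code for a warning
                "message": "avoid eval on untrusted input",
            })
    return diagnostics


def handle_scan_workspace(documents: Dict[str, str]) -> List[dict]:
    """Scan every known document and build one diagnostics notification per file."""
    notifications = []
    for uri, text in sorted(documents.items()):
        notifications.append({
            "jsonrpc": "2.0",
            "method": "textDocument/publishDiagnostics",
            "params": {"uri": uri, "diagnostics": check_text(text)},
        })
    return notifications


workspace = {
    "file:///project/a.py": "x = 1\nprint(eval(user_input))\n",
    "file:///project/b.py": "y = 2\n",
}
for notification in handle_scan_workspace(workspace):
    params = notification["params"]
    print(params["uri"], len(params["diagnostics"]))
```

Since the client only sends a notification, the server answers by pushing ordinary diagnostics, so the editor shows the results the same way it shows them for opened files.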

[00:15:01.360] Thanks for listening

That's pretty much it. Thank you for listening. Some resources here. These links are clickable if you get the PDF, if you get the slides. Semgrep: we're hiring! If you want to work on, like, programming language theory stuff, compilers, parsers, editors, email me or go look at our jobs. The LSP specification, this is, like, the holy Bible. It has all the specs, all the types, everything. lsp-mode and the docs. lsp-mode, right, that's where you want to add your client. The docs are great, super useful. Rust Analyzer is just a great reference for language servers in general if you want to write one or if you just want to, like, see how they work. It's all just really well done. It's great code, very readable. And then down here is just a long video tutorial, a longer video tutorial, not by me, by someone else, on how to add a language client to Emacs, but hopefully this is sufficient for y'all, and now it's time for some Q&A.

Captioner: sachac

Q&A transcript (unedited)

who are currently watching, who have questions, put them into the pad that I can ask them. I'm kind of monitoring the IRC concurrently. So the first question that we have on the pad is concerning why you have switched from OCaml. Maybe the person has missed it in the talk, if you've mentioned it. Why have you switched from OCaml to, in this case, I guess, Rust? language server that I wrote mine for my company in OCaml. But I wouldn't recommend it just in general unless like you're doing something specific with OCaml. And the reason for that and I recommended Rust or like TypeScript is like OCaml is great. It's very performant but its cross-compilation story is not great. It's like really hard to cross-compile like from one platform to another. And then like the ecosystem and its standard library is also not great. And like Rust, its cross-compilation is great. Its ecosystem is great. OCaml is great if you need to use it, but it's just it's not ideal. And there's just also no good examples of a language server in OCaml. There's the official like OCaml language server, but they use a ton of super advanced language features, like module functors and a bunch of other random stuff. So it's not really readable. But Rust, there's Rust Analyzer, which is readable. In TypeScript, there's like a million different ones. So it's less of a, not OCaml is like, it's not that OCaml isn't great. It's more of a, these other languages would probably just be easier. So. for example, like NeoVim or some other editors are just running fine because of the so it's a standard LSP specification that you're using. So you can also, for instance, use it in other editors, like for instance, NeoVim or so. It's most, most editors nowadays support it. Like obviously Emacs, NeoVim, Sublime, VS Code, Intel, all the IntelliJ ones. So yeah, that's, that's the fun part. You don't have to write 10 different languages to get a bunch of editor support. So I didn't have really time to hear into your talk.
So I'm sorry if I ask you questions that you have already said. How was the experience of writing an LSP? So have you any knowledge beforehand or do you just read it all on yourself? which is what motivated me to do this talk. Basically, I just looked at the specification, and I knew Rust Analyzer was cool. And so I looked at Rust Analyzer, and I looked at Pyright. And I just went from there. I found out about all this because I was already using Emacs, I already knew about it. I was like, this is going to be easier than something else. So yeah, there's the experience is fine. It's just a lot of wiring stuff up. It's not a lot of like hard thinking until you get to like performance heavy stuff. Like, so for Semgrep, like we're doing a ton of like code parsing and like analyzing. And so that's, it takes up like a ton of processing power. So like for stuff like that, like now you have to think about caching and like ordering things. So that part's hard, but that's more of a, like very much application specific thing. I think not. It's nothing I can see. No questions, that's kind of odd to be honest. I cannot really ask questions concerning LSP specific. Let's call, let's ask something very unspecific concerning the Emacs usage. And when have you started? How did you came through it and stuff like this? me and my friends just were like, got obsessed with Linux for whatever reason. And then like we traveled down like the, like the free software, like we just thought that was like very entertaining and like interesting to read about all the free software stuff. They were like, yeah, that's cool. And so we all started using Linux. And I'm like, well, if I'm using free software, I'm going to use Emacs. And so I started using Emacs just to try it out. And then I kind of got, I feel like, Stockholm syndrome into it. And now I've realized like, I don't know, now that I've done the like actual work to get into Emacs, it's just, there's so much more I can do with it.
But yeah, it was somewhat unintentional. like 2 years ago using Emacs. And also just, oh, there's at first some cool people on YouTube, so System Crafters and people like this. And also, ah, VS Code, I used a lot of VS Code beforehand and then VS Codium because open source and then oh are there any other alternatives and I came to like Neovim and Emacs and often switching around but I stick to Emacs at some point to be honest. cool. I will say that. And also just like I like Vim. Vim is cool but like being able to like write Lisp and like modify your editor on the fly is just like very appealing to me. I don't know, Emacs was tough at first because like all the like default key bindings are just kind of like and then and then I read somewhere someone was like yeah well Richard Stallman uses evil mode so it's okay. I was like alright I can that's like blessing enough for me. Like I'm just gonna switch to evil mode. And I was like, this is way, way better as far as key bindings go. I think, half a year to the default key bindings from Vim beforehand. I switched back to Evil and now I'm using some kind of hybrid styles. It's kind of weird. But we have a question on the pad. So what are the corner cases, limitations, and other issues you encountered in implementing an LSP server with client in Emacs that were surprising? limitations are definitely like, once again, they're going to be very application specific, but it's usually just the performance part. So like I was saying before, right, in general if you're doing language tooling, you're gonna be doing either parsing or interpreting or something like that, which is very just like computationally heavy and so if you're trying to like do that stuff while someone is editing a file right like every keystroke, every like 1 to 2 seconds, if they have a fast computer that's great but a lot of people don't have like that fast of a computer that they can go and like do compilation every single keystroke.
So like, I would say, I would say the like limitation is just how fast your computer is and how good you are at like implementing caching for like whatever you're doing. That's also just the main issues I've run into is just it's a constant uphill battle. People will somehow find larger and larger files. You'll end up with files that are like thousands, like tens of thousands of lines long and you think yeah, surely no one would expect like instantaneous response for like like editing a file that has like tens of thousands of lines, but then they do. As far as corner cases go, I would say the corner case is like, just in general is actually distributing the language server. Cause like writing the language server is fine. Like wiring everything up is fine. But then like, once you actually have to go and distribute it, well, now you're distributing in a binary. Like I was saying before with OCaml, doesn't have great cross-compilation. So for Semgrep for our language server, we target Linux and macOS, and we have a ton of people who use Windows, but compiling OCaml for Windows is basically impossible. So our corner case there, the way we solved it was now we're transpiling OCaml to JavaScript, which is a huge can of worms. Like it's a lot of fun. It's very interesting, but like it's not ideal. And so that's what I was saying before. I recommend like Rust or TypeScript because those are way more portable and a lot easier to install. And you don't have to worry about any of that weird packaging stuff. So yeah, I would say that's like the main corner case and the main limitation is just speed and caching. someone doesn't want to refactor or something. How did you start? So did you have any way to still be relatively performant when they have big files or is it just not supported? I don't care. And the way we ended up doing that, so Semgrep is like you write this generic pattern.
You kind of write the language, but then there's these other symbols and stuff that are included in that, this like meta language. And so what happens is, is most languages get, they get parsed and then into a syntax tree, right? Like whatever the language's syntax tree is, and then they get, the syntax tree gets converted into this, like, we call it like an abstract syntax tree, which is like abstract from like any, like language-specific syntax tree. And so then we can cache that, which is really good because like if someone types something like we don't have to go through and do like the full parsing and like converting, we only have to do it incrementally. And so that's, that's how we dealt with that. Or the other option is that we just, we just cache whatever the previous results are, and then run it asynchronously, and they might get it delayed. But we've ended up doing more AST caching, which is fun and cool. Blaine. If Eglot is a subset of lsp-mode, can Eglot conflict with lsp-mode if both are present in your init.el file? mode a ton, so I'm not 100% sure. I think all of the key bindings and commands, if you just install it out of the box, I think they're different. So I don't think there's like any like overlap as far as that stuff goes but you will have the overlap of like you entered, like you started a major mode for like some language, like they'll both probably start the language server and provide diagnostics and everything. And so then now you're getting like, you're just like doubling the work your computer is doing. So there's that conflict. But if you prefer Eglot mode or lsp-mode for like one language or framework, like one major mode and lsp-mode for the other, I think you should be fine. we have like 1 minute on the stream and then we'll switch back and to the pre-recorded stuff I guess.
interruption but I'm just doing a little bit of time keeping so thank you so much Austin sadly I wasn't able to follow the Q&A because I was in the other track answering questions. If, Austin, you want to stay and answer some more questions, feel free to do so. People tend to start talking as soon as we go off air, and I wouldn't be surprised with LSP that people would do the same. We're gonna move on for this track. We're gonna move on in 20 seconds to the next one. So Floey, thank you for hosting.
Austin, thank you for all your answers. And we'll see you in a bit. Thank you so much, Austin. I'm going to go back running in the background. And thank you, Flowey, for everything. probably a nice day at your work. Yeah, it's still it's like lunchtime for me. 09:00. 9pm. Thanks for the talk. Sorry for the inconvenience was not having any, any questions, really. It's like, there's like no documentation on any of this stuff. So I didn't really expect any. into NeoVim. I write it 1 or 2 things on my own, but never really got really deep into it. And you're gonna see with like compiler design and stuff like this, but not really specific. So I was It's like, it's, LSP is cool, but then you have to like deal with all the like compiler stuff and programming language theory. complicated. I had not really a question, so, but it worked out fine. Thanks for the Q and A. And if I have any questions on OCaml or LSP, you will get an email from me.

Questions or comments? Please e-mail austin@cutedogs.org