Back to the schedule
Previous: Old McCarthy Had a Form
Next: Test blocks
Turbo Bindat
Stefan Monnier
Q&A: live
Duration: 29:48
This talk was also streamed at an alternate time for APAC hours: https://libreau.org/past.html#emacsconf21
If you have questions and the speaker has not indicated public contact information on this page, please feel free to e-mail us at emacsconf-submit@gnu.org and we'll forward your question to the speaker.
Talk
00:01 Introduction 02:06 What is BinDat? 05:27 Conversion to lexical scoping 08:30 The BinDat specification 15:35 New design 17:47 Documentation 19:30 Advantages 21:51 New features 23:08 Examples 27:56 Conclusion 28:28 Negatives
Q&A
00:00 bindat seems very similar to GNU Poke. 00:55 Is your dog's name something Lisp or PL related? 01:15 Is it merged into mainline Emacs, a patch, an external library? 01:35 Are there benchmarks of this vs. the older bindat? 02:13 Do you know of any CL or Scheme libs similar to bindat.el? 02:55 You are a hero of kittens everywhere. Do you have any feline pets as well? 03:47 (Q&A logistics) 05:35 I hope cl-loop is more efficient than intermediate lists... 06:16 BBB chat: Curious: how is GNU Poke more flexible? 07:55 How Stefan got involved with bindat 08:33 BBB chat: What hobbies/interests do you have besides Emacs (and PL)? 09:42 BBB chat: Thoughts on making Emacsconf better? 11:40 BBB chat: Poke's from-scratch DSL vs. building on an existing language 14:10 Winnie the dog interjects. 15:15 BBB chat: Favorite talks so far? 19:00 BBB chat: What kind of dog is Winnie? 20:05 BBB chat: More control over types coming into Elisp? 24:15 Andrea Corallo joins discussion about types and performance. 38:19 BBB chat: Do you plan to add bit-level support? 41:15 Is there an automated way to convert bindat C type specs to Lisp specs? 43:00 BBB chat: That's a classic hard problem that essentially requires a C compiler. 43:51 BBB chat: And there's a problem of object size being arch dependent. 44:54 BBB chat: Parsing a generic .h file is way more difficult. 46:05 BBB chat: Automatic translation is more for automatically writing C bindings. 46:50 Thanks
Description
Table of Contents
Bindat is an ELisp library to help manipulate binary data. This is a niche library that is used by packages such as Websocket, EMMS, and cpio-mode. Its implementation was repeatedly caught harassing hapless kitten while at the same time providing poor service slowly. For Emacs-28, Bindat was rewritten so as to make it more efficient and flexible while respecting the kitten. In this presentation I intent to show how we saved those. Not recommended for birds.
Discussion
- Q1: bindat seems very similar to GNU Poke (except that GNU Poke is a
superset, and then some, with a different syntax). I'm wondering if
it might be good to add a bindat variant that translates to/from
Poke if need be (using libpoke), for sheer insane blazing
native-code JITted speed. (And, later, maybe letting bindat gain
some of the insanely expressive capabilities GNU Poke has got). Its
use of eval blocked this in times past. but now...
- A:GNU Poke is indeed the natural evolution, and is much more powerful. Given the fairly little use of BinDat so far, I'm not sure there will be enough motivation to give access to GNU Poke from Emacs, tho. One of the main benefits of using GNU Poke would probably be that lots of formats are already available for GNU Poke, so you could directly re-use them.
- Q2: Is your dog's name something Lisp or PL related...?
- A:Winnie? I don't think so, no (we didn't choose the name, in any case)
- Q3: This looks amazing! Is it merged into mainline Emacs, a patch,
an external library?
- A: It's in Emacs-28
- Q4: Are there benchmarks of this vs. the older bindat?
- A:There is a benchmark for it in the
elisp-benchmarks
- A:There is a benchmark for it in the
- Q5: Do you know of any CL or Scheme libs similar to bindat.el?
- A: No, but I'd be interested to hear about it if someone else does.
- Q7: You are a hero of kittens everywhere. Do you have any feline
pets as well?
- A: Not yet. If you're near Montreal and you have a kitten for me, I'm interested
- I hope cl-loop is more efficient than building a bunch of intermediate lists when you chain map/filter/reduce operations.
- Curious: how is gnu poke more flexible?
- What hobbies/interests do you have besides Emacs (and PL)?
- do you have any thoughts about how to make EmacsConf even better next year?
- I was surprised to see that a whole new DSL was developed for poke from scratch. Do you think would have been better to develop/improve a library like bindat on top of an existing language instead?
- What are some of your favorite talks from this conf so far?
- what kind of dog is Winnie?
- comment: I hadn't heard of that breed before
- How do you see more control over types (type hints/decl through type specifiers etc) (SBCL like programming model) coming into Elisp?
- Do you plan to add bit-level support?
Other comments:
- I can imagine using bindat to improve Emacs's music player packages
- yes last year the Q&A periods were much longer
- last year some of the presentations were live though
- I've asked this question to them during LPC 2020 but infact haven't got a very satisfactory answer
- If you ever write a library for window management in Emacs, you could call it winnie.el
- hints in unoptimized code should be assertions
- we probably need both ways of compiling: safe and less safe
- I think this is classic problem that is almost impossible to accomplish. many libraries try to do that but in the end the only working ones are relaying on C compilers.
- also you have the problem of size of objects. like how big is a long? this is not specified and is arch dependent
- parsing a generic .h file is way more difficult but is another subject.
- yep, the automatic translation is more for libraries trying to write automatically C bindings
Transcript
[00:00:01.360] Hi. So I'm going to talk today about a fun rewrite I did of the BinDat package. I call this Turbo BinDat. Actually, the package hasn't changed name, it's just that the result happens to be faster. The point was not to make it faster though, and the point was not to make you understand that data is not code. It's just one more experience I've had where I've seen that treating data as code is not always a good idea. It's important to keep the difference. So let's get started.
[00:00:38.881] So what is BinDat anyway? Here's just the overview of basically what I'm going to present. So I'm first going to present BinDat itself for those who don't know it, which is probably the majority of you. Then I'm going to talk about the actual problems that I encountered with this package that motivated me to rewrite it. Most of them were lack of flexibility, and some of it was just poor behavior with respect to scoping and variables, which of course, you know, is bad -- basically uses of eval or, " ;eval is evil." ; Then I'm going to talk about the new design -- how I redesigned it to make it both simpler and more flexible, and where the key idea was to expose code as code instead of having it as data, and so here the distinction between the two is important and made things simpler. I tried to keep efficiency in mind, which resulted in some of the aspects of the design which are not completely satisfactory, but the result is actually fairly efficient. Even though it was not the main motivation, it was one of the nice outcomes. And then I'm going to present some examples.
[00:02:06.007] So first: what is BinDat? Oh actually, rather than present THIS, I'm going to go straight to the code, because BinDat actually had an introduction which was fairly legible. So here we go: this is the old BinDat from Emacs 27 and the commentary starts by explaining what is BinDat? Basically BinDat is a package that lets you parse and unparse basically binary data. The intent is to have typically network data or something like this. So assuming you have network data, presented or defined with some kind of C-style structs, typically, or something along these lines. So you presumably start with documentation that presents something like those structs here, and you want to be able to generate such packets and read such packets, so the way you do it is you rewrite those specifications into the BinDat syntax. So here's the BinDat syntax for the the previous specification. So here, for example, you see the case for a data packet which will have a 'type' field which is a byte (an unsigned 8-bit entity), then an 'opcode' which is also a byte, then a 'length' which is a 16-bit unsigned integer in little endian order, and then some 'id' for this entry, which is 8 bytes containing a zero-terminated string, and then the actual data, basically the payload, which is in this case a vector of bytes, ('bytes' here doesn't doesn't need to be specified) and here we specify the length of this vector. This 'length' here happens to be actually the name of THIS field, so the length of the data is specified by the 'length' field here, and BinDat will understand this part, which is the the nice part of BinDat. And then you have an alignment field at the end, which is basically padding. It says that it is padded until the next multiple of four. Okay. So this works reasonably well. This is actually very nice. With this, you can then call bindat-pack or bindat-unpack, passing it a string, or passing it an alist, to do the packing and unpacking. So, for example, if you take this string-- actually, in this case, it's a vector of bytes but it works the same; it works in both ways-- if you pass this to bindat-unpack, it will presumably return you this structure if you've given it the corresponding type. So it will extract-- you will see that there is an IP address, which is a destination IP, a source IP, and some port number, and some actual data here and there, etc. So this is quite convenient if you need to do this, and that's what it was designed for. So here we are. Let's go back to the actual talk.
[00:05:27.538] I converted BinDat to lexical scoping at some point and things seemed to work fine, except, at some point, probably weeks later, I saw a bug report about the new version using lexical scoping not working correctly with WeeChat. So here's the actual chunk of code that appears in WeeChat. Here you see that they also define a BinDat spec. It's a packet that has a 32-bit unsigned length, then some compression byte/compression information, then an id which contains basically another struct (which is specified elsewhere; doesn't matter here), and after that, a vector whose size is not just specified by 'length', but is computed from 'length'. So here's how they used to compute it in WeeChat. So the length here can be specified in BinDat. Instead of having just a reference to one of the fields, or having a constant, you can actually compute it, where you have to use this '(eval', and then followed by the actual expression where you say how you compute it. And here you see that it actually computes it based on the 'length of the structure -- that's supposed to be this 'length' field here -- and it's referred to using the bindat-get-field to extract the field from the variable 'struct'. And then it subtracts four, it subtracts one, and adds some other things which depend on some field that's found in this 'id' field here. And the problem with this code was that it broke because of this 'struct' variable here, because this 'struct' variable is not defined anywhere in the specification of BinDat. It was used internally as a local variable, and because it was using dynamic scoping, it actually happened to be available here, but the documentation nowhere specifies it. So it was not exactly a bug of the conversion to lexical scoping, but it ended up breaking this code. And there was no way to actually fix the code within the specification of BinDat. You had to go outside the specification of BinDat to fix this problem. This is basically how I started looking at BinDat. Then I went to actually investigate a bit more what was going on, and the thing I noticed along the way was basically that the specification of BinDat is fairly complex and has a lot of eval and things like this.
[00:08:30.749] So let's take a look at what the BinDat specification looks like. So here it's actually documented as a kind of grammar rules. A specification is basically a sequence of items, and then each of the items is basically a FIELD of a struct, so it has a FIELD name, and then a TYPE. Instead of a TYPE, it could have some other FORM for eval, which was basically never used as far as I know, or it can be some filler, or you can have some 'align' specification, or you can refer to another struct. It could also be some kind of union, or it can be some kind of repetition of something. And then you have the TYPE specified here, which can be some integers, strings, or a vector, and there are a few other special cases. And then the actual field itself can be either a NAME, or something that's computed, and then everywhere here, you have LEN, which specifies the length of vectors, for example, or length of strings. This is actually either nil to mean one, or it can be an ARG, where ARG is defined to be either an integer or DEREF, where DEREF is basically a specification that can refer, for example, to the 'length' field -- that's what we saw between parentheses: (length) was this way to refer to the 'length' field. Or it can be an expression, which is what we saw in the computation of the length for WeeChat, where you just had a '(eval' and then some computation of the length of the payload. And so if you look here, you see that it is fairly large and complex, and it uses eval everywhere. And actually, it's not just that it has eval in its syntax, but the implementation has to use eval everywhere, because, if you go back to see the kind of code we see, we see here we just define weechat--relay-message-spec as a constant! It's nothing than just data, right? So within this data there are things we need to evaluate, but it's pure data, so it will have to be evaluated by passing it to eval. It can't be compiled, because it's within a quote, right? And so for that reason, kittens really suffer terribly with uses of BinDat. You really have to be very careful with that. More seriously, the 'struct' variable was not documented, and yet it's indispensable for important applications, such as using in WeeChat. So clearly this needs to be fixed. Of course, we can just document 'struct' as some variable that's used there, but of course we don't want to do that, because 'struct' is not obviously a dynamically scoped variable, so it's not very clean. Also other problems I noticed was that the grammar is significantly more complex than necessary. We have nine distinct non-terminals. There is ambiguity. If you try to use a field whose name is 'align', or 'fill', or something like this, then it's going to be misinterpreted, or it can be misinterpreted. The vector length can be either an expression, or an integer, or a reference to a label, but the expression should already be the general case, and this expression can itself be just a constant integer, so this complexity is probably not indispensable, or it could be replaced with something simpler. That's what I felt like. And basically lots of places allow an (eval EXP) form somewhere to open up the door for more flexibility, but not all of them do, and we don't really want to have this eval there, right? It's not very convenient syntactically either. So it makes the uses of eval a bit heavier than they need to be, and so I didn't really like this part. Another part is that when I tried to figure out what was going on, dog barks and distracts Stefan I had trouble... Winnie as well, as you can hear. She had trouble as well. But one of the troubles was that there was no way to debug the code via Edebug, because it's just data, so Edebug doesn't know that it has to look at it and instrument it. And of course it was not conveniently extensible. That's also one of the things I noticed along the way. Okay, so here's an example of problems not that I didn't just see there, but that were actually present in code. I went to look at code that was using BinDat to see what uses looked like, and I saw that BinDat was not used very heavily, but some of the main uses were just to read and write integers. And here you can see a very typical case. This is also coming from WeeChat. We do a bindat-get-field of the length of some struct we read. Actually, the struct we read is here. It has a single field, because the only thing we want to do is actually to unpack a 32-bit integer, but the only way we can do that is by specifying a struct with one field. And so we have to extract this struct of one field, which constructs an alist containing the actual integer, and then we just use get-field to extract it. So this doesn't seem very elegant to have to construct an alist just to then extract the integer from it. Same thing if you try to pack it: you first have to construct the alist to pass it to bindat-pack unnecessarily. Another problem that I saw in this case (it was in the websocket package) was here, where they actually have a function where they need to write an integer of a size that will vary depending on the circumstances. And so they have to test the value of this integer, and depending on which one it is, they're going to use different types. So here it's a case where we want to have some kind of way to eval -- to compute the length of the integer -- instead of it being predefined or fixed. So this is one of the cases where the lack of eval was a problem. And actually in all of websocket, BinDat is only used to pack and unpack integers, even though there are many more opportunities to use BinDat in there. But it's not very convenient to use BinDat, as it stands, for those other cases.
[00:15:35.891] So what does the new design look like? Well in the new design, here's the problematic code for WeeChat. So we basically have the same fields as before, you just see that instead of u32, we now have 'uint 32' separately. The idea is that now this 32 can be an expression you can evaluate, and so the u8 is also replaced by 'uint 8', and the id type is basically the same as before, and here another difference we see, and the main difference... Actually, it's the second main difference. The first main difference is that we don't actually quote this whole thing. Instead, we pass it to the bindat-type macro. So this is a macro that's going to actually build the type. This is a big difference in terms of performance also, because by making it a macro, we can pre-compute the code that's going to pack and unpack this thing, instead of having to interpret it every time we pack and unpack. So this macro will generate more efficient code along the way. Also it makes the code that appears in here visible to the compiler because we can give an Edebug spec for it. And so here as an argument to vec, instead of having to specify that this is an evaluated expression, we just write the expression directly, because all the expressions that appear there will just be evaluated, and we don't need to use the 'struct' variable and then extract the length field from it. We can just use length as a variable. So this variable 'length' here will refer to this field here, and then this variable 'id' here will refer to this field here, and so we can just use the field values as local variables, which is very natural and very efficient also, because the code would actually directly do that, and the code that unpacks those data will just extract an integer and bind it to the length variable, and so that makes it immediately available there.
[00:17:47.580] Okay, let's see also what the actual documentation looks like. And so if we look at the doc of BinDat, we see the actual specification of the grammar. And so here we see instead of having these nine different non-terminals, we basically have two: we have the non-terminal for TYPE, which can be either a uint, a uintr, or a string, or bits, or fill, or align, or vec, or those various other forms; or it can be a struct, in which case, in the case of struct, then it will be followed by a sequence -- a list of FIELDs, where each of the FIELDs is basically a LABEL followed by another TYPE. And so this makes the whole specification much simpler. We don't have any distinction now between struct being a special case, as opposed to just the normal types. struct is just now one of the possible types that can appear here. The other thing is that the LABEL is always present in the structure, so there's no ambiguity. Also all the above things, like the BITLEN we have here, the LEN we have here, the COUNT for vector we have here, these are all plain Elisp expressions, so they are implicitly evaluated if necessary. If you want them to be constant, and really constant, you can just use quotes, for those rare cases where it's necessary. Another thing is that you can extend it with with bindat-defmacro. Okay, let's go back here.
[00:19:30.226] So what are the advantages of this approach? As I said, one of the main advantages is that we now have support for Edebug. We don't have 'struct', 'repeat', and 'align' as special cases anymore. These are just normal types. Before, there was uint as type, int as type, and those kinds of things. 'struct' and 'repeat' and 'align' were in a different case. So there were some subtle differences between those that completely disappeared. Also in the special cases, there was 'union', and union now has completely disappeared. We don't need it anymore, because instead, we can actually use code anywhere. That's one of the things I didn't mention here, but in this note here, that's one of the important notes. Not only are BITLEN, LEN, COUNT etc. Elisp expressions, but the type itself -- any type itself -- is basically an expression. And so you can, instead of having 'uint BITLEN', you can have '(if blah-blah-blah uint string)', and so you can have a field that can be either string or an int, depending on some condition. And for that reason we don't need a union. Instead of having a union, we can just have a 'cond' or a 'pcase' that will return the type we want to use, depending on the context, which will generally depend on some previous field. Also we don't need to use single-field structs for simple types anymore, because there's no distinction between struct and other types. So we can pass to bindat-pack and bindat-unpack a specification which just says " ;here's an integer" ; and we'll just pack and unpack the integer. And of course now all the code is exposed, so not only Edebug works, but also Flymake, and the compiler, etc. -- they can complain about it, and give you warnings and errors as we like them. And of course the kittens are much happier. Okay. This is going a bit over time, so let's try to go faster.
[00:21:51.273] Here are some of the new features that are introduced. I already mentioned briefly that you can define new types with bindat-defmacro. that's one of the important novelties, and you can extend BinDat with new types this way. The other thing you can do is you can control how values or packets are unpacked, and how they are represented. In the old BinDat, the packet is necessarily represented, when you unpack it, as an alist, basically, or a struct becomes an alist, and that's all there is. You don't have any choice about it. With the new system, by default, it also returns just an alist, but you can actually control what it's unpacked as, or what it's packed from, using these keywords. With :unpack-val, you can give an expression that will construct the unpacked value from the various fields. And with :pack-val and :pack-var, you can specify how to extract the information from the unpacked value to generate the pack value.
[00:23:08.078] So here are some examples. Here's an example taken from osc. osc actually doesn't use BinDat currently, but I have played with it to see what it would look like if we were to use BinDat. So here's the definition of the timetag representation, which represents timestamps in osc. So you would use bindat-type and then you have here :pack-var basically gives a name when we try to pack a timestamp. 'time' will be the variable whose name contains the actual timestamp we will receive. So we want to represent the unpacked value as a normal Emacs timestamp, and then basically convert from this timestamp to a string, or from a string to this timestamp. When we receive it, it will be called time, so we can refer to it, and so in order to actually encode it, we basically turn this timestamp into an integer -- that's what this :pack-val does. It says when we try to pack it, here's the the value that we should use. We turn it into an integer, and then this integer is going to be encoded as a uint 64-bit. So a 64-bit unsigned integer. When we try to unpack the value, this 'ticks' field will contain an unsigned int of 64 bits. We want to return instead a timestamp -- a time value -- from Emacs. Here we use the representation of time as a pair of number of ticks and the corresponding frequency of those ticks. So that's what we do here with :unpack-val, which is construct the cons corresponding to it. With this definition, bindat-pack/unpack are going to convert to and from proper time values on one side, and binary strings on the other. Note, of course, that I complained that the old BinDat had to use single-field structs for simple types, and here, basically, I'm back using single-field structs as well for this particular case -- actually a reasonably frequent case, to be honest. But at least this is not so problematic, because we actually control what is returned, so even though it's a single-field struct, it's not going to construct an alist or force you to construct an alist. Instead, it really receives and takes a value in the ideal representation that we chose. Here we have a more complex example, where the actual type is recursive, because it's representing those " ;LEB" ;... I can't remember what " ;LEB" ; stands for, but it's a representation for arbitrary length integers, where basically every byte is either smaller than 128, in which case it's the end of the of the value, or it's a value bigger than 128, in which case there's an extra byte on the end that's going to continue. Here we see the representation is basically a structure that starts with a byte, which contains this value, which can be either the last value or not, and the tail, which will either be empty, or contain something else. The empty case is here; if the head value is smaller than 128, then the type of this tail is going to be (unit 0), so basically 'unit' is the empty type, and 0 is the value we will receive when we read it. And if not, then it has as type 'loop', which is the type we're defining, so it's the recursive case, where then the rest of the type is the type itself. And so this lets us pack and unpack. We pass it an arbitrary size integer, and it's going to turn it into this LEB128 binary representation, and vice versa. I have other examples if you're interested, but anyway, here's the conclusion.
[00:27:56.094] We have a simpler, more flexible, and more powerful BinDat now, which is also significantly faster. And I can't remember the exact speed-up, but it's definitely not a few percents. I vaguely remember about 4x faster in my tests, but it's probably very different in different cases so it might be just 4x, 2x -- who knows? Try it for yourself, but I was pretty pleased, because it wasn't the main motivation, so anyway...
[00:28:28.336] The negatives are here. In the new system, there's this bindat-defmacro which lets us define, kind of, new types, and bindat-type also lets us define new types, and the distinction between them is a bit subtle; it kind of depends on... well it has an impact on efficiency more than anything, so it's not very satisfactory. There's a bit of redundancy between the two. There is no bit-level control, just as before. We can only manipulate basically bytes. So this is definitely not usable for a Huffman encoding kind of thing. Also, it's not nearly as flexible as some of the alternatives. So you know GNU Poke has been a vague inspiration for this work, and GNU Poke gives you a lot more power in how to specify the types, etc. And of course one of the main downsides is that it's still not used very much. Actually the new BinDat is not used by any package as far as I know right now, but even the old one is not used very often, so who knows whether it's actually going to work very much better or not? Anyway, this is it for this talk. Thank you very much. Have a nice day. (captions by John Cummings)
Back to the schedule
Previous: Old McCarthy Had a Form
Next: Test blocks