What I learned by writing test cases for GNU Hyperbole
Mats Lidell (he, him, his) - IRC: matsl, @matsl@mastodon.acc.sunet.se, matsl@gnu.org
Format: 27-min talk ; Q&A: BigBlueButton conference room
Status: Q&A to be extracted from the room recordings
Talk
Duration: 26:55 minutes
00:03.120 Introduction
03:11.160 ERT: Emacs Lisp Regression Testing
04:14.360 Assertions with should
04:56.920 Running a test case
06:54.560 Debug a test
07:50.380 Commercial break: Hyperbole
09:10.480 Instrument function on the fly
10:39.120 Mocking
14:41.240 cl-letf
15:24.100 Hooks
15:55.720 Side effects and initial buffer state
17:05.100 with-temp-buffer
17:16.520 make-temp-file
17:33.288 buffer-string
18:09.920 buffer-name
18:51.980 major-mode
19:02.680 unwind-protect
20:15.100 Input, with-simulated-input
21:38.460 Running all tests
23:03.220 Batch mode
24:05.060 Skipping tests
26:08.460 Conclusion
Q&A
Description
I'm maintaining GNU Hyperbole. I volunteered at a time when the FSF was asking for a maintainer, since the package was unmaintained. I did not have much Elisp experience, but I had a passion for the package. Not much happened.
To my great delight, a few years ago the author of Hyperbole, Bob Weiner, joined the band and together we started to actively develop Hyperbole again.
One of my focus areas in that work has been to add test cases. We have now gone from no tests to over 300 ert tests for the package. This talk is about my test case journey and what I have learned along the way.
Discussion
Questions and answers
- Q: How many tests do you have for Hyperbole and how would you rate
the test coverage compared to other packages?
- A:
- With all tests, including the interactive ones, we have 354 tests. Having said that, I must point out that the size of the tests can be very different. I tend to split tests so that they are logically (in some sense) different, so that if a test fails it will more likely point you to what the error is. This results in more tests. Code-wise you could collect similar tests into one ert-deftest, making the name of the test point out some group or collection of functions, but I don't do that!
- I have not studied other packages so I don't know how our test coverage compares to other packages. In fact I don't know what code coverage we have. That is another thing to look into.
- Q: One small suggestion, to me 'should' means optional, whereas
'shall' or 'must' means required. Not sure if it is too late to
make a major grammar change like that. Very nice presentation. (I
see
- A: The assertions come from the ert package, so any changes would have to be suggested there. I guess you could make your own version of the assertions using aliases for should et al.
- Q: FYI, you may find this helpful for running Emacs tests/lints,
both from a command line and from within Emacs with a Transient
menu: https://github.com/alphapapa/makem.sh It also works on
remote CI.
- A: Thanks for the suggestion. I did have a look at makem.sh, but it was a long time ago, so I don't remember why we did not try to apply it. I might give it another look now that I have used plain ert more.
- Q: Is it easy to run ad hoc tests inside of an Emacs session, given
the command line scripts you need to run to get a batch test session
running? In other words, can you tweak tests in an Emacs session and
run them right away?
- A:
- Yes, in principle you just load your tests and run them all using ert, giving it the test selector t. That runs all loaded tests.
- If you want to modify a test you can do that. You change it, evaluate it, and run it again, just as you change any function.
- Q: Did you have to change Hyperbole code and design to be more
readily testable as you were increasing your test coverage?
- A:
- Yes, we have done that to a small extent, but we should do more of it. Some Hyperbole functions are large and therefore complicated to test. Splitting them into smaller logical parts can make testing easier.
- Moving code into pure functions and avoiding side effects is also a good thing. Pure functions are easier to test. Maybe the side effects could be separated out into fewer places. This has not been applied yet but is something I have been thinking about. By side effects I here mean things like adding or modifying text in buffers.
- Q: What's the craziest bug you found when writing these tests?
- A: This is not a bug, but I always assumed that giving a prefix argument to a cursor movement would give the same result as hitting the key the same number of times. So C-u 2 C-f would be the same as hitting the C-f key twice. It is not! When moving over a hidden area, the three dots '...' at the end of a folded line in org-mode or outline-mode, you get different behavior. Trying to write a test case for kotl-mode and its folding behavior taught me that.
- Q: Why do you prefer el-mock to mocking using cl-letf? (Question
asked in BBB)
- With cl-letf you need to keep track yourself of whether the mocked functionality is being called or not. The el-mock package does that for you, which is what you normally want. Doing this with cl-letf means the definition becomes longer and more complicated, which sort of blurs the picture. el-mock is more to the point.
- BUT since cl-letf does allow you to define a "new" function, it is more powerful, and it can be the only option in cases where el-mock is too limited. So it is good to know of this possibility with cl-letf when el-mock does not provide what you need.
Transcript
[...] ert-deftest.
In its simplest form, a test has a name, a doc string, and a body.
The doc string is where you typically can give
a detailed description of the test
and has space for more info
than what can be given in the test name.
The body is where all the interesting things happen.
It is here you prepare the test, run it and verify the outcome.
Schematically, it looks like this.
You have the ert-deftest, you have the test name,
and the doc string, and then the body.
It is in the body where everything interesting happens.
The test is prepared, the function of the test is executed,
and the outcome of the test is evaluated.
Did the test succeed or not?
[...] should together with a set of related macros.
should takes a form as argument,
and if the form evaluates to nil,
the test has failed. So let's look at an example.
This simple test verifies that the function +
can add the numbers 2 and 3 and get the result 5.
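The test described here might look like the following minimal sketch. The name example-0 is borrowed from the "example 0" mentioned at the selector prompt; the doc string wording is an assumption, not the text of the original slide.

```elisp
(require 'ert)

;; Schematic form: (ert-deftest NAME () DOCSTRING BODY...)
(ert-deftest example-0 ()
  "Verify that `+' adds 2 and 3 and returns 5."
  ;; `should' makes the test fail if its form evaluates to nil.
  (should (= (+ 2 3) 5)))
```

After evaluating the form, M-x ert with example-0 as the selector runs just this test.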
[...] ert. It takes a test selector.
The test name works as a selector for running just one test.
So here we have the example. Let's evaluate it.
We define it and then we run it using ERT.
As you see, we get prompted for a test selector
but we only have one test case defined at the moment.
It's the example 0. So let's hit RET.
As you see here, we get some output
describing what we have just done.
There is one test case it has passed, zero failed,
zero skipped, total 1 of 1 test case
and some time stamps for the execution.
We also see this green mark here indicating one test case
and that it was successful.
For inspecting the test, we can hit the letter l
which shows all the should forms
that were executed during this test case.
So here we see that we have the should,
one should executed, and we see the form equals to 2,
and it was 5 equals to 5.
So a good example of a successful test case.
[...] ert-deftest can be set up using edebug-defun,
just as a function or macro is set up
or instrumented for debugging. So let's try that.
So we try edebug-defun here.
Now it's instrumented for debugging.
And we run it, ert, and we're inside the debugger,
and we can inspect here what's happening.
Step through it and yes it succeeded just as before.
[...] ert-deftest as an implicit button. An implicit button is basically
a string or pattern
that Hyperbole has assigned some meaning to.
For the string ert-deftest, it is to run the test case.
You activate the button with the action-key.
The standard binding is the middle mouse button,
or from the keyboard, M-RET.
So let's try that.
We move the cursor here and then we type M-RET.
And boom, the test case was executed.
And to run it in debug mode we type C-u M-RET
to get the assist key, and then we're in the debugger.
So that's pretty useful and convenient.
[...] debug-mode.
It allows you to step into a function
and continue debugging from there.
For the cases where your test does not do what you want,
looking at what happens in the function of the test
can be really useful. Let's try that with another example.
So here we have two helper functions, one f1-add,
that uses the built-in + function,
and then we have my-add that uses that function.
So we're going to test my-add.
And then let's run this.
Let's run this using Hyperbole in debug mode,
C-u M-RET. We're in the debugger again,
and let's step up front to my function under test
and then press i for getting it instrumented
and going into it for debugging.
And here we can inspect that it's getting
the arguments 1 and 3,
and it returns the result 4 as expected.
And yes, of course, our test case will then succeed.
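A sketch of what the two helpers and their test could look like. The names f1-add and my-add come from the talk; the function bodies, argument names, and the test name example-1 are assumptions made for illustration.

```elisp
(require 'ert)

(defun f1-add (a b)
  "Return the sum of A and B using the built-in `+'."
  (+ a b))

(defun my-add (a b)
  "Return the sum of A and B by delegating to `f1-add'."
  (f1-add a b))

(ert-deftest example-1 ()
  "Verify that `my-add' adds 1 and 3 and returns 4."
  ;; Stepping into `my-add' from here (i in edebug) instruments it on the fly.
  (should (= (my-add 1 3) 4)))
```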
[...] el-mock.
The workhorse in this package is the with-mock macro.
It looks like this: with-mock followed by a body.
In the execution of the body, stubs and mocks
defined in the body are respected.
Let's look at some examples to make that clearer.
In this case, we have the macro with-mock.
It works so that the expression stub + => 10 is interpreted
so that the function + will be replaced with the stub.
The stub will return 10 regardless of how it is called.
Note that the stub function
does not have to be called at this level
but could be called at any level in the call chain.
By knowing how the function under test is implemented
and how the implementation works,
you can find function calls you want to mock
to force certain behavior that you want to test,
or to avoid calls to external resources, slow calls, etc.
Simply isolate the function under test
and simulate its environment.
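As a hedged illustration of stubbing a call further down the call chain, the sketch below stubs the helper f1-add instead of the primitive + used on the slide, because byte-compiled callers of + may bypass a replaced definition. It assumes the third-party el-mock package is installed; the helper bodies and the test name are assumptions.

```elisp
(require 'ert)
(require 'el-mock)   ; third-party package, assumed to be installed

(defun f1-add (a b)
  "Return the sum of A and B."
  (+ a b))

(defun my-add (a b)
  "Return the sum of A and B by delegating to `f1-add'."
  (f1-add a b))

(ert-deftest example-stub ()
  "With `f1-add' stubbed, `my-add' returns 10 whatever its arguments."
  (with-mock
    ;; The stub ignores its arguments and always returns 10.
    (stub f1-add => 10)
    ;; `f1-add' is not called here directly but one level down.
    (should (= (my-add 1 3) 10))
    (should (= (my-add 100 200) 10))))
```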
Mock is a little bit more sophisticated
and depends on the arguments
that the mock function is called with.
Or more precisely, it is checked
after the with-mock clause
that the arguments match the arguments it was called with
or even if it was called at all.
If it is called with other arguments
there will be an error,
and if it's not called, it is also an error.
So this way, we are sure that the function
we expected to be called actually was called.
An important piece of the testing.
So we are sure that the mock we have provided
actually is triggered by the test case.
So here we have an example of with-mock
where the f1-add function is mocked,
so that if it's called with 2 and 3 as arguments,
it will return 10. Then we have a test case
where we try the my-add function,
as you might remember, and call that with 2 and 3
and see that it should also then return 10
because it's using f1-add.
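The mock example could be sketched as follows, assuming the third-party el-mock package is installed. The helper definitions are repeated so the block stands on its own; their bodies and the test name example-mock are assumptions.

```elisp
(require 'ert)
(require 'el-mock)   ; third-party package, assumed to be installed

(defun f1-add (a b)
  "Return the sum of A and B."
  (+ a b))

(defun my-add (a b)
  "Return the sum of A and B by delegating to `f1-add'."
  (f1-add a b))

(ert-deftest example-mock ()
  "`my-add' must call `f1-add' with exactly the arguments 2 and 3."
  (with-mock
    ;; Unlike a stub, the mock is verified when `with-mock' ends:
    ;; wrong arguments or no call at all makes the test fail.
    (mock (f1-add 2 3) => 10)
    (should (= (my-add 2 3) 10))))
```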
[...] cl-letf.
On rare occasions, the limitations of el-mock mean
you would want to implement a full-fledged function
to be used under test.
Then the macro cl-letf can be useful.
However, you need to handle the case yourself
if the function was not called.
Looking through the test cases where I have used cl-letf,
I think most can be implemented using plain mocking.
The remaining cases are where the args to the mock might be different
due to environment issues.
In that case, a static mock will not work.
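A minimal sketch of the cl-letf approach, showing the manual bookkeeping needed to detect that the replacement was actually called. The helper bodies, the replacement function, and the test name are assumptions for illustration.

```elisp
;;; -*- lexical-binding: t; -*-
(require 'ert)
(require 'cl-lib)

(defun f1-add (a b)
  "Return the sum of A and B."
  (+ a b))

(defun my-add (a b)
  "Return the sum of A and B by delegating to `f1-add'."
  (f1-add a b))

(ert-deftest example-cl-letf ()
  "Replace `f1-add' with a full function and track the call by hand."
  (let ((called nil))
    (cl-letf (((symbol-function 'f1-add)
               (lambda (a b)
                 (setq called t)        ; our own call tracking
                 (* 10 (+ a b)))))      ; arbitrary test implementation
      (should (= (my-add 2 3) 50)))
    ;; cl-letf does not verify this for us, unlike el-mock's `mock'.
    (should called)))
```

The original definition of f1-add is restored when the cl-letf form exits.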
with-temp-buffer: it provides you a temp buffer that you visit,
and afterwards, there is no need to clean up.
This is the first choice if that is all you need.
make-temp-file: If you need a file,
this is the function to use.
It creates a temp file or a directory.
The file can be filled with initial contents.
This needs to be cleaned up after a test.
Moving on to verifying and debugging:
buffer-string: returns the full contents
of the buffer as a string.
That can sound a bit voluminous,
but since tests are normally small, this often works well.
I have in particular found good use of comparing
the contents of buffers with the empty string.
That would give an error, but as we have seen
with the output produced by the should assertion,
this is almost like a print statement
and can be compared with the good old technique
of debugging with print statements.
There might be other ways to do the same
as we saw with debugging.
[...] should clauses
in the middle of the test execution
or after preparing the test input.
Sometimes Emacs can switch buffers in strange ways,
maybe because the test case is badly written,
and making sure your assumptions are correct
is a good sanity check.
Even the ert package does
some buffer and window manipulation for its reporting
that I have not fully learned how to master,
so assertions for checking the sanity of the test are good.
major-mode: Verify the buffer has the proper mode.
Can also be very useful and is a good sanity check.
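The buffer checks above can be combined in one small test. This is only a sketch; the test name and the choice of text-mode are assumptions.

```elisp
(require 'ert)

(ert-deftest example-buffer-state ()
  "Check buffer contents and state at several points in a test."
  (with-temp-buffer
    ;; Sanity check on the initial state: the buffer starts out empty.
    (should (string= (buffer-string) ""))
    (text-mode)
    (should (eq major-mode 'text-mode))
    (insert "Hello")
    ;; Make sure we are still in the buffer we think we are in.
    (should (string-prefix-p " *temp*" (buffer-name)))
    (should (string= (buffer-string) "Hello"))))
```

Temporarily changing the last assertion to compare against the empty string makes the test fail, and the failure report then shows the actual buffer contents, which is the print-statement trick described earlier.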
[...] unwind-protect.
The tool for cleaning up is the unwind-protect form,
which ensures that the unwind forms
are always executed regardless of the outcome of the body.
So if your test fails, you are sure the cleanup is executed.
Let's look at unwind-protect together with
the temporary file example. Many tests look like this.
You create some resource, you call unwind-protect,
you do the test, and then afterwards you do the cleanup.
The cleanup for a file and a buffer is so common
that I have created a helper for that.
It looks like this.
The trick with the buffer-modified flag
is to avoid getting prompted
for killing a buffer that is not saved.
The test buffers are often in a state
where they have been modified but not saved.
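A sketch of such a cleanup helper used together with unwind-protect and make-temp-file. The helper name my-test-delete-file-and-buffer is hypothetical and is not claimed to be the helper shown in the talk; the test name and file contents are also assumptions.

```elisp
(require 'ert)

(defun my-test-delete-file-and-buffer (file)
  "Delete FILE and kill any buffer visiting it without prompting."
  (let ((buf (find-buffer-visiting file)))
    (when buf
      (with-current-buffer buf
        ;; Clear the modified flag so `kill-buffer' does not ask.
        (set-buffer-modified-p nil)
        (kill-buffer))))
  (delete-file file))

(ert-deftest example-temp-file ()
  "Create a temp file, use it, and always clean up."
  (let ((file (make-temp-file "example" nil ".txt" "initial contents")))
    (unwind-protect
        (progn
          (find-file file)
          (should (string= (buffer-string) "initial contents"))
          (insert "modified ")
          (should (buffer-modified-p)))
      ;; Runs whether the assertions above pass or fail.
      (my-test-delete-file-and-buffer file))))
```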
[...] with-simulated-input that gets you around these issues.
This is a macro that allows us
to define a set of characters
that will be read by the function under the test,
and all of this works in batch mode. It looks like this.
We have with-simulated-input,
and then a string of characters, and then a body.
The form takes a string of keys
and runs the rest of the body,
and if input is required,
it is picked from the string of keys.
In our example, the read-string call
will read up until RET,
and then return the characters read.
As you see in the example, space needs to be provided
by the string SPC, and return by the string RET.
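A minimal sketch of such a test, assuming the third-party with-simulated-input package is installed. The prompt text, the input string, and the test name are assumptions.

```elisp
(require 'ert)
(require 'with-simulated-input)   ; third-party package, assumed installed

(ert-deftest example-simulated-input ()
  "Feed keys to `read-string' without a user at the keyboard."
  (should (string= (with-simulated-input "hello SPC world RET"
                     ;; Reads the simulated keys up to RET.
                     (read-string "Say something: "))
                   "hello world")))
```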
[...] ert command.
It prompts for a test selector.
If we give it the selector t,
it will run all tests we have currently defined.
Let's try that with a subset of the Hyperbole tests.
Here is the test folder in the Hyperbole directory.
Let's go up here and load all the demo tests.
And then try to run ert.
Now we see that we have a bunch of test cases.
We can run them all individually,
but we can run them with t instead.
We will run them all at once.
So now, ert is executing all our test cases.
So here we have a nice green display
with all the test cases.
[...] make for repetitive tasks.
So we have a make target
that uses the ert batch functionality,
and this is the line from the Makefile.
This is a bit detailed,
but you see that we have a part here
where we load the test dependencies,
for getting the packages
such as el-mock and with-simulated-input etc. loaded.
I also want to point out here
the setting of auto-save-default to nil
to get rid of the prompt for excessive backup files
that can pile up after running the tests a few times.
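The Emacs Lisp side of such a batch run might be sketched as below. This is not the Makefile line from the talk; the function name my-test-run-batch is hypothetical.

```elisp
(require 'ert)

(defun my-test-run-batch ()
  "Run all loaded ERT tests and exit with a status reflecting the outcome.
Intended to be called from a batch Emacs started by a make target."
  ;; Avoid auto-save files piling up from buffers visited by the tests.
  (let ((auto-save-default nil))
    ;; Selector t means all currently defined tests.
    (ert-run-tests-batch-and-exit t)))
```

A make target could then start Emacs with --batch, load the test dependencies and the test files with -l, and finish with -f my-test-run-batch.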
[...] (skip-unless (not noninteractive)).
So when ert sees that the test should be skipped, it skips it
and makes a note of that,
so you will see how many tests have been skipped.
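For illustration, a test guarded this way could look like the following sketch. The test name and the body are assumptions; only the skip-unless form is taken from the talk.

```elisp
(require 'ert)

(ert-deftest example-interactive-only ()
  "Run only in an interactive Emacs; skipped in batch mode."
  ;; `noninteractive' is non-nil when Emacs runs with --batch.
  (skip-unless (not noninteractive))
  (should (frame-live-p (selected-frame))))
```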
Too bad. We have a number of test cases defined,
and to run them, we need to run them manually. Well sort of.
Not being able to run all tests easily
is a bit counterproductive
since our goal is to run all tests.
There is however no ert function to run tests in batch mode
with an interactive Emacs.
The closest I have got is either
to start the Emacs from the command line
calling the ert function as we just have seen,
and then killing it manually when done;
or add a function to extract the contents of the ERT buffer
when done and echo it to standard output.
This is how it looks in the Makefile
to get the behavior of cut and paste,
getting the ERT output into a file
so we can then kill Emacs
and spit out the content of the ERT buffer.
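One way to approximate this is sketched below. It is not the Makefile content from the talk: the function name is hypothetical, and the sketch assumes that the interactive run finishes before the results buffer, named *ert* by default, is read.

```elisp
(require 'ert)

(defun my-test-run-all-and-dump (file)
  "Run all loaded tests in an interactive Emacs, save the report, and exit.
The contents of the ERT results buffer are written to FILE."
  (let ((auto-save-default nil))
    ;; Selector t means all currently defined tests.
    (ert-run-tests-interactively t))
  (with-current-buffer "*ert*"
    (write-region (point-min) (point-max) file))
  (kill-emacs))
```

A make target could start a non-batch Emacs, load the tests, call this function through --eval, and then print the file to standard output.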
One final word here is that
when you run this in a continuous integration pipeline,
you might not have a TTY for getting Emacs to start,
and that is then another problem
with getting the interactive mode.
Q&A transcript (unedited)
It's you and I. I have a question: How many tests do you have for Hyperbole and how would you rate the test coverage compared to other packages?
Well, that's a tricky one. Shall I spell it out loud and then maybe type it at the same time? So, I believe it's more than 300 test cases now. But I cannot compare the test coverage to any other package. Maybe I can type that later. What do you say, Badal?
Sure, yeah, that's totally fine. Feel free to just answer them with voice.
One small suggestion: to me, 'should' means optional, whereas 'shall' or 'must' means required. Not sure if it is too late to make a major grammar change like that. Very nice presentation.
So thanks for the presentation, but the package ERT, well, it's not something that we have come up with. It's a standard package. So I believe it has been around for a long time. But please feel free to make suggestions, and maybe you can, you know, do a copy or an alias for that, if you believe it makes more sense for your test cases to have that instead.
And then we have another question here. For your info, you may find this helpful for running Emacs tests and lints, both from a command line and from within Emacs with a Transient menu: GitHub, alphapapa, makem.sh. It also works on remote CI.
Yeah, thank you, alphapapa. I think I've looked into that, but we haven't made any use of that. But maybe you'll inspire me to give it another look.
Hi, Bob. Hey, how are you? Congratulations, man. Thanks, Hugh. Thank you.
I have another question here. Is it easy to run ad hoc tests inside an Emacs session, given the command line scripts you need to run to get the batch test session running?
You said it's to run an ad hoc test. I'm not sure I understand that question. Yes, please.
So I think what I understand is that since you have to use some of these command line scripts to get a batch test session running, is it easy to run ad hoc tests in an Emacs session or, like in your experience, has that been difficult?
If you look at the command line, you'll see that it's only a few Emacs functions to call to get that behavior to run the batch tests. So I think we made some support function for that in Hyperbole. So I don't think it's possible out of the box to do it, but it's not complicated to do it.
Right? Just like a new function. So that's ad hoc. You just write your test and you can run it.
[...] but I got the impression it was about running all your tests like we did with the command line. Well, so the question is more about how would you run all your test cases from within Emacs? And the easy answer to that is actually you load all your test case files, and then you run ert with t as the test selector and then it will run all your test cases.
[...] their question a little bit as well, clarifying that. In other words, can you tweak tests in an Emacs session and run them right away? Which I believe, if I understand correctly what Bob was saying, you can basically define or redefine functions on the fly and then have them be run, right?
You just change it and you run it again. And either you have to sort of load it or you can use the commercial thing I did. You use Hyperbole and just hit M-RET on the test case and it will load it and run the test case again. So that's of course what you normally do when you're defining a test or debugging a test case or developing a test case. Just start with something small, just make sure maybe you can prepare the test properly, and run it again and again and again until you're ready with it.
That's a good point.
You can definitely do that and that's part of how I normally develop the test cases. I mean, I start with something small so I can see that I get maybe the right input in the buffer that I want to test on or something, and I expand on that more and more and add more and more.
[...] how many test cases you have. I guess you commented on that, and like what happens, you know, with the CI/CD pipeline, every time we commit, you know, across all the versions and what you have set up there, because, you know, I wish people could see it. You can go and check on GitHub and you can see the logs of any of the builds. But tell them a bit about that, Mats, because I think that's pretty impressive.
[...] CD, part of how we developed this using GitHub and the workflows that you get out of the box from there. So these more than 300 test cases are run for, I think, five different versions of Emacs when we do a pull request or a commit. So that's a good way to ensure that it works from version 27.2 up to the latest master version, because there are some changes in Emacs over different versions that can affect your functions or your code.
[...] under 60 seconds I think you've got all of them run, so you've got pretty extensive testing which does catch interesting bugs here and there, right?
I mean, you normally develop with one version and then you think everything is okay. But then when you test with the different versions, you find out that there are some changes, and there are things you might not sort of keep track of. So that's a way to get noticed that the core developers of Emacs have changed something that you sort of based your code on.
Now I got another question here. Did you have to change Hyperbole code and design to be more readily testable as you were increasing your test coverage?
Well, we haven't done that to a big degree, although I believe that that is an important thing for the future, because some of the Hyperbole functions are very complicated and long and that makes testing them rather difficult. So, at a few places we have sort of broken up functions into smaller pieces so it would be easier to do unit tests of the different parts. But there's a lot more work that has to be done there.
[...] environment in Lisp where we're able to do a lot of interactive bottom-up testing before we even get to lighting tech pieces. So it does tend to be more higher level bugs, I think, that get caught in cross-functional interaction. We had one recently that was an Emacs version change. It had been a function that had existed for a long time. It had an &rest in its argument list, so it would assemble the list of arguments from individual arguments that you would give it, and they decided in a recent version, I think with Stefan's input, to change that to a list and allow the prior behavior, but it would issue a warning if you used the prior behavior. So all of a sudden, the way you were supposed to do it became semi-invalid. And so we started getting the warning, and we've tried to eliminate all those warnings in recent Hyperbole developments. So we're like, what do we do? You know, because we wanted to be backward compatible to where you couldn't use a list. It required you to use individual arguments. And now it's sort of requiring you to do that. And all of that was caught by the automatic testing on it.
So you said, Mats, you were going to tell us what you learned. So what are the major things that you learned in doing all of this work?
All of this work? [...] presentation, but as I was going along, the presentation became like twice as long as fitted into the time we had, so I had to cut it out. But I think some of the core things are still in the presentation.
From a personal perspective, and this might not be hard to realize, forcing yourself to test functions, test code, really forces you to understand the code a little bit better in a way that is easier than just reading the code. I don't know how it is for the rest listening to this, but for me it works so that if I just read the code then I don't become as sharp as I should be, but if I try to write the test case for it then I really need to better understand all the edge cases and all the states etc. that are involved. I think that's one of the learning things I wanted to communicate as well that I don't think I covered in detail in the presentation. Maybe all this, but try it.
One other thing, more from the fun side, is that I really think it's fun to write the tests. So if you don't have tests in your package, you should start doing that because it is fun. It might feel like some extra work, but it really pays off in the long run, especially if you have it in a pipeline where you can run it regularly when you do new commits, et cetera. So, I mean, that's maybe obvious if you look from the commercial side or your work side to do it like that. But even for your hobby project, it can pay off really well.
[...] functionality or we're changing some of the plumbing in the system. You know, you go and you do some surgery and then you run the tests. And sometimes 6 to 10 tests will fail. And you find, you know, it tends to be they're all interconnected and it leads you back to the single source. You fix that and, you know, it could be an edge case, an off-by-one, or sometimes it's an assumption about the way something is used that is not actually always true. And so, Mats is just really good at identifying some of those scenarios and keeping us honest, I guess I would say. So I love, I run it as much as I before, you know, even before I commit something.
So I get to see, you know, if anything has progressed. So yeah, I really recommend this process to people. I haven't seen it done. I don't know any other package that has done it to this level. And it's been working really great for us. And I think, well, we'll see too when we release to the general public.
[...] different packages is not the first thing you look at. So I know there are packages that have testing, a lot of testing, but how much testing they have or not, I don't know. It's not what you normally look into when you look at someone else's code. You look maybe at the functionality side but not at how they've done the quality side. So there could be other packages out there that are well equipped.
[...] writing these tests?
Well, what springs to my mind just now is that I was doing some tests for when you narrow, what do you say? When you, in outlining, when you sort of compress things in an outline, sorry Bob, maybe you have it, when you hide. So I was doing some cursor movement over that. And I always assumed that if you give a prefix argument to a simple cursor movement, like C-f moving one character position, and you want to move 2 or 3 positions, you would do C-u 3 and then C-f and you move 3. I always assumed that that would be exactly the same as if you just hit the key C-f 3 times, but it's not. So it's not a bug, it's a feature, but that was the craziest thing. I spent the night trying to figure out why our code was wrong, but it turns out that's how Emacs behaves. Try it out yourself. Try to move over the 3 dots at the end of that and see what happens. Do it with the cursor, hitting the key, or using a prefix argument, and you see it behaves differently. That was the craziest thing. I think there was some other crazy thing or deep learning also, but I can't come up with it at the moment.
So maybe I can write it in the Q&A later.
[...] but people are welcome to join Mats and Bob here on BigBlueButton to further discuss this. Thank you both. Makaay. Thank you.
I don't know, is it only me and Bob here? So Bob, do you want to say something?
And I'm glad we did this. It takes a lot of energy. I'm just really excited about the progress, and we're actually doing a lot of QA at work, in my professional software work, and looking at how we can do more test-driven development, and so everybody's talking about this. You know, we've got AI over here that can generate test cases. But, strangely enough, with the rapidity of development and web applications, I think the level of testing has gone down in recent years compared to where it used to be, right? Because the pace has gone up. And so I think it's starting to turn again, where people are saying, we can't just release crap into the Webisphere and we have to better ourselves. And with all these advanced tool sets that you have, that you can do CI/CD testing with, I just see it coming around as people develop new things. So that's kind of exciting to me, because I came from a manufacturing culture originally, where our company actually started a lot of the manufacturing quality efforts that you saw in Japan and elsewhere in America for a long time, and that was entirely through testing. We used to just build incredible test cases because we were combining software with hardware. And if the hardware doesn't work and you ship a million units, you're in trouble. So that was just something we had to do. And so it's nice to start to see that curve come around. And I think Mats is very modest, but I think he's really the one that started us down this path and really made it into a reality. So everybody else just gets to benefit from that work. So thanks.
[...] more here, then maybe we should just close this and I go over to write in the etherpad the replies we had.
I see one other person here, I believe Ihor just joined us. Yeah. So if you do want to discuss with Mats and Bob, you're welcome to; otherwise, yeah, we can close the room now.
[...] because I had a power outage, but the part I heard was about the mock library. And you mentioned that you don't like cl-letf, but instead you use mock.
[...] lot more work when you use cl-letf. It's for more ambitious and maybe more complicated cases where you really want to make a new implementation, a test implementation. If you use the mock, you get a lot of things out of the box, verifying that the mock was actually called, for instance, whereas if you do it with cl-letf, you would have to keep track of that yourself. And so, a lot more work.
Oh yeah. [...] used for simple cases actually. Because, just for example, the function always returns the same. And it tends to be a simple lambda that ignores all the input arguments. So that's really trivial most of the time, but I actually thought the opposite, that mock is supposed to be used for non-trivial cases.
Mock was supposed to be used for non-trivial? Yeah, I mean, I don't know how to explain this. I mean, cl-letf can be used for non-trivial cases, definitely. You can define any behavior you want. You can write your own function, but you need to keep track of whether that function is called or not, for instance. So you have to make note that the function was called so you can fire an error in case your function wasn't called, because that would be one error case.
[...] mocked function was actually not called?
You sort of document with the mock also your assumptions about how your code is going to be called. And if those are wrong, you will get an error. So if the implementation would change, for instance, and not call the thing you're mocking, then you will notice that.
But if you use cl-letf, then you will have to keep track of that yourself.
Okay, I see. I see. [...] test. In Org mode, we have a lot of tests and we don't use third-party libraries at all.
Yeah. Yeah. At first I found it very powerful to use that, but then I learned more about how we can use the mocking library for what I needed. And I prefer that at the moment.
Because I had seen it, but I didn't consider that it's going to be useful even in simple cases.
So it's like life, how you turn depends. But maybe I should look more into Org mode and its test cases to learn more about that. So thanks for pointing that out.
It's almost impossible for Org. But yeah, we keep adding more tests.
Someone's typing. I don't know. Any more questions? No? Okay, then I'll go back and try to document this in the etherpad. Thank you everybody. Take care. Bye-bye.
Questions or comments? Please e-mail matsl@gnu.org