What I learned by writing test cases for GNU Hyperbole

Mats Lidell (he, him, his) - IRC: matsl, @matsl@mastodon.acc.sunet.se, matsl@gnu.org

Format: 27-min talk; Q&A: BigBlueButton conference room
Status: Q&A finished, IRC and pad will be archived on this page

00:03.120 Introduction
03:11.160 ERT: Emacs Lisp Regression Testing
04:14.360 Assertions with should
04:56.920 Running a test case
06:54.560 Debug a test
07:50.380 Commercial break: Hyperbole
09:10.480 Instrument function on the fly
10:39.120 Mocking
14:41.240 cl-letf
15:24.100 Hooks
15:55.720 Side effects and initial buffer state
17:05.100 with-temp-buffer
17:16.520 make-temp-file
17:33.288 buffer-string
18:09.920 buffer-name
18:51.980 major-mode
19:02.680 unwind-protect
20:15.100 Input, with-simulated-input
21:38.460 Running all tests
23:03.220 Batch mode
24:05.060 Skipping tests
26:08.460 Conclusion

Duration: 26:55 minutes

Description

I'm maintaining GNU Hyperbole. I volunteered at a time when the FSF was asking for a maintainer, since the package was unmaintained. I did not have much Elisp experience, but I had a passion for the package. Not much happened.

To my great delight, a few years ago the author of Hyperbole, Bob Weiner, joined the band, and together we started to actively develop Hyperbole again.

One of my focus areas in that work has been to add test cases. We have now gone from no tests to over 300 ERT tests for the package. This talk is about my test case journey and what I have learned along the way.

Discussion

Questions and answers

  • Q: How many tests do you have for Hyperbole and how would you rate the test coverage compared to other packages?
    • A: 
      • With all tests, including the interactive ones, we have 354 tests. Having said that, I must point out that the size of the tests can be very different. I tend to split tests so they are logically (in some sense) different, so that if a test fails it will more likely point you to what the error is. This results in more tests. Codewise you could collect similar tests into one ert-deftest, making the name of the test point out some group or collection of functions, but I don't do that!
      • I have not studied other packages, so I don't know how our test coverage compares to theirs. In fact, I don't know what code coverage we have. That is another thing to look into.
  • Q: One small suggestion, to me 'should' means optional, whereas 'shall' or 'must' means required. Not sure if it is too late to make a major grammar change like that :) Very nice presentation. (I see :))
    • A: The assertions come from the ert package, so any changes would have to be suggested there. I guess you could make your own version of the assertions using aliases for should et al.
  • Q: FYI, you may find this helpful for running Emacs tests/lints, both from a command line and from within Emacs with a Transient menu: https://github.com/alphapapa/makem.sh  It also works on remote CI.
    • A: Thanks for the suggestion. I did have a look at makem.sh, but that was a long time ago, so I don't remember why we did not try to apply it. I might give it another look now that I have used plain ert more.
  • Q: Is it easy to run ad hoc tests inside of an Emacs session, given the command line scripts you need to run to get a batch test session running? In other words, can you tweak tests in an Emacs session and run them right away?
    • A: 
      • Yes, in principle you just load your tests and run them all using `ert` and give it the test selector `t`. That runs all loaded tests. 
      • If you want to modify a test you can do that. You change it, evaluate it, and run it again. Just as you change any function.
  • Q: Did you have to change Hyperbole code and design to be more readily testable as you were increasing your test coverage?
    • A: 
      • Yes, we have done that to a small extent, but we should do more of it. Some Hyperbole functions are large and therefore complicated to test. Splitting them into smaller logical parts can make testing easier.
      • Moving code into pure functions and avoiding side effects is also a good thing. Pure functions are easier to test. The side effects could perhaps be separated out into fewer places. This has not been applied yet but is something I have been thinking about. By side effects I here mean things like adding or modifying text in buffers.
  • Q: What's the craziest bug you found when writing these tests?
    • A: This is not a bug, but I always assumed giving a prefix argument to a cursor movement would give the same result as hitting the key the same number of times. So C-u 2 C-f would be the same as hitting the C-f key twice. It is not! When moving over a hidden area, the three dots '...' at the end of a folded line in org-mode or outline-mode, you get different behavior. Trying to write a test case for kotl-mode and its folded behavior taught me that.
  • Q: Why do you prefer el-mock to mocking using cl-letf? (Question asked in BBB)
    • A:
      • With cl-letf you need to keep track of whether the mocked functionality is being called or not. The el-mock package does that for you, which is what you normally want. Doing this with cl-letf means the definition becomes longer and more complicated. It sort of blurs the picture. el-mock is more to the point.
      • BUT since cl-letf does allow you to define a "new" function, it is more powerful, and it can be the only option in cases where el-mock is too limited. So it is good to know of this possibility with cl-letf when el-mock does not provide what you need.

Notes and discussion

Transcript

[00:00:03.120] Introduction

Hi everyone! I'm Mats Lidell. I'm going to talk about my journey writing test cases for GNU Hyperbole and what I learned on the way. So, why write tests for GNU Hyperbole? There is some background. I'm the co-maintainer of GNU Hyperbole together with Bob Weiner. Bob is the author of the package. The package is available through the Emacs package manager and GNU ELPA if you want to try it out. The package has some age. I think it dates back to a first release around 1993, which is also when I got in contact with the package the first time. I was a user of the package for many years. Later, I became the maintainer of the package for the FSF. That was even though I did not have much knowledge of Emacs Lisp, and I still have a lot to learn. A few years ago, we started to work actively on the package, setting up goals and having meetings. So my starting point is that I had experience with test automation from development in C++, Java and Python using different xUnit frameworks like CppUnit and JUnit. That was in my daytime work, where the technique of using pull requests with changes backed up by tests was the daily routine. It was really a requirement for a change to go in to have supporting test cases. I believe that is a quite common setup and requirement these days. I also had been an Emacs user for many years, but with focus on being a user. So as I mentioned, I have limited Emacs Lisp knowledge. When we decided to start to work actively on Hyperbole again, it was natural for me to look into raising the quality by adding unit tests. This also goes hand in hand with running these regularly as part of a build process. All in all, following the current best practice of software development. But since Hyperbole had no tests at all, it would not be enough just to add tests for new or changed functionality. We wanted to add them more broadly; ideally, everywhere. So work started with adding tests here and there based on our gut feeling where it would be most useful.
This work is still ongoing. So this is where my journey starts with much functionality to test, no knowledge of what testing frameworks existed, and not really knowing a lot about Emacs Lisp at all.

[00:03:11.160] ERT: Emacs Lisp Regression Testing

Luckily there is a package for writing tests in Emacs. It is called ERT: Emacs Lisp Regression Testing. It contains both support for defining tests and running them. Defining a test is done with the macro ert-deftest. In its simplest form, a test has a name, a doc string, and a body. The doc string is where you typically can give a detailed description of the test and has space for more info than what can be given in the test name. The body is where all the interesting things happen. It is here you prepare the test, run it and verify the outcome. Schematically, it looks like this. You have the ert-deftest, you have the test name, and the doc string, and then the body. It is in the body where everything interesting happens. The test is prepared, the function under test is executed, and the outcome of the test is evaluated. Did the test succeed or not?

[00:04:14.360] Assertions with should

The verification of a test is performed with one or more so-called assertions. In ERT, they are implemented with the macro should together with a set of related macros. should takes a form as argument, and if the form evaluates to nil, the test has failed. So let's look at an example. This simple test verifies that the function + can add the numbers 2 and 3 and get the result 5.
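The example on the slide is not part of the transcript. A minimal sketch of such a test, assuming the name example-0, could look like this:

```elisp
(require 'ert)

;; The test name `example-0' is an assumption; the body follows the
;; description above: `+' should add 2 and 3 and return 5.
(ert-deftest example-0 ()
  "Verify that `+' adds 2 and 3 and returns 5."
  (should (= (+ 2 3) 5)))
```

If the form given to should evaluates to nil, ERT records the test as failed and reports the form together with the values of its arguments.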

[00:04:56.920] Running a test case

So now we have defined a test case. How do we run it? The ERT package has the function (or rather convenience alias) ert. It takes a test selector. The test name works as a selector for running just one test. So here we have the example. Let's evaluate it. We define it and then we run it using ERT. As you see, we get prompted for a test selector but we only have one test case defined at the moment. It's the example 0. So let's hit RET. As you see here, we get some output describing what we have just done. There is one test case; it has passed, zero failed, zero skipped, total 1 of 1 test case, and some time stamps for the execution. We also see this green mark here indicating one test case and that it was successful. For inspecting the test, we can hit the letter l, which shows all the should forms that were executed during this test case. So here we see that one should was executed, and we see the form comparing the sum of 2 and 3 with 5, and it was 5 equals 5. So a good example of a successful test case.
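The same can be done from Lisp. The sketch below uses a hypothetical test example-run-demo and a hypothetical helper example-run-by-name. It runs one test by name without the results buffer and reports whether it passed:

```elisp
(require 'ert)

;; Hypothetical test, defined here only so that there is something to run.
(ert-deftest example-run-demo ()
  "Trivial test used to demonstrate running a test by name."
  (should (equal (+ 2 3) 5)))

;; Interactively: M-x ert RET example-run-demo RET
;; In the results buffer, `l' lists the `should' forms that were executed.

(defun example-run-by-name (name)
  "Run the ERT test called NAME and return non-nil if it passed."
  (ert-test-passed-p (ert-run-test (ert-get-test name))))
```

Evaluating (example-run-by-name 'example-run-demo) returns non-nil when the test passes.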

[00:06:54.560] Debug a test

So now we've seen how we can run a test case. Can we debug it? Yes. For debugging a test case, the ert-deftest can be set up using edebug-defun, just as a function or macro is set up or instrumented for debugging. So let's try that. So we try edebug-defun here. Now it's instrumented for debugging. And we run it, ert, and we're inside the debugger, and we can inspect here what's happening. Step through it and yes it succeeded just as before.

[00:07:50.380] Commercial break: Hyperbole

It's time for a commercial break! Hyperbole itself can help with running tests and also help with running them in debug mode. That is because Hyperbole identifies the ert-deftest as an implicit button. An implicit button is basically a string or pattern that Hyperbole has assigned some meaning to. For the string ert-deftest, it is to run the test case. You activate the button with the Action Key. The standard binding is the middle mouse button, or from the keyboard, M-RET. So let's try that. We move the cursor here and then we type M-RET. And boom, the test case was executed. And to run it in debug mode we type C-u M-RET to get the Assist Key, and then we're in the debugger. So that's pretty useful and convenient.

[00:09:10.480] Instrument function on the fly

A related useful feature here is the step-in functionality bound to the letter i in debug mode. It allows you to step into a function and continue debugging from there. For the cases where your test does not do what you want, looking at what happens in the function under test can be really useful. Let's try that with another example. So here we have two helper functions: one, f1-add, that uses the built-in + function, and then we have my-add that uses that function. So we're going to test my-add. And then let's run this. Let's run this using Hyperbole in debug mode, C-u M-RET. We're in the debugger again, and let's step up to the function under test and then press i for getting it instrumented and going into it for debugging. And here we can inspect that it's getting the arguments 1 and 3, and it returns the result 4 as expected. And yes, of course, our test case will then succeed.
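The two helpers are only shown on screen. A plausible reconstruction, where the function bodies and the test name my-add-test are assumptions, is:

```elisp
(require 'ert)

;; The names `f1-add' and `my-add' come from the talk; the bodies are
;; assumed from the spoken description.
(defun f1-add (a b)
  "Return the sum of A and B using the built-in `+'."
  (+ a b))

(defun my-add (a b)
  "Return the sum of A and B by delegating to `f1-add'."
  (f1-add a b))

(ert-deftest my-add-test ()
  "Verify that `my-add' adds 1 and 3 and returns 4."
  (should (= (my-add 1 3) 4)))
```

With the test instrumented by edebug-defun, pressing i at the call to my-add instruments that function on the fly and steps into it.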

[00:10:39.120] Mocking

The next tool in our toolbox is mocking. Mocking is needed when we want to simulate the response from a function used by the function under test. That is, a function used in its implementation. This could be for various reasons. One example could be because it would be hard or impossible in the test setup to get the behavior you want to test for, like an external error case. But the mock can also be used to verify that the function is called with a specific argument. We can view it as a way to isolate the function under test from its dependencies. So in order to test the function in isolation, we need to cut out any dependencies on external behavior. Most obvious would be dependencies on external resources, such as web pages. As an example: Hyperbole contains functionality to link you to social media resources and other resources on the net. Testing that would require the test system to call out to the social media resources and would depend on them being available, etc. Nothing technically stops a test case from depending on external resources, but it would, if nothing else, be flaky or slow. It could be part of an end-to-end suite where we want to test that it works all the way. In this case, we want to look at the isolated case that can be run with no dependency on external resources. What you want to do is to replace the function with a mock that behaves as the real function would do. The package I have found and have used for mocking is el-mock. The workhorse in this package is the with-mock macro. It looks like this: with-mock followed by a body. In the execution of the body, stubs and mocks defined in the body are respected. Let's look at some examples to make that clearer. In this case, we have the macro with-mock. It works so that the expression stub + => 10 is interpreted so that the function + will be replaced with the stub. The stub will return 10 regardless of how it is called.
Note that the stub function does not have to be called at this level but could be called at any level in the call chain. By knowing how the function under test is implemented and how the implementation works, you can find function calls you want to mock to force certain behavior that you want to test, or to avoid calls to external resources, slow calls, etc. Simply isolate the function under test and simulate its environment. A mock is a little bit more sophisticated and depends on the arguments that the mock function is called with. More precisely, it is checked after the with-mock clause that the arguments match the arguments it was called with, or even if it was called at all. If it is called with other arguments there will be an error, and if it's not called, it is also an error. So this way, we are sure that the function we expected to be called actually was called. An important piece of the testing. So we are sure that the mock we have provided actually is triggered by the test case. So here we have an example of with-mock where the f1-add function is mocked, so that if it's called with 2 and 3 as arguments, it will return 10. Then we have a test case where we try the my-add function, as you might remember, and call that with 2 and 3 and see that it should also then return 10 because it's using f1-add.
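A sketch of both forms, assuming the third-party el-mock package is installed and using assumed bodies for f1-add and my-add. The stub here replaces f1-add rather than +, since replacing a built-in such as + may have no effect on byte-compiled callers:

```elisp
(require 'ert)
(require 'el-mock)  ; third-party package, assumed to be installed

;; Helper bodies and test names are assumptions for illustration.
(defun f1-add (a b) "Return the sum of A and B." (+ a b))
(defun my-add (a b) "Add A and B via `f1-add'." (f1-add a b))

(ert-deftest my-add-stub-test ()
  "A stub replaces `f1-add' and returns 10 whatever the arguments are."
  (with-mock
    (stub f1-add => 10)
    (should (= (my-add 1 1) 10))))

(ert-deftest my-add-mock-test ()
  "A mock also verifies that `f1-add' is called with 2 and 3."
  (with-mock
    (mock (f1-add 2 3) => 10)
    (should (= (my-add 2 3) 10))))
```

In the second test, el-mock signals an error if f1-add is called with other arguments or is never called, so the test cannot pass by accident.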

[00:14:41.240] cl-letf

Moving over to cl-letf. On rare occasions, the limitations of el-mock mean you would want to implement a full-fledged function to be used under test. Then the macro cl-letf can be useful. However, you need to handle the case yourself if the function was not called. Looking through the test cases where I have used cl-letf, I think most can be implemented using plain mocking. The remaining cases are those where the args to the mock might be different due to environment issues. In that case, a static mock will not work.
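A sketch of the cl-letf approach, with assumed bodies for f1-add and my-add. The replacement function records its arguments so the test itself can check that the call happened:

```elisp
(require 'ert)
(require 'cl-lib)

;; Helper bodies and the test name are assumptions for illustration.
(defun f1-add (a b) "Return the sum of A and B." (+ a b))
(defun my-add (a b) "Add A and B via `f1-add'." (f1-add a b))

(ert-deftest my-add-cl-letf-test ()
  "Replace `f1-add' with a hand-written function and track the call."
  (let ((called nil))
    (cl-letf (((symbol-function 'f1-add)
               (lambda (a b)
                 (setq called (list a b))
                 10)))
      (should (= (my-add 2 3) 10)))
    ;; Nothing checks this automatically, so assert it explicitly.
    (should (equal called '(2 3)))))
```

The original definition of f1-add is restored when the cl-letf form exits, also if the test fails.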

[00:15:24.100] Hooks

Another trick applies to functions that use hooks. You can overload or replace the hooks to do the testing. So you can use the hook function just to do the verification and not do anything useful in the hook. Also, here you need to be careful to make sure the test handler is called and nothing else.
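As an illustration, assuming a hypothetical function my-save that runs a hypothetical hook my-save-hook, a test can bind the hook to a single recording function:

```elisp
(require 'ert)

;; `my-save-hook' and `my-save' are hypothetical, for illustration only.
(defvar my-save-hook nil
  "Hook run by `my-save'.")

(defun my-save ()
  "Function under test; runs `my-save-hook' and returns t."
  (run-hooks 'my-save-hook)
  t)

(ert-deftest my-save-runs-hook-test ()
  "Verify that `my-save' runs the hook exactly once."
  (let* ((calls 0)
         ;; Replace the whole hook value so only the test handler runs.
         (my-save-hook (list (lambda () (setq calls (1+ calls))))))
    (my-save)
    (should (= calls 1))))
```

Binding the hook variable with let replaces any handlers that were already installed, and the previous value is restored after the test.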

[00:15:55.720] Side effects and initial buffer state

So far we have been talking about testing what the function returns. In the best of worlds, we have a pure function that only depends on its arguments and produces no side effects. Many operations produce side effects or operate on the contents of buffers, such as writing a message in the message buffer, changing the state of a buffer, moving point, etc. Hyperbole is not an exception. Quite the contrary. Many of the functions creating links are just about updating buffers. This poses a special problem for tests. The test gets longer since you need to create buffers and files and initialize the contents. Verifying the outcome becomes trickier since you need to make sure you look at the right place. At the end of the test, you need to clean up, both for not leaving a lot of garbage in buffers and files around, and even worse, not cause later tests to depend on the leftovers from the other tests. Here are some functions and variables I have found useful for this.

[00:17:05.100] with-temp-buffer

For creating tests: with-temp-buffer: it provides you a temp buffer that you visit, and afterwards, there is no need to clean up. This is the first choice if that is all you need.
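A minimal sketch with a hypothetical test name; the buffer is created for the body and killed automatically afterwards:

```elisp
(require 'ert)

(ert-deftest temp-buffer-insert-test ()
  "Insert text in a temporary buffer and verify point and contents."
  (with-temp-buffer
    (insert "hello")
    ;; Point starts at 1, so after five characters it is at 6.
    (should (= (point) 6))
    (should (string= (buffer-string) "hello"))))
```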

[00:17:16.520] make-temp-file

make-temp-file: If you need a file, this is the function to use. It creates a temp file or a directory. The file can be filled with initial contents. This needs to be cleaned up after a test. Moving on to verifying and debugging:
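A sketch with a hypothetical test name, assuming Emacs 26.1 or later for the initial-contents argument of make-temp-file. The file is deleted in the cleanup part of unwind-protect:

```elisp
(require 'ert)

(ert-deftest temp-file-contents-test ()
  "Create a temp file with initial contents and read it back."
  ;; Arguments: PREFIX, DIR-FLAG, SUFFIX and the initial TEXT.
  (let ((file (make-temp-file "demo" nil ".txt" "initial text")))
    (unwind-protect
        (with-temp-buffer
          (insert-file-contents file)
          (should (string= (buffer-string) "initial text")))
      ;; The temp file is not removed automatically.
      (delete-file file))))
```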

[00:17:33.288] buffer-string

buffer-string: returns the full contents of the buffer as a string. That can sound a bit voluminous, but since tests are normally small, this often works well. I have in particular found good use of comparing the contents of buffers with the empty string. That would give an error, but as we have seen with the output produced by the should assertion, this is almost like a print statement and can be compared with the good old technique of debugging with print statements. There might be other ways to do the same as we saw with debugging.

[00:18:09.920] buffer-name

buffer-name: Getting the buffer name is good to verify what buffer we are looking at. I often found it useful to check that my assumptions on what buffer I am acting on are correct by adding should clauses in the middle of the test execution or after preparing the test input. Sometimes Emacs can switch buffers in strange ways, maybe because the test case is badly written, and making sure your assumptions are correct is a good sanity check. Even the ert package does some buffer and window manipulation for its reporting that I have not fully learned how to master, so an assertion for checking the sanity of the test is good.

[00:18:51.980] major-mode

Finally, major-mode: Verify the buffer has the proper mode. Can also be very useful and is a good sanity check.
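The three checks can be combined in one test. This is a sketch with a hypothetical buffer name and test name; the commented-out line shows the trick of comparing with the empty string to make ERT print the buffer contents:

```elisp
(require 'ert)

(ert-deftest buffer-sanity-checks-test ()
  "Check buffer name and major mode before checking the contents."
  (let ((buf (generate-new-buffer "demo-sanity")))
    (unwind-protect
        (with-current-buffer buf
          (text-mode)
          (insert "some text")
          ;; Sanity checks: are we in the buffer and mode we expect?
          (should (string-prefix-p "demo-sanity" (buffer-name)))
          (should (eq major-mode 'text-mode))
          ;; Temporarily enable this to see the contents in the ERT report:
          ;; (should (string= (buffer-string) ""))
          (should (string= (buffer-string) "some text")))
      (kill-buffer buf))))
```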

[00:19:02.680] unwind-protect

Finally, cleaning up. unwind-protect. The tool for cleaning up is the unwind-protect form, which ensures that the unwind forms always are executed regardless of the outcome of the body. So if your test fails, you are sure the cleanup is executed. Let's look at unwind-protect together with the temporary file example. Many tests look like this. You create some resource, you call unwind-protect, you do the test, and then afterwards you do the cleanup. The cleanup for a file and a buffer is so common that I have created a helper for that. It looks like this. The trick with the buffer-modified flag is to avoid getting prompted for killing a buffer that is not saved. The test buffers are often in the state where they have not been saved but modified.
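The helper from the slide is not in the transcript. The sketch below uses a hypothetical helper, my-test-delete-file-and-buffer, that follows the description: clear the modified flag, kill the buffer, delete the file.

```elisp
(require 'ert)

(defun my-test-delete-file-and-buffer (file)
  "Delete FILE and kill the buffer visiting it, without prompting.
Hypothetical helper written for this sketch."
  (let ((buf (find-buffer-visiting file)))
    (when buf
      (with-current-buffer buf
        ;; Mark the buffer unmodified so `kill-buffer' does not ask.
        (set-buffer-modified-p nil)
        (kill-buffer))))
  (delete-file file))

(ert-deftest file-cleanup-test ()
  "Modify a file-visiting buffer and clean up even if the test fails."
  (let ((file (make-temp-file "demo")))
    (unwind-protect
        (progn
          (find-file file)
          (insert "unsaved change")
          (should (buffer-modified-p)))
      (my-test-delete-file-and-buffer file))))
```

Creating the resource outside unwind-protect and cleaning it up in the unwind forms keeps the cleanup independent of whether the assertions pass.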

[00:20:15.100] Input, with-simulated-input

Another problem for tests is input. In the middle of execution a function might want to have some interaction with the user. Testing this poses a problem, not only in that the input matters, but also as how even to get the test case to recognize the input!? Ideally the tests are run in batch mode, which in some sense means no user interaction. In batch mode, there is no event loop running. Fortunately, there is a package with-simulated-input that gets you around these issues. This is a macro that allows us to define a set of characters that will be read by the function under test, and all of this works in batch mode. It looks like this. We have with-simulated-input, and then a string of characters, and then a body. The form takes a string of keys and runs the rest of the body, and if input is required, it is picked from the string of keys. In our example, the read-string call will read up until RET, and then return the characters read. As you see in the example, space needs to be provided by the string SPC, and return by the string RET.
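A sketch of such a test, assuming the third-party with-simulated-input package is installed; the test name and the prompt text are made up for illustration:

```elisp
(require 'ert)
(require 'with-simulated-input)  ; third-party package, assumed installed

(ert-deftest read-string-simulated-input-test ()
  "Feed keys to `read-string' without a user at the keyboard."
  ;; The keys are written as for `kbd': SPC is a space, RET ends the input.
  (with-simulated-input "hello SPC world RET"
    (should (string= (read-string "Say hello: ") "hello world"))))
```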

[00:21:38.460] Running all tests

So now we have seen ways to create test cases and even make it possible to run some of them that have I/O in batch mode. But the initial goal was to run them all at once. How do you do that? Let's go back to the ert command. It prompts for a test selector. If we give it the selector t, it will run all tests we have currently defined. Let's try that with the subset of the Hyperbole tests. Here is the test folder in the Hyperbole directory. Let's go up here and load all the demo tests. And then try to run ert. Now we see that we have a bunch of test cases. We can run them all individually, but we can run them with t instead. We will run them all at once. So now, ert is executing all our test cases. So here we have a nice green display with all the test cases.
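Selectors can also be used from Lisp. The sketch below defines two hypothetical tests and a hypothetical helper, demo-run-selected, that runs everything matching a selector and counts the tests that did not pass:

```elisp
(require 'ert)

;; Two hypothetical tests sharing the prefix "demo-".
(ert-deftest demo-first-test ()
  "Multiplication works."
  (should (= (* 2 3) 6)))

(ert-deftest demo-second-test ()
  "Upcasing works."
  (should (string= (upcase "ert") "ERT")))

;; Interactively: M-x ert RET t RET runs every loaded test.
;; A string selector is a regexp on test names: M-x ert RET "^demo-" RET

(defun demo-run-selected (selector)
  "Run all tests matching SELECTOR; return the number that did not pass."
  (let ((failed 0))
    (dolist (test (ert-select-tests selector t) failed)
      (unless (ert-test-passed-p (ert-run-test test))
        (setq failed (1+ failed))))))
```

Calling (demo-run-selected t) would run every test currently defined, just like the selector t in the interactive command.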

[00:23:03.220] Batch mode

So that was fine, but we were still running it manually by calling ert. How could we run it from the command line? ERT comes with functions for running it in batch mode. For Hyperbole, we use make for repetitive tasks. So we have a make target that uses the ert batch functionality, and this is the line from the Makefile. This is a bit detailed, but you see that we have a part here where we load the test dependencies, to get packages such as el-mock and with-simulated-input loaded. We also have... I also want to point out here the setting of auto-save-default to nil to avoid the prompt about excessive backup files that can pile up after running the tests a few times.
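The Makefile line itself is not reproduced in the transcript. As a rough sketch of the same idea, and not Hyperbole's actual setup, a small entry-point file could look like this; the file name run-tests.el, the function my-run-tests-batch and the test/ directory layout are all assumptions:

```elisp
;; run-tests.el: hypothetical batch entry point.
;;
;; Assumed invocation from a shell or a make target:
;;   emacs --batch -l run-tests.el -f my-run-tests-batch

(require 'ert)

;; Do not create auto-save files for buffers visited by the tests.
(setq auto-save-default nil)

(defun my-run-tests-batch ()
  "Load the test files in the assumed test/ directory and run them all."
  (dolist (file (directory-files "test" t "-tests\\.el\\'"))
    (load file nil t))
  ;; Prints a summary and exits Emacs; the exit status is non-zero if
  ;; any test failed.
  (ert-run-tests-batch-and-exit))
```

The non-zero exit status on failure is what lets make, or a continuous integration job, stop when a test fails.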

[00:24:05.060] Skipping tests

Even with the help of simulated input, not all tests can be run in batch mode. They would simply not work there and have to be run in an interactive Emacs with a running event loop. One trick still to be able to use batch mode for automation is to put a guard at the top of each test case as the first thing to be executed, so that it kicks in before anything else and stops Emacs from trying to run the test case. Now, it looks like this: (skip-unless (not noninteractive)). So when ert sees that the test should be skipped, it skips it and makes a note of that, so you will see how many tests have been skipped. Too bad. We have a number of test cases defined, and to run them, we need to run them manually. Well, sort of. Not being able to run all tests easily is a bit counterproductive since our goal is to run all tests. There is however no ert function to run tests in batch mode with an interactive Emacs. The closest I have got is either to start Emacs from the command line calling the ert function as we just have seen, and then killing it manually when done; or add a function to extract the contents of the ERT buffer when done and echo it to standard output. This is how it looks in the Makefile to get the behavior of cut and paste, getting the ERT output into a file so we can then kill Emacs and spit out the content of the ERT buffer. One final word here is that when you run this in a continuous integration pipeline, you might not have a TTY for getting Emacs to start, and that is then another problem with getting the interactive mode.
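A sketch of a guarded test, with a hypothetical test name and a placeholder body standing in for whatever needs an interactive session:

```elisp
(require 'ert)

(ert-deftest interactive-only-test ()
  "Runs only in an interactive Emacs; skipped in batch mode."
  ;; The guard comes first so nothing else is executed in batch mode.
  (skip-unless (not noninteractive))
  ;; Placeholder for code that needs a running event loop.
  (should (framep (selected-frame))))
```

In batch mode ERT reports this test as skipped rather than passed or failed, so the summary shows how many tests still need an interactive run.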

[00:26:08.460] Conclusion

We have reached the end of the talk. If you have any new ideas or have some suggestions for improvements, feel free to reach out because I am still on the learning curve of how to write good test cases. If you look at the test cases we have in Hyperbole and you think they might contradict what I am saying here, it is OK. It is probably right. I have changed the style as I go and we have not yet refactored all tests to benefit from new designs. That is also the beauty of the test case. As long as it serves its purpose, it is not terrible if it is not optimal or does not have the best style. And yes, thanks for listening. Bye.

Questions or comments? Please e-mail matsl@gnu.org