Don't Write Tests!
by Martin Penckert on 18.10.2019
… is the title of a famous talk by John Hughes that got me into Property-Based Testing. And that is exactly what this blog post is about.
Throughout this blog post I will use Erlang code for examples.
At Sandstorm, we have adopted Behavioural Testing and often follow the Behaviour Driven Development approach in writing software. That ensures a product that is well tested and adjusted to the needs of our clients.
In traditional test writing, we always need to come up with examples of success and failure cases.
A unit test for a list reverse function could look like this:
reverse_test() -> ?assert( reverse(reverse([1, 2, 3])) =:= [1, 2, 3] ).
But that is only one test case. All we learn is that reverse works correctly with [1, 2, 3]. For all we know, it could contain a sort function or just always return [1, 2, 3]. So let’s check that:
reverse_test() ->
    ?assert( reverse(reverse([1, 2, 3])) =:= [1, 2, 3] ),
    ?assert( reverse(reverse([3, 2, 1])) =:= [3, 2, 1] ).
If the test still holds, we learn that there is most likely no (simple) sort function involved. But for all we know now, reverse only works for lists of three integer values. Let’s add more test cases to check whether it works for fewer than three elements, more than three elements, types other than integers, empty lists, and lists of empty lists; and don’t forget about failing cases:
reverse_test() ->
    ?assert( reverse(reverse([1, 2, 3])) =:= [1, 2, 3] ),
    ?assert( reverse(reverse([3, 2, 1])) =:= [3, 2, 1] ),
    ?assert( reverse(reverse([1, 2])) =:= [1, 2] ),
    ?assert( reverse(reverse([1])) =:= [1] ),
    ?assert( reverse(reverse([1, 4, 3, 2])) =:= [1, 4, 3, 2] ),
    ?assert( reverse(reverse([1.0, 4.2, 3.9, 2.3])) =:= [1.0, 4.2, 3.9, 2.3] ),
    ?assert( reverse(reverse(["Hello", "World"])) =:= ["Hello", "World"] ),
    ?assert( reverse(reverse(["Hello"])) =:= ["Hello"] ),
    ?assert( reverse(reverse([])) =:= [] ),
    ?assert( reverse(reverse([[], []])) =:= [[], []] ),
    ?assert( reverse(reverse([[1, 2], []])) =:= [[1, 2], []] ),
    ?assert( reverse(reverse([1, 2])) =/= [2, 1] ).
So … this is getting tedious. But we do have all edge cases covered now, have we not? Maybe. What about very long lists? Like with a hundred, a thousand or even more entries? Do we want to write test cases for those?
This is a very simple function to test, though. Think about more complex functions with more logic, more rules and more edge cases. If we write a dozen test examples per feature, and our software has a couple of dozen or even hundreds of features, we will reach a thousand test cases very quickly, and still we cannot be sure that we have all edge cases covered.
Now, to raise the stakes: we might experience bugs only when two features run together. Hello, integration tests. If the number of test cases grew like O(n) for unit tests, it now grows like O(n²) for integration tests. That seems less fun. To take this even further: some bugs may only appear when three features run together. We might end up needing O(n³) test cases to ensure that all possible edge cases are covered. And that is where all the fun ends.
And we haven’t even touched parallelism and concurrency yet.
To handle that much complexity, we might decrease the number of test cases per feature as our system grows and test only the critical cases (add in your mind: the ones we can think of). Yet we know we should increase the number of test cases, because more features add complexity and make it more than likely that we introduce more bugs.
How can we escape this mess?
Don’t write tests.
... generate them!
Instead of writing a decreasing number of tests per feature while complexity increases, let’s lean back for a while and think about our features. What do all of these test cases for the reverse/1 function have in common? They take a list and reverse it twice to get back to the original list. This is something we might state like this:
reverse_test() ->
    L = generate_random_list(),
    ?assert( reverse(reverse(L)) =:= L ).
If there were a function generate_random_list/0 that could give us a list of some random type and length (including the length of zero), we could use this function and let the test run a hundred or a thousand times, every time with a new, randomly generated list. Doing so will increase our confidence in the function under test with each passing generated test case.
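To make the idea concrete, here is a minimal sketch of such a hand-rolled loop. It rests on a few assumptions that are not part of the example above: the module name, the run/1 helper, the limits of the generator (only integers, at most 20 entries), and the use of lists:reverse/1 as a stand-in for the function under test.

```erlang
-module(random_reverse_test).
-export([generate_random_list/0, reverse/1, reverse_test/0, run/1]).

%% Hypothetical generator: a list of 0..20 random integers in 1..1000.
%% A real generator would also vary the element type.
generate_random_list() ->
    Length = rand:uniform(21) - 1,
    [rand:uniform(1000) || _ <- lists:seq(1, Length)].

%% Function under test; the standard library implementation as a stand-in.
reverse(L) -> lists:reverse(L).

%% One generated test case: true if the rule holds for this random list.
reverse_test() ->
    L = generate_random_list(),
    reverse(reverse(L)) =:= L.

%% Run N generated test cases and report whether all of them passed.
run(N) ->
    lists:all(fun(_) -> reverse_test() end, lists:seq(1, N)).
```

Calling random_reverse_test:run(1000) checks the rule against a thousand freshly generated lists.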
The very idea behind Property-Based Testing (PBT) is to generate test cases. A lot of them. PBT is not about testing. It’s about finding the right properties.
A property is always true in a given context. It is defined by a rule that dictates the behaviour of the code under test. That rule will always be the same, no matter the input. That rule, encoded in an executable test, is the property.
We have seen such a rule in the example above:
reverse(reverse(L)) =:= L %% for every list L
To get a property out of that rule we need to add context. That context is the definition of a generator that defines what that list L looks like. We need the magic generate_random_list/0 function!
Of course, we could write that generator function. We have to define what a list is, how it is built up and what the entries might look like. To get those entries we might write a couple of functions to generate random values of all possible types:
-export([
    %% generate lists with random type or with a type parameter
    generate_random_list/0,
    generate_random_list/1,
    %% entry generators
    generate_string/0,
    generate_number/0,
    generate_function/0,
    %% …
]).
We could then extend the generator functions to take some parameters, like maximum and minimum values or sizes, positive or negative numbers, etc.
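A rough sketch of what such parameterised generators might look like follows. The module name, the arities and the parameter names are assumptions made for illustration; they differ from the export list above, which only names the parameterless variants.

```erlang
-module(simple_gen).
-export([generate_number/2, generate_string/1, generate_random_list/2]).

%% Random integer between Min and Max (inclusive); assumes Min =< Max.
generate_number(Min, Max) ->
    Min + rand:uniform(Max - Min + 1) - 1.

%% Random string of lowercase letters with at most MaxLength characters.
generate_string(MaxLength) ->
    generate_random_list(fun() -> generate_number($a, $z) end, MaxLength).

%% Random list with 0..MaxLength entries, each produced by calling EntryGen.
generate_random_list(EntryGen, MaxLength) ->
    Length = generate_number(0, MaxLength),
    [EntryGen() || _ <- lists:seq(1, Length)].
```

For example, simple_gen:generate_random_list(fun() -> simple_gen:generate_number(-10, 10) end, 5) yields a list of up to five integers between -10 and 10. Even this small sketch hints at how much code a full set of generators would need.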
But doing all of that would be a lot of distracting work, leading away from the very purpose of all this testing effort: writing business software.
That’s where PBT frameworks come into play. I bet the language of your choice has one. Since my language of choice is Erlang, I will make use of PropEr. (You don’t have to know Erlang or PropEr to follow the concepts.)
A (good) PBT framework comes with a whole bunch of predefined generators, and so does PropEr. It provides, among others, a generator function called list/1 that takes a type as input. Giving it any/0 will produce lists of elements from every other predefined generator (atoms, binaries, booleans, floats, integers, lists, strings, tuples, …). Since list(any()) is a very common generator, it has a shortcut: list/0.
We are now able to write a test for all lists:
%% prop_ is a naming convention for property tests in PropEr
prop_reverse() ->
    %% FORALL is a macro that takes three arguments:
    %% An unbound variable, a generator to bind the variable
    %% with values for each iteration and a function body
    %% that is called with that bound variable.
    ?FORALL(L, list(),
        begin
            L =:= reverse(reverse(L))
        end).
PropEr will generate some test cases (100 by default) from that information and assert that the property returns true for each generated case.
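As a sketch of how the property might be run, assuming PropEr is on the code path: the module name and the run/0 helper are made up for this example, and lists:reverse/1 stands in for the function under test. The numtests option raises the number of generated cases above the default.

```erlang
-module(prop_reverse_example).
-include_lib("proper/include/proper.hrl").
-export([prop_reverse/0, run/0]).

%% Function under test; the standard library implementation as a stand-in.
reverse(L) -> lists:reverse(L).

%% The property from above: reversing twice gives back the original list.
prop_reverse() ->
    ?FORALL(L, list(), L =:= reverse(reverse(L))).

%% Check the property with 1000 generated cases instead of the default 100.
run() ->
    proper:quickcheck(prop_reverse(), [{numtests, 1000}]).
```

proper:quickcheck/2 returns true when every generated case passes; on a failure it reports the counterexample it found.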
Diving deeper into Generators and Shrinking (and having a real world example)
A generator does not just return sample data. It returns a complex data structure that holds all kinds of information, e.g. the given boundaries (minimum and maximum values, types of items in lists, etc.). It also holds shrinking information, that is, information on the structure of the generated data.
If an error occurs and the test therefore fails, the failing test case might be an edge case that is very hard to reproduce. An example taken from a presentation by John Hughes: he tested the SMS protocol for messages. After a couple of hundred tests, one failed. The sample data of that test case was <<92, 118, 65, 94, 88, 72, 100, 0>>. That is an 8-byte bitstring that doesn’t tell us much at this point. Maybe the eight bytes are the key, but we can’t know yet. The test framework shrank the failing case by applying the structure information held in the generated data structure to get to the simplest failing test case. That is <<0, 0, 0, 0, 0, 0, 0, 0>>.
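To give a feeling for what shrinking does, here is a deliberately naive sketch. It is not how PropEr implements shrinking. The module, the fails/1 stand-in (inputs whose length is a positive multiple of eight and that end in 0) and the use of byte lists instead of binaries are all assumptions made for illustration.

```erlang
-module(naive_shrink).
-export([fails/1, shrink/2]).

%% Stand-in for the failing check (an assumption for this sketch): an input
%% "fails" when its length is a positive multiple of eight and it ends in 0.
fails([]) -> false;
fails(Bytes) ->
    length(Bytes) rem 8 =:= 0 andalso lists:last(Bytes) =:= 0.

%% Repeatedly replace the failing input with a simpler one that still fails,
%% until no simpler failing candidate is left.
shrink(Fails, Bytes) ->
    case [C || C <- candidates(Bytes), Fails(C)] of
        [Simpler | _] -> shrink(Fails, Simpler);
        []            -> Bytes
    end.

%% Simpler candidates: drop one byte, or set one non-zero byte to 0.
candidates(Bytes) ->
    Indexes = lists:seq(1, length(Bytes)),
    Dropped = [drop(I, Bytes) || I <- Indexes],
    Zeroed  = [zero(I, Bytes) || I <- Indexes, lists:nth(I, Bytes) =/= 0],
    Dropped ++ Zeroed.

%% Remove the byte at position I.
drop(I, Bytes) ->
    {Front, [_ | Rest]} = lists:split(I - 1, Bytes),
    Front ++ Rest.

%% Replace the byte at position I with 0.
zero(I, Bytes) ->
    {Front, [_ | Rest]} = lists:split(I - 1, Bytes),
    Front ++ [0 | Rest].
```

Starting from [92, 118, 65, 94, 88, 72, 100, 0], dropping any byte makes the failure disappear, while zeroing bytes keeps it, so naive_shrink:shrink(fun naive_shrink:fails/1, [92, 118, 65, 94, 88, 72, 100, 0]) ends at eight zero bytes.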
John Hughes goes on and refines the property by modelling what he already knows into it: obviously, all messages of up to seven bytes pass and a message of eight bytes fails. I highly recommend watching the whole video (as I recommend every video of John Hughes, really). But to make things short: after a couple of thousand generated test cases he has figured out that all messages with a length of eight bytes or a multiple of eight bytes (16 bytes, 24 bytes, …) that end with a 0 break the system. It took him about ten minutes to get there.
Now, let’s pause for a moment and think about that use case. Try to imagine finding that bug with conventional testing.
The implementation of our business logic has more constraints than just doing the right thing. It should also do the right thing fast, with few resources, and efficiently in various other respects. When testing our program, we want to make sure that the fast, lean and efficient code is still doing the right thing. Traditionally, we therefore come up with example pairs of input and output or state changes.
In PBT, we model a property whose implementation is very simple and obviously does the right thing.
A function for finding the biggest element in a (potentially very large) list might be complex in its time- and resource-efficient implementation. But the corresponding property can be modelled far more simply.
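One way such a model could look is sketched below, assuming PropEr is available. The biggest/1 implementation and the module name are hypothetical. The model (sort the list and take its last element) is one obvious choice and not necessarily the one used in practice.

```erlang
-module(prop_biggest_example).
-include_lib("proper/include/proper.hrl").
-export([biggest/1, prop_biggest/0]).

%% Hypothetical "efficient" implementation: a single pass over the list.
biggest([Head | Tail]) -> biggest(Tail, Head).

biggest([], Max) -> Max;
biggest([Head | Tail], Max) when Head > Max -> biggest(Tail, Head);
biggest([_ | Tail], Max) -> biggest(Tail, Max).

%% The model is obviously correct but slower: sort and take the last element.
prop_biggest() ->
    ?FORALL(List, non_empty(list(integer())),
            biggest(List) =:= lists:last(lists:sort(List))).
```

The generator is restricted to non-empty lists because neither the model nor this implementation defines a biggest element for an empty list.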
Modelling is the process of thinking about the problem under test in a simple way: just getting it right while ignoring all other constraints.
The journey from here
I hope my mission of getting you interested in Property-Based Testing has succeeded and you are asking yourself how to take off from here. I have some resources:
The mentioned talks by John Hughes are a very good appetizer.
They can be followed by a deep dive into hard problem testing with Jessica Kerr (also follow her on Twitter!).
If you are interested in PBT in Erlang, read Property-Based Testing with PropEr, Erlang, and Elixir.
Fred Hebert also has a webinar online: Testing Erlang and Elixir through PropEr Modeling.
And then, you really should lean back and watch Joe Armstrong, just for the joy of it: The Mess We’re In