Software testing isn't quality assurance
Why hiring software developers to write great software tests (unit tests) are not the same as performing quality assurance.
Samuel Chan, Software Engineer
Good Software Testing
Edsger Wybe Dijkstra is an influential computer scientist. He was perhaps most popularly known for his conception of the Dijkstra’s algorithm (pronounced “dike-struh”) which finds the shortest paths between nodes in a graph. It has wide-ranging applications, from areas like logistical and road networks to the less physical: digital forensics, social studies (links between terrorists; government structures) and counter-fraud intelligence. In 1972, he became the first person who was neither American nor British to win the Turing Award.
Program testing shows the presence, but never the absence of bugs
— Edsger Wybe Dijkstra
On the topic of software testing, Dijkstra once quipped that program testing can be used to show the presence of bugs, but never to show their absence.
And such is the strength of this axiom, that it presents somewhat of a cyclical paradox in and of itself. Software engineering teams write good, exhaustive tests but the most valuable tests should reveal flaws in the software design. If the software engineers were able to write tests that reveal these bugs, wouldn’t they have the same mental clarity to prevent the bugs from creeping in in the first place?
What the axiom posits, is that software teams (data scientists, data engineers, software developers etc) cannot prove the absence of bugs, much like you cannot prove the absence of black swan — you can merely falsify it (by spotting the first black swan, in 1697 Western Australia and hence falsifying that presumption).
How to write good tests
A common pattern of software testing is to first check for failed tests, perform code changes that address these failures, and re-run the test suite. This routine implicitly assumes that the software developer is fully aware of all conditions that will trigger an unexpected behavior. It assumes that all such conditions are being tested.
Consider the following, adapted from a command-line tool: TaskQuant that I wrote:
def score_accum(task_path, verbosity=False): """ Create a scoreboard using 'score' attribute of tasks """ tw = TaskWarrior(data_location=task_path) completed = tw.tasks.completed() total_completed = len(completed) cl = list() for task in completed: cl.append( ( task["project"], task["end"].date(), task["effort"] or "", task["score"] or 0, task["tags"], ) ) # sort cl by ["end"] date cl_sorted = sorted(cl, key=lambda x: x) agg_date = [ [k, sum(v for v in g)] for k, g in groupby(cl_sorted, key=lambda x: x) ] agg_date_dict = dict(agg_date) startdate = agg_date enddate = agg_date[-1]
The function above will rightfully and expectedly fail if any of the conditions were met:
completedhas less than 4 attributes; the
sum(v)will fail with an
IndexError: list index out of range
completedhas a mising
cl.append((task['tags']))will raise a
- if the
endattribute isn’t a
datetimetype, it will fail with an
AttributeError: 'str' object has no attribute date'
- if the fourth attribute isn’t a numeric,
sum(v for v in g)will fail with a
TypeError: sum() can't sum strings [use ''.join(seq) instead]
verbosityis not used in our function above, so it does not matter. But
task_pathmust point to a valid path and if it isn’t we should expect to raise a
All of the above conditions will throw an exception and fail immediately. What’s less obvious are bugs that do not throw an exception (fail quietly) or logical errors that we’ve made that were not caught. For example, consider:
cl_sorted = sorted(cl, key=lambda x: x)
We expect the above to sort the list based on the second key (index 1), which refers to
date. If we incorrectly passed in index 2, it would have used
effort as the key and the program would execute to the end returning an incorrect aggregation and both the end-user and us would be none the wiser.
Supposed you write tests to account for the cases we identified above, each with their own corresponding assertions. We supply a path to our sample task archive through the
task_path, run the test suite, and wait for the much coveted green dots in
pytest (a green dot represent a passing test):
===================================== test session starts ===================================== platform linux -- Python 3.8.10, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 rootdir: /home/samuel/writing-good-tests plugins: xdist-1.34.0, hypothesis-6.27.3 collected 5 items tests/application.py ..... ====================================== 5 passed in 0.85s ======================================
We tested all possible scenarios where our tool can possibly break. We have, right?
Not so quick. We’ve merely verified that our tests performed as expected under those very defined circumstances. Just like you can’t present evidence of “no black swans” — you merely disprove it by waiting for somebody to spot the world’s first black swan; As the saying goes, an absence of evidence should not be taken as evidence of absence. And that is how I interpret the essence of Dijkstra’s position on software testing.
At a minimum, consider the following. Is there any other conditions where our tool may have failed with an exception that isn’t expected by our test conditions yet?
There is of course, the fact that any code changes to make to address a bug will introduce new scenarios — and we have to write tests for these scenarios as well. Then there is the second, more undesirable situation that we may also overlooked conditions that lead to our programming throwing an error that we didn’t catch. One such scenario is this:
total_completedis 0 (the user has no completed tasks),
startdate = agg_datewill fail with an
IndexError: list index out of range
This isn’t a scenario that we’ve accounted for in the original test. We have 5 passed tests out of 5, but none of them addresses this scenario. This isn’t the case where a user has passed an invalid path (
NotADirectoryError), this is the case where a user has passed in a valid location, with a legitimate source of data, but the user simply has not completed any task yet.
Identifying this, we should then add a 6th unit test, and make the following code changes:
import warnings def score_accum(task_path, verbosity=False): ... if total_completed < 1: return warnings.warn( f"A curious case of 0 completed tasks. Check to make sure the path to Taskwarrior's .task is set correctly or try to complete some tasks in Taskwarrior!", )
Quality assurance is not provable
What all of this goes to show, is that software developers can write tests to:
- Check for the existence of bugs
- Check for the absence of known bugs
- Check that a code change has adequately address the bug based on known conditions
And no amount of software testing can:
- Prove the absence of known bugs
- Prove the absence of unknwon bugs
- Prove that a code change has entirely eliminated the bug
So — the goals of writing great software tests are quite distinctly different from the objectives of quality assurance.
Good software developers know what to test, and what not to. A developer didn’t have to test that
cl_sorted is a
list before feeding that through to the
groupby. We know it is a list from the upstream operation. The following line of you code in your test is hence unneccessary:
If you’re interested in this topic, I have a video where I walk you through how I write unit tests for a Smart Contract. I walk you though the Solidity code line by line (familiarity in Solidity or Ethereum Smart Contract not required), and we write python tests (using
pytest) for a 100% coverage, handling expected exceptions as we go. Here is the video:
If you want to see more on building and publishing TaskQuant:
- Build w/ Python 1: Command line productivity scoreboard for TaskWarrior
If you need help with software engineering, or any data science work for your project, reach out and we’d love to explore ways on working together.