Software testing isn't quality assurance

Why hiring software developers to write great software tests (unit tests) is not the same as performing quality assurance.

Samuel Chan, Software Engineer

Good Software Testing

Edsger Wybe Dijkstra (pronounced “dike-struh”) was an influential computer scientist. He is perhaps best known for conceiving Dijkstra’s algorithm, which finds the shortest paths between nodes in a graph. It has wide-ranging applications, from logistics and road networks to the less physical: digital forensics, social network analysis (links between terrorists; government structures) and counter-fraud intelligence. In 1972, he became the first person who was neither American nor British to win the Turing Award.

 

Program testing shows the presence, but never the absence of bugs

— Edsger Wybe Dijkstra

On the topic of software testing, Dijkstra once quipped that program testing can be used to show the presence of bugs, but never to show their absence.

And such is the strength of this axiom that it presents something of a circular paradox in and of itself. Software engineering teams write good, exhaustive tests, yet the most valuable tests are the ones that reveal flaws in the software’s design. If the software engineers were able to write tests that reveal these bugs, wouldn’t they have had the same mental clarity to prevent those bugs from creeping in in the first place?

What the axiom posits is that software teams (data scientists, data engineers, software developers and so on) cannot prove the absence of bugs, much like you cannot prove the absence of black swans; you can merely falsify the presumption that none exist, as happened when the first black swan was spotted in Western Australia in 1697.

How to write good tests

A common pattern of software testing is to first check for failed tests, perform code changes that address these failures, and re-run the test suite. This routine implicitly assumes that the software developer is fully aware of all conditions that will trigger an unexpected behavior. It assumes that all such conditions are being tested.

Consider the following, adapted from TaskQuant, a command-line tool that I wrote:

from itertools import groupby

from tasklib import TaskWarrior


def score_accum(task_path, verbosity=False):
    """
    Create a scoreboard using the 'score' attribute of completed tasks
    """
    tw = TaskWarrior(data_location=task_path)
    completed = tw.tasks.completed()
    total_completed = len(completed)

    cl = list()

    for task in completed:
        # one row per completed task: (project, end date, effort, score, tags)
        cl.append(
            (
                task["project"],
                task["end"].date(),
                task["effort"] or "",
                task["score"] or 0,
                task["tags"],
            )
        )

    # sort cl by the end date (index 1 of each row)
    cl_sorted = sorted(cl, key=lambda x: x[1])

    # aggregate: total score per end date
    agg_date = [
        [k, sum(v[3] for v in g)] for k, g in groupby(cl_sorted, key=lambda x: x[1])
    ]

    agg_date_dict = dict(agg_date)

    startdate = agg_date[0][0]
    enddate = agg_date[-1][0]

The function above will rightfully and expectedly fail if any of the following conditions is met:

  1. any task in completed has fewer than 4 attributes; sum(v[3] for v in g) will fail with an IndexError: list index out of range
  2. any task in completed is missing a project, end or tags attribute; cl.append((..., task['tags'])) will raise a KeyError: 'tags'
  3. if the end attribute isn’t a datetime type, it will fail with an AttributeError: 'str' object has no attribute 'date'
  4. if the fourth attribute isn’t numeric, sum(v[3] for v in g) will fail with a TypeError: sum() can't sum strings [use ''.join(seq) instead]
  5. verbosity is not used in our function above, so it does not matter here. But task_path must point to a valid path, and if it doesn’t we should expect a NotADirectoryError
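
To make this concrete, here is a minimal sketch of what tests for a few of these conditions might look like, using pytest.raises. The import path and the fixtures (paths to small, deliberately malformed sample .task archives that you would define in a conftest.py) are hypothetical, purely for illustration:

import pytest

from taskquant.score.accum import score_accum  # illustrative import path


def test_invalid_task_path(invalid_task_path):
    # condition 5: task_path does not point to a valid directory
    with pytest.raises(NotADirectoryError):
        score_accum(task_path=invalid_task_path)


def test_missing_tags_attribute(archive_missing_tags):
    # condition 2: a completed task is missing its 'tags' attribute
    with pytest.raises(KeyError):
        score_accum(task_path=archive_missing_tags)


def test_non_numeric_score(archive_string_scores):
    # condition 4: a non-numeric 'score' makes the aggregation fail
    with pytest.raises(TypeError):
        score_accum(task_path=archive_string_scores)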

All of the above conditions will throw an exception and fail immediately. What’s less obvious are bugs that do not throw an exception (they fail quietly), or logical errors we’ve made that go uncaught. For example, consider:

cl_sorted = sorted(cl, key=lambda x: x[1])

We expect the above to sort the list by the second element (index 1), which is the end date. If we incorrectly passed index 2, it would have used effort as the sort key, the program would execute to the end and return an incorrect aggregation, and both the end user and we would be none the wiser.
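
A test that would catch this kind of silent logic error asserts on the property we actually care about, namely that the rows end up ordered by date, rather than merely re-running the happy path. Here is a rough, self-contained sketch using hand-built rows shaped like the tuples score_accum builds (in practice you would assert the same property on the function’s output):

from datetime import date


def test_rows_sorted_by_end_date():
    # rows shaped like the tuples score_accum builds:
    # (project, end date, effort, score, tags)
    rows = [
        ("home", date(2022, 3, 2), "L", 5, set()),
        ("work", date(2022, 1, 15), "S", 2, set()),
        ("work", date(2022, 2, 1), "M", 3, set()),
    ]

    rows_sorted = sorted(rows, key=lambda x: x[1])
    dates = [row[1] for row in rows_sorted]

    # the property we care about: end dates come out in non-decreasing
    # order; sorting by the wrong index (e.g. effort) fails this check
    assert dates == sorted(dates)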

Suppose you write tests to account for the cases we identified above, each with its own corresponding assertions. We supply a path to our sample task archive through task_path, run the test suite, and wait for the much coveted green dots in pytest (a green dot represents a passing test):

===================================== test session starts =====================================
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/samuel/writing-good-tests
plugins: xdist-1.34.0, hypothesis-6.27.3
collected 5 items                                                                                       

tests/application.py ..... 

====================================== 5 passed in 0.85s ======================================

Hurray! 🎉

We’ve tested all possible scenarios where our tool can break. We have, right?

Not so fast. We’ve merely verified that our program behaves as expected under those very narrowly defined circumstances. Just as you can’t present evidence of “no black swans”, only disprove the claim once somebody spots the world’s first black swan, an absence of evidence should not be taken as evidence of absence. And that is how I interpret the essence of Dijkstra’s position on software testing.

At a minimum, consider the following: are there any other conditions under which our tool may fail with an exception that our tests do not yet account for?

There is, of course, the fact that any code change we make to address a bug will introduce new scenarios, and we have to write tests for these scenarios as well. Then there is a second, more undesirable situation: we may also have overlooked conditions that lead to our program throwing an error we didn’t catch. One such scenario is this:

  • if total_completed is 0 (the user has no completed tasks), startdate = agg_date[0][0] will fail with an IndexError: list index out of range

This isn’t a scenario we accounted for in the original test suite. We have 5 passing tests out of 5, but none of them addresses this scenario. This isn’t the case where a user has passed an invalid path (NotADirectoryError); this is the case where a user has passed a valid location, with a legitimate source of data, but simply has not completed any tasks yet.

Having identified this, we should add a 6th unit test and make the following code change:

import warnings

def score_accum(task_path, verbosity=False):
    ...
    if total_completed < 1:
        return warnings.warn(
            "A curious case of 0 completed tasks. Check to make sure the path to Taskwarrior's .task is set correctly or try to complete some tasks in Taskwarrior!",
        )
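
The corresponding 6th test could then assert that this warning is actually emitted. A sketch, assuming a hypothetical empty_task_path fixture pointing to a valid but empty Taskwarrior data directory:

import pytest


def test_zero_completed_tasks_warns(empty_task_path):
    # a valid data location with no completed tasks should warn, not crash
    with pytest.warns(UserWarning, match="0 completed tasks"):
        score_accum(task_path=empty_task_path)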
 
Quality assurance is not provable

What all of this goes to show is that software developers can write tests to:

  • Check for the existence of bugs
  • Check for the absence of known bugs
  • Check that a code change has adequately addressed the bug under known conditions

And no amount of software testing can:

  • Prove the absence of known bugs
  • Prove the absence of unknown bugs
  • Prove that a code change has entirely eliminated the bug

So the goals of writing great software tests are quite distinct from the objectives of quality assurance.

Good software developers know what to test, and what not to. A developer doesn’t have to test that cl_sorted is a list before feeding it to groupby; we know it is a list from the upstream sorted() call. The following assertion in your test is hence unnecessary:

assert isinstance(cl_sorted, list)

If you’re interested in this topic, I have a video where I walk you through how I write unit tests for a smart contract. I go through the Solidity code line by line (familiarity with Solidity or Ethereum smart contracts is not required), and we write Python tests (using pytest) for 100% coverage, handling expected exceptions as we go. Here is the video:

If you want to see more on building and publishing TaskQuant:

If you need help with software engineering, or any data science work for your project, reach out and we’d love to explore ways of working together.

