Tonight on my way to teach a test automation course at the University of Washington I had some free time to catch up on my reading. My manager asked me if I had read this month’s copy of one of the several testing magazines we get and I replied that I had downloaded it but hadn’t had a chance to read it yet. So, he tossed me the hardcopy of the magazine and said, “Enjoy.” Now this should have been a clue because although Alan is a great manager and mentor, I think he secretly likes to see the veins in my neck swell and blood shoot out of my eyes from time to time.
I read a lot of articles, white papers, and books. I like most of what I read, even when I disagree with some of the points being made. In fact, I can’t remember an article on software testing ever making me angry before tonight. To be clear, I was not angry because of the message of the article; the fundamental point the authors are trying to make is valid, and I agree with them on it. Unfortunately, the article is filled with so many technical inaccuracies that the end message was almost lost.
I have spent the last 10 years studying various techniques, methods, and approaches in software testing. I teach structural testing techniques to more than 500 testers a year, and I am now working with a team in the Windows division to implement a new tool for just-in-time code coverage analysis at the component level that lets us see how our tests exercise code paths in changed code and its dependent modules. I also discuss structural testing in chapter 5 of our book How We Test Software at Microsoft. I don’t really consider myself to be an expert in the subject, but I might know a thing or two about it. So, let’s Reconsider Code Coverage!
In August 2007 I wrote a blog post on the potential misuse of the code coverage measure. Code coverage measures are used by some companies as one of many ways to help them reduce risk. But let me be very clear here: there is no direct correlation between code coverage and quality, and code coverage measures don’t tell us “how well” the code was tested. A code coverage measure simply reports what code has been executed and, more importantly, what code has not been executed. The value of measuring code coverage is not in producing some “magic number,” but in helping testers investigate untested or under-tested areas of the product and design additional tests (generally using structural testing techniques) to improve coverage and reduce overall risk.
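To make “measures what code has been executed” concrete, here is a toy sketch of my own (not from the article, and nothing like a production tool) of what a statement-level coverage tool records, using Python’s sys.settrace hook to note which lines of a small validation function actually ran:

```python
import dis
import sys

def validate(sid):
    ok = True
    if len(sid) != 7:
        ok = False                     # never runs for a 7-digit input
    if int(sid) <= 1_000_000 or int(sid) > 6_000_000:
        ok = False                     # never runs for an in-range input
    return ok

executed = set()

def tracer(frame, event, arg):
    # Record the line number of every "line" event raised inside validate().
    if event == "line" and frame.f_code is validate.__code__:
        executed.add(frame.f_lineno)
    return tracer

sys.settrace(tracer)
validate("3000000")                    # a single "happy path" test
sys.settrace(None)

# Compare the executed lines against every executable line in the function.
first = validate.__code__.co_firstlineno
all_lines = {ln for _, ln in dis.findlinestarts(validate.__code__)
             if ln is not None and ln > first}
missed = sorted(all_lines - executed)  # the two "ok = False" lines
print(f"executed {len(executed)} of {len(all_lines)} lines; missed: {missed}")
```

Running it with only the single happy-path test reports the two error-handling lines as unexecuted, which is exactly the kind of investigation lead a coverage tool gives a tester.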
Just because you execute a line of code doesn’t mean a bug doesn’t still exist in it, but if you never execute a line of code you have zero probability of finding a bug if one exists!
It is also important to note that there are several ways to measure code coverage. Different tools employ different measures, and sometimes different tools measure the same type of coverage differently. I also discovered that even the same tool can measure the same code differently depending on how it is compiled (debug, retail, etc.), and I previously wrote about that study. Some of the basic ways to measure code coverage (not test coverage) include:
- Function coverage measures the percentage of functions or methods in a class or application that are called at runtime.
- Statement coverage measures the percentage of executable statements exercised at runtime.
- Block coverage measures the percentage of each sequence of non-branching statements that are executed at runtime. Block coverage subsumes statement coverage.
- Decision or branch coverage measures the percentage of Boolean outcomes (both true and false) of simple conditional expressions exercised at runtime. If a predicate statement has more than one conditional sub-expression, decision (or branch) coverage treats that predicate statement as a single conditional clause. Decision coverage subsumes block coverage.
- Condition coverage measures the percentage of Boolean outcomes (both true and false) of each conditional sub-expression separated by a logical AND or logical OR in compound predicate statements. (Strictly speaking, condition coverage alone does not guarantee decision coverage; it is the combined condition/decision criterion that subsumes decision coverage.)
- Basis path coverage measures the number of linearly independent paths through a program. Basis path coverage is based on McCabe’s cyclomatic complexity research.
- Path coverage measures every possible path from the entry point to each return statement (or exception exit) of every method. Unfortunately, exhaustive path testing is usually impossible due to the sheer number of path combinations and the presence of infeasible paths that no input can execute.
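To make the distinctions between these measures concrete, consider a small illustrative function of my own (not from the article) with one decision built from two sub-conditions:

```python
def grant_discount(age: int, is_member: bool) -> bool:
    # One decision composed of two sub-conditions joined by a logical OR.
    if age >= 65 or is_member:
        return True
    return False

# Statement (and block) coverage: these two tests together execute every
# statement, since one takes the "if" body and the other takes the final return.
assert grant_discount(70, False) is True
assert grant_discount(30, False) is False

# Decision coverage: those same two tests also exercise both outcomes (true
# and false) of the decision as a whole.

# Condition coverage: each sub-condition must take both outcomes. The pair
# below nominally achieves that (age >= 65: true then false; is_member:
# false then true), yet the decision as a whole is true in both tests, so
# the false outcome of the decision is never exercised. This is why plain
# condition coverage does not guarantee decision coverage. (A further wrinkle:
# Python's short-circuit evaluation means is_member is not even evaluated
# when age >= 65 is true, something real condition-coverage tools must
# account for.)
assert grant_discount(70, False) is True
assert grant_discount(30, True) is True
```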
Clearly there are different measures of code coverage, and certain types of measures subsume other measures. So, now that we have a handle on the different types of code coverage measures, let’s look at testing some code. We will use the same pseudo code used in the aforementioned article which is based upon the following requirement.
“Student IDs are seven digit numbers between one million and 6 million inclusive.”
The authors provided the following pseudo code example for a function to meet this requirement.
```
function validate_studentid(string sid) returns TRUEFALSE
    STATIC TRUEFALSE isOk;
    isOk = true;
    if (length(sid) is not 7) then
        isOk = false;
    if (number(sid) <= 1000000 or number(sid) > 6000000) then
        isOk = false;
    return isOk;
```
So, let’s note two problems up front: there is no reason to ‘test’ the length of the sid variable before evaluating whether it is within the allowable range (removing this first conditional improves both the performance and the testability of the code), and if the call to the number() function fails to convert the string to a number for a valid Boolean comparison it will throw an unhandled exception. With those set aside, let’s look at path testing of this simple example by starting with control flow diagrams of each possible path (assuming the call to the number() function does not throw an unhandled exception when the function is passed a string of characters such as “foo” rather than a string of digits).
(Edited 11/25: After thinking about this a bit more, if the number() function returned a 0 (zero) if the input was incorrectly formatted, then the number() function would not throw an exception, and the control flow path would be identical to the first test in the table below).
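For what it’s worth, here is one way (my own sketch in Python, not the article’s code) to address both critiques: drop the redundant length test and handle the conversion failure instead of letting the exception escape:

```python
def validate_studentid(sid: str) -> bool:
    """Valid student IDs are 1,000,000 through 6,000,000 inclusive."""
    try:
        value = int(sid)   # int() raises ValueError for input such as "foo"
    except ValueError:
        return False       # treat an unconvertible string as an invalid ID
    # The inclusive range check makes a separate length test redundant and
    # accepts both boundary values, avoiding the off-by-one (<= 1000000)
    # bug in the pseudo code above.
    return 1_000_000 <= value <= 6_000_000
```

This accepts “1000000” and “6000000”, and rejects out-of-range or malformed strings without crashing.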
Because we are doing path coverage testing and not decision testing, we actually have to separate each Boolean conditional sub-expression in the second compound predicate statement, if (number(sid) <= 1000000 or number(sid) > 6000000). The example in the article treated both sub-expressions in the compound predicate statement as a single Boolean expression, which would be synonymous with decision coverage. Path coverage treats each sub-expression as if there were two single Boolean conditions, such as
```
if (number(sid) <= 1000000) then
    isOk = false;
if (number(sid) > 6000000) then
    isOk = false;
```
The table below illustrates the tests required for testing control flow through this function for path coverage (again assuming we are going to ignore the unhandled exception in the code that would occur by passing in a string such as “foo.”)
| Test | Input | length(sid) ≠ 7 | number ≤ 1,000,000 | number > 6,000,000 | Returns | Expected |
|------|-------|-----------------|--------------------|--------------------|---------|----------|
| 1 | fewer than 7 digits (e.g., “123”) | true | true | false | false | false |
| 2 | above the range (e.g., “7000000”) | false | false | true | false | false |
| 3 | “1000000” | false | true | false | false | true |
| 4 | in range (e.g., “3000000”) | false | false | false | true | true |
The first test would be a value with fewer than 7 digits, which causes the first two Boolean conditional expressions to evaluate as true, setting the isOk variable to false (twice), and we correctly get the expected result of false (an invalid ID). The second test is a number greater than 6,000,000 (but less than the maximum value at which the number() function would hopefully throw an overflow exception). In this case the third conditional expression (if (number(sid) > 6000000)) evaluates as true and the function returns false. The third path is buggy. In this pseudo code example, the only way to exercise the true outcome of the Boolean condition if (number(sid) <= 1000000) while the length check passes is to use the value 1,000,000; any other seven digit value will cause this Boolean condition to evaluate as false. In this case we expect the function to return true, but it will in fact return false. Finally, any number greater than 1,000,000 and less than or equal to 6,000,000 will return a true result indicating a valid student ID.
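The path analysis above can be checked mechanically. Here is a faithful Python transcription of the article’s pseudo code (bugs and all, with Python’s int() standing in for number()) exercised with the four path tests:

```python
def validate_studentid(sid: str) -> bool:
    # Faithful transcription of the article's pseudo code, including its bugs.
    is_ok = True
    if len(sid) != 7:
        is_ok = False
    if int(sid) <= 1_000_000 or int(sid) > 6_000_000:
        is_ok = False
    return is_ok

assert validate_studentid("123") is False      # test 1: fewer than 7 digits
assert validate_studentid("7000000") is False  # test 2: above the valid range
assert validate_studentid("1000000") is False  # test 3: the bug; a valid ID is rejected
assert validate_studentid("3000000") is True   # test 4: an in-range ID is accepted
```

The third assertion is the interesting one: 1,000,000 is a valid student ID per the requirement, yet the function rejects it because of the <= comparison.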
The article also suggests that structural testing misses other problems. But when we look at these issues, they actually have nothing to do with structural testing of the function; in other words, they are completely out of the context of the problem being discussed.
For example, the first assertion is that the requirement is incorrect and should have read 6,999,999 (which I believe is a typo and should be 5,999,999) because of confusion over the word “inclusive.” Inclusive commonly means “including the stated limit or extremes in consideration or account,” whereas in computing, the authors claim, inclusive means “the predicate holds for all elements of an increasing sequence then it holds for their least upper bound.” I disagree with this assumption because I suspect the analyst writing the spec based the inclusive range on the common definition, not on a definition drawn from domain theory.
The article questions what would occur with incorrectly formatted numbers such as 123 456 789 or 123,456,789. Besides the point that these values are not within the valid range of student ID numbers, the answer actually lies in how the number() function being called handles improperly formatted numbers (e.g., throwing a format exception, which again is unhandled in our validate_studentid() function), or in how an event handler that sits between the UI and the function might deal with invalid or incorrectly formatted inputs.
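As an aside, Python’s int() (standing in here for the hypothetical number() function) rejects exactly these formats, which illustrates why the answer hinges on the conversion routine rather than on validate_studentid() itself:

```python
def converts(s: str) -> bool:
    """Return True if s converts cleanly to an integer, False otherwise."""
    try:
        int(s)
        return True
    except ValueError:
        return False

assert converts("1234567") is True       # a plain run of digits converts fine
assert converts("123 456 789") is False  # embedded spaces raise ValueError
assert converts("123,456,789") is False  # thousands separators raise ValueError
```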
The next question concerned resizing the input window or the screen (assuming desktop resolution), repainting the window or form, and its effect on code coverage of the validate_studentid() function. Well, I am going out on a limb here and I am going to say…“what are you talking about?” I am not quite sure how to phrase this, but let me try…resizing or repainting a window has zero effect on the structural control flow of the validate_studentid() function. (Of course, I could be wrong, and the length() or number() function might have some code that mysteriously interacts with the repainting libraries and with how it determines the length of a string or whether a string is a valid number.)
Bugs in external libraries are part of the business. Hopefully those external libraries are well tested, or at least documented, especially if our development team wrote them. Personally, I have not encountered any public functions or APIs that use wild-ass random numbers such as 5.8 million as boundary values, but that’s not to say it couldn’t happen. And of course, if these external functions throw exceptions (as they should, based on what they are probably doing), we should have exception handling code in our function to deal with any exceptions thrown from external libraries or function calls.
Based on incorrect path analysis, and on out-of-context questions that have nothing to do with control flow through the validate_studentid() function, the article suggests that path testing is not a magic potion, but I am not sure anyone actually believes it is. And so the article suggests that “input combinatorics coverage” might work better. Hmm…I have been teaching combinatorial testing for over 10 years and have read some interesting papers on the effectiveness of combinatorics in statistical testing and code coverage, and I must say I am pretty sure you need more than one input parameter for combinatorial testing!
Finally, I don’t agree that code coverage measures tell us “how well the developers have tested their code.” The code coverage measure only tells us what percentage of the code has been executed in a particular way and, more importantly, what percentage of the code has not been executed. We must then determine whether we need to investigate that area to reduce risk. Many code coverage tools provide a “heat map” that helps testers and developers identify untested code, and that is where we shift from the simple act of measuring coverage to the testing method of code coverage analysis, in which we design new tests that effectively exercise previously untested code if that level of coverage is important to reducing overall risk.
My intent here is not to ridicule the authors of the article. In fact, I agree with their summation that testers should not believe high code coverage numbers mean “well tested.” (Again, see my blog post from August 2007.) Unfortunately, the path to that point was so fraught with inaccuracies and tangents that I almost never made it to the end.
There are many books and white papers on this subject in the ACM and IEEE libraries. Books by Boris Beizer, Robert Binder, and others go into great detail on structural testing. The McCabe papers linked to in this post are also excellent resources.
OK…I feel better now. I need to clean up the blood, take a sedative, and go to sleep.