Archive for the ‘Testing Practices’ Category
Code Coverage: Finding missing tests
[NOTE: This was written last week but due to a glitch did not get automatically posted before I left on a boat trip where I disconnected from the world. How refreshing…but more about that later.]
Well, we got a buy in the quarter final play-offs, and we won our semi-final game against the Sockeyes (2 – 0) last Tuesday. Unfortunately, while celebrating with my teammates I was literally knocked to the floor by a kidney stone in my bladder. After a trip to the hospital I am now taking a battery of pain-killers until I pass the stone. Unfortunately, this put me on the injured roster and knocked me out of the final game and until I get an OK from the doctor. Fortunately, I am not prone to kidney stones and this shall pass (literally). I have only had one kidney stone in the past and that was about 20 years ago. For those of you who have not experienced a kidney stone, trust me on this…be very, very glad and I hope you never experience this malady.
Last week I attempted to to illustrate how we might achieve high levels of code coverage (structural control flow) but potentially overlook critical tests, especially from a ‘black-box’ testing approach. The bottom line message…high code coverage does not necessarily equal good test coverage. In reality it is really unlikely to get 100% measured code coverage of an reasonably complex application under test. Unfortunately this often begs the question, “What is the right amount of code coverage for [my] application?” To which I have heard several leads/managers reply, “Our goal is 80% code coverage?” Really? C’mon…that’s just plain ROMA data. Setting arbitrary goals for code coverage is about as pointless as tits on a boar hog. The real answer is that we simply don’t know what measure of code coverage is the ideal level for any product.
Also, for those who have read this article will know that regardless of the testing approach at some point the effectiveness of our tests to hit untested areas of code diminishes. While more time and effort may increase data flow coverage and expose issues, it is unlikely to increase control flow (code) coverage. Remember, just because you exercise a line of code doesn’t mean you found all the bugs, but you have 0% probability of finding any bugs in any untested code.
Fortunately, the majority of customers will traverse the same code paths covered by many of our tests. Also, if your team/organization does robust unit testing then there is a good probability that unit tests at least provided some minimal level of code coverage. (NOTE: While I highly recommend that unit tests and coverage results should be transparent to the test team, I do not recommend using unit tests as part of the battery of tests designed by the test team and executed against the whole build to measure code coverage.) So, there are a couple of questions we have to ask ourselves.
“Does the untested code present significant risk to our customers?”
“Do we need to reduce exposure to risk in the untested areas of code?”
“What is the most efficient way to effectively evaluate the untested code?”
As I wrote last week, code coverage is not about the number. Code coverage is about analyzing the results and potentially designing additional functional tests, or at least being able to explain why areas of the code are untested. If we determine that it is important for our business to better understand the untested code, or improve overall confidence and reduce potential risk then we should use a tool to measure code coverage. But, again it is not about simply measuring code coverage and reporting some magical metric.
Code coverage analysis is the most efficient method to help testers evaluate untested code. Code coverage analysis basically involves the tester reviewing untested code reported by the code coverage tool and determining why some code was not exercised, and possibly design additional tests to exercise the previously untested code. (Remember I also wrote last week the future of professional testing is about analyzing information and designing tests…so, here we go!)
Missing tests
For several years I’ve used the triangle simulation to help set a ‘test effectiveness’ baseline for new testers who had never been formally trained in different test techniques, patterns, or approaches. After a few years of analyzing the results we found that there was about a 70 to 75% probability of tests exercising true branch of the first conditional expression in a compound predicate statement in a key method in the program. There was about a 20 to 25% chance of tests exercising the true branch of the second conditional expression, and there was less than a 10% probability of tests exercising the third conditional expression in the predicate statement. When I found these results I could hardly believe it, so I changed the third conditional expression to inject a bug and sure enough the results held true; in any class of 20 people on average only 1 or 2 people found the bug in the software.
From a black box approach let’s say our tests used the following values for sides A, B, and C respectively:
- 1, 2, 3 – an error message indicating the values would not produce a triangle
- 2, 1, 3 – an error message indicating the values would not produce a triangle
- 4, 5, 6 – scalene triangle
- 2, 1, 2 – isosceles triangle
- 5, 5, 5 – equilateral triangle
In this case our code coverage tool would report our coverage is less than 100%. As we drill down we see that the IsValidTriangle() method illustrated below is not completely covered. So, (assuming the arguments values passed to the parameters in this method are all validated to be greater than 0) we analyze the code below in our coverage tool and realize that we need a test to evaluate the third conditional expression to true (e.g. 1, 3, 2 for sides A, B, and C respectively).
1: internal bool IsValidTriangle(int sideA, int sideB, int sideC)
2: {
3: bool result = true;
4: if ((sideA + sideB <= sideC) || (sideB + sideC <= sideA) || (sideA + sideC <= sideB))
5: {
6: result = false;
7: }
8:
9: return result;
10: }
Code Coverage: Did we run the right tests?
It has been awhile since my last post, and I apologize for the few folks who follow my rants. Over the last few months I have been busy working on my boat (she is almost 30 years old now and needs some major refitting). Since starting to play hockey again I have been spending a lot of time at the gym and on the ice and trying to keep my body from getting too battered in the games (for I too am older now…let’s just say more than 30). I have also done a bit of soul searching on the past year or so reflecting on the high points and the low points as well.
So, the refitting on the boat is almost complete and she is ready for a nice long trip to the Gulf Islands starting next week. The Monarch’s Div 7A team finished the regular summer hockey season in first place and playoffs are this week. Winter season doesn’t start until October so I have a month and a half to recuperate. And, I have contemplated a few things to help nurture my professional and personal growth that I will proactively begin working on…beginning with a vacation to the Gulf Islands next week.
But, for now I want to talk more about code coverage. Not the number! Forget the measure! The code coverage percentage is simply a magic metric for pointy-haired managers and other thoughtless chowderheads who like to wave it around as if it actually means something. Look…as a manager I really don’t give a rat’s butt about what percentage of coverage was achieved by testing. If you tell me that your testing achieved 80% code coverage I will probably say, “That’s great! Now tell me about the 20% of the code that you didn’t test!” As a manager what I want to know is:
- Did we run the right tests? Did the test suites we ran give us the information that we need to make good business decisions?
- What is my potential exposure to risk in the areas of untested code and how do/can we reduce that risk?
I have said for years now that the future of professional testing is not simply about beating on a product and finding bugs. The future of testing lies in our ability to design effective tests and critically analyze results in order to provide better information to our primary customers (the decision makers). Code coverage is about analyzing the results and potentially designing additional functional tests, or at least being able to explain why areas of the code are untested.
Did we run the right tests?
The code coverage metric simply measures the number of statements in the code have been exercised by monitoring control flow through the code. Control flow through code is sequential until it hits some type of branch such as an if statement, a for loop, or an exception. For example, when we call the CalculateMonthlyMortgage method below control flows sequentially from the statement in line 6 to the statement in lines 7 and 8 and finally to the return statement in line 9.
1: public static double CalculateMonthlyMortgage(
2: double principle,
3: double annualInterest,
4: int months)
5: {
6: double monthlyInterest = (0.01 * annualInterest) / 12;
7: double monthlyPayment = principle * monthlyInterest /
8: (1 - Math.Pow((1 + monthlyInterest), (-1) * months));
9: return monthlyPayment;
10: }
So, by passing in a value of 250000 for the principle, 4.5 for the annual interest rate, and 360 for the months parameter the method would return an expected value of 1266.71327456471 and we would measure 100% coverage of this method. That’s a happy path test. Yes, it does what we think it is supposed to do. But, what happens when we pass in a value of 0 to the annualInterest parameter? Control flows through the method as described above giving us 100% code coverage, but the return value is NaN, or Not a Number. Similarly, if the value for the months parameter is 0 we get 100% code coverage and the return value is Infinity. Of course, negative values for any of the parameters would also produce undesirable output results.
This example illustrates sequential control flow through a method that contains simple statements. When branching conditions occur in software the control flow through the method becomes a bit more complicated as does the testing required. The following snippet counts the number of characters in a string. But, to prevent a null reference exception we use a predicate or conditional statement in line 4 that branches control flow from line 4 to line 12 if the input argument is null, and from line 4 to the loop structure starting in line 6 if the string is not null. If the input argument is an empty string control will flow from line 6 to line 10 bypassing the inner block that increments the count variable. But, if the input argument is a string of at least 1 character control flows from line 6 to line 8 and loops back to line 6 until all characters in the string are counted. (This previous post shows how you can create a model or control flow graph to map control flow through an algorithm.)
1: public static int CharacterCounter(string input)
2: {
3: int count = 0;
4: if (input != null)
5: {
6: foreach (char c in input)
7: {
8: count++;
9: }
10: }
11:
12: return count;
13: }
These are just 2 simple examples designed to illustrate how we can get high levels of code coverage yet still have bugs lurking in our software. Code coverage tools tell us whether we exercised code; it doesn’t tell us if we ran the right set of tests to expose potential issues.
Next week let’s continue this with some thoughts on analyzing code coverage results to help reduce risk.
Should we use boundary values in our combinatorial tests?
I have been a little busy at work lately designing 2 new advanced software testing courses. One of the courses is on combinatorial testing. The course focuses primarily on feature decomposition to identify input parameter interactions, modeling input variables, using the more advanced features of our PICT tool to customize the model file, how to generate a variety of subsets of combinatorial tests from a single model to increase test coverage using PICT, and how to design oracles for data-driven automated combinatorial tests.
In this particular course I used the Page Setup dialog in Paint as a feature to model in one of the exercises. And as it turns out, this was a good choice because as it turns out, it has made me rethink how to model input variables for use in combinatorial testing.
I generally don’t advocate hard-coding specific values for input parameters that have a linear range of values. The reason should be reasonably obvious; if we have a range of values from 1 to 100, and I hard-code the values of 1, 10, 50, and 75, 100 (for a positive test) then I have absolutely 0 probability of ever including the value of 42 in combination with other input parameters. To avoid hard-coding values I usually recommend creating equivalent partitions of appropriate input parameters (e.g. xsmall (1-10), small (11 – 25), medium (26-50), etc). Modeling a range of input values using equivalent partitions allows me to randomly select a value in each set, increases my probability of testing with values that I might not otherwise include in a hard-coded set, and adds some degree of variability of inputs for improved test coverage of all possible input values.
However, sometimes we might want to include specific values in the model file we use to generate combinatorial tests. These specific values might include boundary conditions or other values based on historical failure indicators for that feature. In the past I suggested that we don’t necessarily have to specify boundary values in our combinatorial tests. The reason for this suggestion is based on the idea that:
- many boundary issues are single mode faults (meaning the error occurs when 1 parameter is set at or immediately above or below its boundary condition
- testing for single mode errors is often easier and less costly as compared to combinatorial testing
- combinatorial testing might obfuscate the cause of a boundary bug
However, I am now convinced that
- some developers are so inept at unit testing and completely overlook boundary conditions (If you are a developer and only write “happy path” unit tests, please read Pragmatic Unit Testing by Hunt and Thomas, and Clean Code by Robert Martin)
- we find boundary bugs so late in the test cycle that someone determines they are too obscure to fix
- we have “trained” customers to avoid boundaries (due to the number of issues and resultant failures that often occur around boundaries) so we don’t care about them anymore either
- we don’t understand the fault model and therefore don’t now how to adequately identify boundary conditions and test for them
But boundary issues are still fun to find, and they always make for good examples in training or conference demos.
Anyway, on to the bug. While ‘checking’ the ranges of the margins on Paint’s Page Setup dialog for the exercise in this course I came across an interesting anomaly. When the margins were set to values that were grossly outside the allowable margin and I pressed the OK button I got an appropriate error message. But, when I changed the Scaling variable state from Fit to: to Adjust to: the Fit to: value changed to 0 although the textbox control was grayed out. I now realized the margin values are being used to auto calculate the Fit to: output values.
Margin values used to calculate Fit to: value
Since the boundary value for letter size paper with a portrait orientation is 8.5 inches, I decided to see what happens when I set the left margin to 8.501 and the right margin to 0 and then change from Fit to: to Adjust to: to check and see what happens. Interestingly enough, the Fit to: value now changed to 4,294,965,329. OK…now, I just overflowed a variable (the developer only allows the user to input a maximum of 2 characters (99) in the Fit to textboxes).
Surely, I am thinking that a page size boundary is a standard value, and surely someone tested this. But, I decided to check the specific boundary value just to see what happens anyway. So, I set the left margin to 8.5 and the right margin to 0, change the Scaling from Fit to: to Adjust to: and…
Game over!
There are many ways to expose this failure. Another fun way is to set the Scaling parameter to Adjust to: first. Next set the left margin to 8.5 and the right margin to 0 (assuming letter size paper with a portrait orientation), and click the OK button. Then, open the Page Setup dialog again and…game over!
Now, I really didn’t find this bug doing combinatorial testing. In fact, although combinatorial testing might ultimately reveal this problem (depending on the model of inputs provided to the tool that generates variable combinations), this bug was discovered during the data modeling process and discovering where calculations were occurring on certain variables. Once I saw an output boundary anomaly caused by other input variables I forced those input values to target the output boundary conditions of the output variable I wanted to further investigate. So, while we should use failure indicators and experience to specify important values in our combinatorial tests in conjunction with random values within the total population of possible values) I am still not thoroughly convinced that we should always include specific boundary values in our combinatorial test models because I suspect that even the process of modeling this feature for combinatorial testing would likely have exposed this issue.
But, in the end this is really just another example of a simple boundary bug that could have easily been found during unit testing.
Boundary Bugs…like shooting fish in a barrel
If there is a bug at a boundary that doesn’t lead to an unhandled exception or security exploit should we care?
Perhaps an even more important question is why do we find so many boundary type bugs via exploratory testing when they can and should be caught earlier? Why don’t we find these types of bugs in our unit testing? Why don’t we find these types of bugs by more systematically testing the software? Maybe we do find them, and those who make the decisions to fix these types of bugs just don’t care if they are fixed because there is no severe negative impact to the user. Maybe someone just wants to give me fodder for my blog!
This week I wanted to compare the range of allowable font sizes for a simulation program I developed as an example for a magazine article that I am working on. I knew that Office applications allow a font size within the range of 1 – 1638. I thought that range might be too large for my purposes, and since I knew that Windows Notepad included a font dialog I decided to check the allowable range of font sizes in Notepad.
The first thing I discovered was that the combobox control allows up to 5 characters! Really? Someone decided it is a good idea to allow users to enter 5 characters? ![]()
OK, I’ll play along. Maybe if I put in a size of 99999 and press the OK button on the dialog I will get an error message, or at least Notepad defaults to the last ‘valid’ selected font size. That might seem reasonable. But is that what happens? NO! Instead of doing something reasonable (e.g. error message, default font size) the font changes to a size of 1 (yes that is a font size 1 in the upper left corner in the image below).
I am sure that defaulting to a font size of 1 makes sense when the allowable size value overflows! Really…someone thought that was a good idea? Now I wanted to see what magical boundary value the developer decided was an acceptable font size. Since the combobox size property allowed 5 characters I immediately tried 65535. No, that also resulted in the overflow and displayed the text in a font size of 1. Next I tried 32767. Wait…32767 didn’t display the string in Notepad’s edit control at a font size of 1. Now, I am thinking the developer is using a data type of signed short for the font size variable. So, I enter 32768 expecting the value to overflow and display my string as a size 1 font again. But, no…that doesn’t happen.
Now, when I am design boundary tests I generally rely on 2 heuristics for identifying boundary values for input or output parameters.
- Values at the extreme edges of a physical range of values
- Values at the edges of equivalence partitions of physical values
So, in these situations I ask myself “What sort of demented developer debauchery have I now found myself?” I can’t think of any other obvious edge values that might apply, so out of curiosity I quickly narrow down the magical value to 39321. I then ask myself, “OK…even if there were a display capable of rendering or a printer capable of printing a font of this size, what is so unique about 39321?” In hexadecimal it is 0×9999, and in binary it is 1001100110011001. OK…nothing obviously special here, but I am certain the implementation details are much more complex then a simple range of values and at this point I really don’t care because this bug just doesn’t make sense.
Maybe it’s not supposed to make sense! Maybe nobody really cares about these types of bugs!
(BTW…somebody please take the Thesaurus away from the developer…’Oblique?’ Are you serious…why not just be consistent and use the word ‘Italic?’)
Code Coverage: More Than Just a Number
When I was growing up I would sometimes go down into my grandfather’s basement. He had amassed a variety of tools during his lifetime and he was an excellent wood craftsman. I wasn’t allowed to touch any of the power tools, because his rule was, “if you don’t know how to use a tool properly then you shouldn’t play with it.”
Of course, I am a bit of a hard head (even back then) and one day I started playing with the wood lathe while my grandfather was upstairs. Everything seemed to be going pretty well until I pushed the chisel in too far too fast and the wood split and went flying. One piece shattered the overhead light and the other piece ricocheted off the back of my hand leaving an nice gash. I shut off the machine and ran upstairs. After my grandmother cleaned and wrapped my hand, my grandfather made me go back downstairs and clean up the mess and stood over me with a stern look of disapproval making sure I wiped up my blood trail. After that incident, I heeded my grandfather’s advice, at least in his basement shop.
Anyway, with the recent discussions of code coverage around the testing blogosphere I started thinking about what was really being discussed. The discussions (as is the case with most discussions about code coverage) were not actually about the application code coverage as a tool, but more about the code coverage metric. And more specifically the discussions were about how not to assume a high measure of code coverage implies something is well tested. Interestingly enough, 2 years ago I wrote a post illustrating how the metric can be gamed and how the code coverage measure tells us nothing about quality or test effectiveness, but also alluded to how it might be used more effectively.
I thought that how the metric is sometimes misused is mostly self-evident, but then I realized that almost every time testers start talking about code coverage the discussion tends to focus on the metric. This may seem a bit harsh, but if a person’s only contribution to a conversation about code coverage is about how the metric doesn’t relate to quality or testing effectiveness then that person should not be allowed to play with hammers, and employing more complex tools such a wheel-barrows are well beyond that person’s comprehension.
Only thinking of code coverage as a means to get some magic number is akin to thinking “how many nails can I pound with this hammer. The metric itself is mostly irrelevant; and it is completely irrelevant if you don’t know how to interpret it in a way that helps you as a tester. Think about it this way; if we told our managers “our tests achieved 80% code coverage” some of our managers would be elated. (Of course IMHO, these types of managers are metric morons.) But, what do you think these same pointy headed number zombies would say if we told them “we ran our tests and we only missed testing 20% of the code.” I suspect they would start pacing back and forth in the room mumbling “We must run more tests, we must run more tests.”
When we stop thinking of code coverage as a simply measure where our only use of the tool is to try and achieve some magical number then perhaps we can start thinking about how to actually use code coverage as an effective tool to help us design tests (in under-tested or untested areas of the code), reduce potential risk, and possibly even drive quality upstream.
For example, one of my mentees is currently working on a project that uses just in time code coverage as a tool to evaluate how tests exercise changed code and downstream dependencies prior to checking code changes (e.g. bug fixes) back into the main tree. The initial pushback by some members of the team (including some pointy headed managers) was “code coverage doesn’t tell us about product quality” or “its too hard to achieve 80% code coverage” (although no such goal had been mentioned), and my personal favorite, “it’s too difficult to get everyone to measure coverage.” I reminded my mentee that the project is not about achieving some magic number, and in fact, it’s really not even about measuring at all. It’s about using the tool to discover information and to help us design additional functional tests at the API or component level that we might otherwise overlook to help prevent downstream regressions. In a nutshell, its about using code coverage as a defect prevention tool in this case.
Bottom line, code coverage is a tool! If you don’t know how to use it to improve your testing, well…
Boundary bug hunting; sometimes it’s almost too easy!
This past weekend I was working on a new test tool library for generating random email addresses; specifically the local address segment of an email address. I know, there are already a lot of email address generators available and this could be construed as reinventing the wheel. But I wanted to give my students in my test automation course at the University of Washington something to test at the API level. So why not have them test a test tool and learn a bit more about API level testing and how to use combinatorial analysis of the input property values to drive a data-driven automated test case. Also, having them test it means that I don’t have too!
Anyway, one of the tool’s properties is a character array of invalid characters for the specific email address system under test. Although the guidelines for email addresses are outlined in RFC 5322 and RFC 2821 many companies can place greater restrictions on the characters that are allowed for the local address component of an email address (the local address is the part before the ‘@’ character).
For example, Yahoo only allows a local address to be between 4 and 32 characters, the first character must be a letter, and only letters, numbers, underscores and only 1 period character. The Google mail local address is between 6 and 30 characters, and only allows letters, numbers, and (multiple) period characters. Hotmail and Live mail allow local address name lengths between 6 and 64 characters (64 is the maximum allowable size according to RFC 5322), and can only contain letters, numbers, periods, hyphens, and underscores.
Even from these few examples we can see a couple of things. First, although we are testing email addresses there is not a universal set of equivalent partitions that works in all contexts. We need to partition the test data into equivalent class subsets based on the specific domain we are testing. For example, the invalid class subset of characters for a Google local address includes the underscore character, but both Yahoo and Hotmail allow the underscore as a valid character in an email local address. (But, I will talk next week about the equivalent partitioning of this data…for now let’s get back to boundary testing!)
Back to my story – as I was exploring each email providers requirements in order to determine how to partition the data I discovered a interesting problem with Yahoo. Remember, the maximum length of the local address for a Yahoo account is 32 characters. ![]()
And, the textbox control property on the web page is set to only allow a maximum input of 32 characters to prevent the user from inputting more than 32 characters. Copying a string longer than 32 characters into that textbox simply truncates the string after the 32nd character.
But, when I bump up against the maximum allowable length with some test strings the underlying program that generates suggested alternative local address names will actually produce a local address of 35 characters in length!
Now, if the software message tells me I can’t do something (like have a local address name of more than 32 characters and then the software generates a local address name of 35 characters for me…well, I am the sort of fellow who will push that button!
And sure enough it looks like I can use it. But wait. Only one more button to push and…
What do you mean “Sorry, this appears to be an invalid Yahoo ID?” You generated an invalid local address for me! Why would Yahoo mail torment me so?
I am thinking in the developers mind the user story went sort of like;
User: “I would like this.”
System: “No you can’t have that, but you can have this.”
User: “OK”
System: “No, you can’t have that either.”
It’s funny this came up this week because I was talking with a group of senior SDETs about defect prevention versus defect detection and how 99.999% of boundary issues can be found at the unit level or API level of testing well before the UI is slapped onto the functional layer.
Testing the functional layer more thoroughly or a code review would most likely have revealed this ‘magic’ number was inconsistent. Or by forcing the algorithm that generates suggested local addresses to test boundary conditions would have much sooner exposed this problem.
Now I don’t know Yahoo’s development and testing practices, and unfortunately it’s not uncommon to overlook bugs similar to this. But, I suspect that if developer rely on testers to find all their bugs, and testers primarily rely on testing through the user interface to find bugs then we are always going to find boundary bugs post release (and that’s a good thing because it gives me something to blog about).
Evaluating Exploratory Testing
This month’s issue of Testing Experience published my article that summarizes the findings of several case studies of exploratory testing both inside and outside of Microsoft. Although some people consider me to be a harsh critic of exploratory testing nothing could be further from the truth. When I started my career as a professional tester my approach to software testing was primarily exploratory in nature. I was focused on executing as many negative tests I could possibly conceive of in search of the most heinous bugs I could find; and I was good at it. My criticism is not of exploratory testing as an approach; however, I do ‘question’ the claim that claim exploratory testing is “orders of magnitude more productive.” And, I am also critical of the argument that we don’t understand exploratory testing if we don’t conform to one notion of the concept (or buy into an ideological doctrine) because I don’t believe that there is only one ‘right’ way to perform or think about exploratory testing.
Of course, I know it is un-unpopular to question the claims of exploratory testing ‘experts,’ but I just happen to be one of those people who question things that are founded on anecdotal observations without any hard data to substantiate those claims. I certainly don’t have all the information, but I personally like to be able to back up my position with facts (known at the time) and several verifiable/repeatable data points so I can answer questions from a defendable position rather than trying to convince or cajole someone with my subjective opinion. (I know a lot of studies show that many Americans base their decisions on their emotional state at the time. But I learned a long time ago that you should never buy the boat you fall in love with because you will spend more time maintaining her than sailing her.) Also, it’s easier to persuade me that I might be wrong with solid, verifiable information and repeatable data versus emotional rhetoric or personal insults.
I think most people who promote exploratory testing are well intentioned and realize in conjunction with other testing approaches that exploratory testing adds value to any testing effort. I also think that many practitioners realize that while we must not only hone our intellectual capabilities of critical thinking and logical reasoning, we must also constantly build our knowledge and skills of the other approaches, methods, and techniques used in our professional trade.
At Microsoft, I can’t think of any testing group that does not use exploratory testing as part of its overall strategy. We have learned not to rely on exploratory testing as our primary approach because it simply doesn’t scale as project size and complexity increase, and it is easy for testers to focus too much on out of context issues in hopes of finding another bug. As one Principal Test Manager summarized, exploratory testing helps
- flush out “low hanging fruit” (identify obvious issues very quickly)
- provide welcomed context switching by getting folks to look at other areas of the product
- to seed new testing ideas or helps identify holes (which is great as long as we have a way to preserve those ideas and they are learnable by other testers)
But, of course, it was also noted that greater ‘system knowledge’ and an understanding of other various testing techniques and approaches enriched the overall effectiveness of the testers on the teams. My job as a teacher and mentor of software testing is to take really smart people who already know how to think critically about problems and provide them with the foundational knowledge of alternative techniques, methods, approaches, and the skills that are specific to the profession of software testing that will enable them to decide what approach to use depending on the context.
Similar to other testing approaches exploratory testing has benefits and limitations and is more effective in exposing certain categories of issues, and is less effective at exposing other types of problems. (See post on Pesticide Paradox.) And now we have researched case studies that begin to help us understand how to utilize exploratory testing as part of our overall testing strategy. Of course, further research could be done in this area, but it is very interesting that the independent studies used in the article reached similar findings and conclusions.
Anyway, I look forward to comments or feedback on the article.
Refactoring for Testability

Teaching my daughter about bullet seating depth.
One of my hobbies is shooting CMP matches and long range precision shooting. Besides lots of practice perfecting the techniques a big part of precision shooting depends on the ammunition and studying the ballistic patterns of various loads. All precision shooters custom load their ammunition and it is not as simple as simply reading a reloading manual. Slight variations of .001” of an inch in seating depth of a bullet or .1 grain of powder may determine whether the group of shots at a target 600 yards away is 1” MOA or 6” MOA. So, getting the ammunition to match the rifle requires continually analyzing your shots, making slight adjustments to the load, and repeating; in computer jargon we might call that refactoring. Reloading for precision is a continually optimizing process until we find the optimal load. Similarly, one of the things we do in the Engineering Excellence group at Microsoft is to continually analyze our internal processes and practices to see how we can help our business groups constantly improve and optimize towards their target. One of the big things on our plate these days is testability.
In Testing Object-Oriented Systems: Models, Patterns, and Tools, a book I consider one of the most important books on software testing practices, the author Robert Binder defines testability as “The relative ease or difficulty of producing and executing an economically feasible test suite to determine whether the [system under test ] SUT (i) conforms to stated requirements and specifications, and (ii) exhibits an acceptably low probability of failure.” This and several definitions of testability floating around on the web and all generally agree that testability generally involves
1.) The ease with which the SUT can be tested
2.) The cost of testing is reasonable
So, as the testability increases the ease with which our tests can determine whether the SUT satisfies implicit and explicit requirements and has a lower chance of failure at reduced testing costs. This all sounds nice, but unfortunately testability cannot be directly measured; testability is a qualitative measure. Although we can’t accurately measure testability we can sometimes do small things to improve the characteristics of testability and help reduce testing costs by reducing the number of tests required to determine whether the SUT satisfies the stated requirements and also has a low chance of failure, or finding ways to test more efficiently through better designs.
In last week’s post I referred a pseudo code example that was written to illustrate how bugs could linger in code despite a high measure of code coverage. Of course we should realize that pseudo code is generally a far cry from the real implementation of the code. Pseudo code is simply a model, and there are many ways to implement that model. The advantage of a model is that we can often test a model earlier to identify potential issues before a single line of code is written. In this particular pseudo code sample, there were a couple of things that stood out that could likely impact the testability of an implementation of the pseudo code model. So, the neurons in my brain starting firing with lots of testing related questions.
So, let’s use that example to discuss potential testability issues. The sample was based on a requirement that stated “Student ID’ are seven digit numbers between one million and 6 million inclusive.” The function is relatively simple in that it takes a string type passed to the sid parameter, and returns a Boolean true or false to the calling function depending on whether the string satisfies the internal Boolean conditions it is being compared against. But this function also calls 2 other functions; the length () function, and the number () function. From the function names I would think the length () function provides a numeric value that represents the number of characters in the string passed to the sid parameter. I am also betting the number () function returns a numeric value (it converts the string variable to a numeric type such as an integer. The pseudo code example was
function validate_studentid(string sid) return
TRUEFALSE
BEGIN
STATIC TRUEFALSE isOk;
isOk = true;if ((length(sid) is not 7) then
isOk = False;if (number(sid) <= 1000000 or number(sid) > 6000000 then
isOk = False;return isOk;
END
One of the reasons that we hire testers with a programming background at Microsoft is that they can help the developer identify potential issues, reduce the probability of failure, and improve testability by stepping through the code during peer reviews, or while designing additional tests to cover un-tested or under-tested areas of the code that are exposed by code coverage analysis. So, when I come across a code sample, I generally step through it to
- See if it will work as intended (basic unit test)
- See if there are any potential obvious errors in logic
- Identify tests necessary for branch or conditional coverage (because developers are usually only concerned with block coverage)
- Identify argument values for negative testing that might expose undesirable results (bugs)
So, in this pseudo code example, once I got to the second conditional clause (if (number (sid)) <= 1000000 or number (sid) > 6000000 then) the little cranks in my brain began to turn. I thought to myself, why are we checking the length of the string? I mean, if the number can only be between 1,000,000 and 6,000,000 then it seems to me that checking the length of the string is simply redundant.
If we remove the first conditional clause (if ((length(sid) is not 7) then) then we actually reduce the number of tests to 3 instead of 4 assuming short-circuiting since short-circuiting compound Boolean expressions is one of several code optimization techniques. (By the way, the first caveat example in Wikipedia on short-circuiting where a function used as a Boolean conditional also “performs some required operation regardless of whether the first conditional evaluates true or false” is simply poor architectural design and is very, very likely to be problematic.) The 3 tests for condition (and basis path) coverage to exercise the true and false outcome of every single Boolean conditional expression are listed in the table below.
| Conditional 1 | Conditional 2 | ||
| Test | number (sid) <= 1000000 | number (sid) > 6000000 | Expected Result |
| Any value between 1000000 and 6000000 | false | false | true |
| Any value > 6000000 | false | true | false |
| Any value < 1000000 | true | (short-circuited) | false |
Of course, even testing several samples from the equivalent partitions may not expose the bug in this code because the bug in this code is a typical boundary error. (In a previous post I explained the basic fault model that caused many boundary issues. In a nutshell, boundary bugs are generally caused by incorrect relational operators or magic numbers in code.) Without recognizing that we also need to test the boundaries (999999, 1000000, 1000001, and 5999999, 6000000, 6000001) also we could easily overlook the error in the pseudo code.
Another thing that caught my attention was the lack of exception handling. Some people may not consider including exception handling in pseudo code and take it as a given. But, as a tester when I don’t exception handling in pseudo code in a review then I need to start asking questions so I can better design tests to exercise the exception handling control flow paths that directly impact code coverage measures. Another reason this is an important consideration is because results of code coverage analysis indicates that exception handlers are generally under-tested. It seems we are really good at finding unhandled exceptions with our negative tests (which is really good), but we do not seem to be as thorough in testing the logical code paths of exception handlers. This is especially true for predicate statement with multiple Boolean sub-expressions might trigger an exception. We tend to test one of the conditionals, and the other conditionals expressions in that statement are often under-tested.
So, we can surmise the number () function must be converting the string parameter (the sid variable) to a numeric type and returning a type of number because the conditional clause is comparing it to magic numbers (1000000 and 6000000). But if we entered a string that contained non-numeric characters my initial thought was that the number () function would throw an exception that is unhandled by the validate_studentID () function.
Then I thought a bit more, and considered that the number () function might swallow the exception and return a 0 or even a -1. Now, there are some arguments in favor of swallowing exceptions, but in general it is not a good idea. In this case, it is probably a bad idea because one of the primary purposes of a separate function is reusability. If the number () is reused in some other code, or other part of the code where we need to convert a string to a numeric type regardless of the range (within the range of the data type being converted to), I would suspect we would want to throw an exception, and then rethrow the exception in the calling function. Of course, this is where the rubber hits the road, and a professional tester needs to dig in and start asking some hard questions as to how the developer is going to handle this situation. If the number () function is not going to be reused, then most modern programming languages include a function call that will easily convert the string to a numeric type and do it more efficiently as compared to calling a separate function. And may in that case we could swallow the exception in the validate_studentID () function and simply return false as illustrated in the C# code below.
1: try
2: {
3: if (int.Parse(sid) < minValue || int.Parse(sid) > maxValue)
4: {
5: isOk = false;
6: }
7: }
8: catch (FormatException)
9: {
10: isOk = false;
11: }
12: catch (OverflowException)
13: {
14: isOk = false;
15: }
With the push to drive quality upstream, reduce costs (especially testing costs), and improve testability I envision that many testers will be working alongside our development counterparts to help them prevent defects from getting into the product code base, and improve the maintainability of the code. This doesn’t mean that testers will become developers or visa versa; it simply means that testers are (generally) experts in designing tests, and developers are experts in designing solutions that adhere to requirements. Rather than an adversarial relationship, I suspect in the future developers and testers will have a more symbiotic relationship to improve the intrinsic quality of our code bases.
The bottom line of all this is that in teams where testers are designing white box tests for improved code coverage (control flow testing), or where testers are engaged in design reviews or peer reviews of code prior to check in, I hope this gives you some things to think about.
Reconsidering Code Coverage
Tonight on my way to teach a test automation course at the University of Washington I had some free time to catch up on my reading. My manager asked me if I had read this month’s copy of one of the several testing magazines we get and I replied that I had downloaded it but hadn’t had a chance to read it yet. So, he tossed me the hardcopy of the magazine and said, “Enjoy.” Now this should have been a clue because although Alan is a great manager and mentor, I think he secretly likes to see the veins in my neck swell and blood shoot out of my eyes from time to time.
I read a lot of articles, white papers, and books. I like most of what I read, even if I disagree with some of the points being made. I can’t remember ever reading an article on software testing that ever made me angry. I was not angry because of the message of the article. In fact, I think the point the authors are trying to make is valid and I agree with them on their fundamental point. Unfortunately, the article is filled with technical inaccuracies the end message was almost lost.
I spent the last 10 years studying various techniques, methods, and approaches in software testing. I teach more than 500 testers a year on structural testing techniques, and am now working with a team in the Windows division to implement a new tool for just in time code coverage analysis at the component level that allows us to see how our tests exercise code paths in changed code and the dependent modules. I also discuss structural testing in chapter 5 of our book How We Test Software At Microsoft. I don’t really consider myself to be an expert in the subject, but I might know a thing or two about it. So, let’s Reconsider Code Coverage!
In August 2007 I wrote an informative blog post on the potential misuse of the code coverage measure. But code coverage measures are used by some companies as one of many ways to help them reduce risk. And, let me be very clear here, there is no correlation between code coverage and quality, and code coverage measures don’t tell us “how well” the code was tested. The code coverage measure simply measures what code has been executed, and more importantly what code has not been executed. The value of measuring code coverage is not in producing some “magic number,” but that it helps testers investigate untested or under-tested areas of the product and design additional tests (generally using structural testing techniques) to improve coverage and reduce overall risk.
Just because you execute a line of code doesn’t mean a bug doesn’t still exist, but if you don’t execute a line of code you have 0 probability of finding a bug if one exists!
Also it is important to note there are several ways to measure code coverage. Different tools employ different measures and sometimes different tools measure the same type of coverage differently. Also, I discovered that even the same tool can measure the same code differently depending on how it is compiled (debug, retail, etc.) and previously wrote about my study. Some of the basic ways to measure code coverage (not test coverage) include:
- Function coverage measures the percentage of functions or methods in a class or application that are called at runtime.
- Statement coverage measures the percentage of executable statements exercised at runtime.
- Block coverage measures the percentage of each sequence of non-branching statements that are executed at runtime. Block coverage subsumes statement coverage.
- Decision or branch coverage measures the percentage of both Boolean (not binary) outcomes (true and false) of simple conditional expressions at runtime. If a predicate statement has more than one conditional sub-expression decision (or branch) coverage treats that predicate statement as one conditional clause. Decision coverage subsumes block coverage.
- Condition coverage measures the percentage of both Boolean outcomes of each conditional sub-expressions that are separated by logical and or logical or in compound predicate statements. Condition coverage subsumes decision coverage.
- Basis path coverage measures the number of linearly independent paths through a program. Basis path coverage is based on McCabe’s cyclomatic complexity research.
- Path coverage measures every possible path from the entry to the return statement (or exception) or exit of every method. Unfortunately path testing is usually impossible due to the sheer number of path combinations, and the inability to execute constrained path combinations.
Clearly there are different measures of code coverage, and certain types of measures subsume other measures. So, now that we have a handle on the different types of code coverage measures, let’s look at testing some code. We will use the same pseudo code used in the aforementioned article which is based upon the following requirement.
“Student ID’ are seven digit numbers between one million and 6 million inclusive.”
The authors provided the following pseudo code example for a function to meet this requirement.
function validate_studentid(string sid) return
TRUEFALSE
BEGIN
STATIC TRUEFALSE isOk;
isOk = true;if ((length(sid) is not 7) then
isOk = False;if (number(sid) <= 1000000 or number(sid) > 6000000 then
isOk = False;return isOk;
END
So, other than the fact that there is no reason to ‘test’ the length of the sid variable before evaluating it to see if it is within the allowable range (removing this first conditional improves performance and also improves testability of the code), and that if the call to the number() function fails to convert the string to a number for a valid Boolean comparison it will throw an unhandled exception, let’s look at path testing of this simple example by starting with control flow diagrams of each possible path (assuming the call to the number() function does not throw an unhandled exception by passing this message a string of characters such as “foo” rather than a string of digits).

Control flow diagram for validate_studentID() function pseudo-code
(Edited 11/25: After thinking about this a bit more, if the number() function returned a 0 (zero) if the input was incorrectly formatted, then the number() function would not throw an exception, and the control flow path would be identical to the first test in the table below).
Because we are doing path coverage testing and not decision testing, we actually have to separate each Boolean conditional sub-expression in the second compound predicate statement if (number(sid) <= 1000000 or number(sid) > 600000. The example in the article treated both sub-expressions in the compound predicate statement as a single Boolean expression which would be synonymous with decision coverage. Path coverage actually treats each sub-expression as if there were 2 single Boolean conditions such as
if (number(sid) <= 1000000
isOk = False;if number(sid) > 600000
isOk = False;
The table below illustrates the tests required for testing control flow through this function for path coverage (again assuming we are going to ignore the unhandled exception in the code that would occur by passing in a string such as “foo.”)
| Input (sid) | Conditional length(sid)!= 7 |
Conditional number <= 1mill |
Conditional number > 6mil |
Expected Result |
Actual Result |
| 999999 | true | true | false | False | False |
| 6500000 | false | false | true | False | False |
| 1000000 | false | true | false | True | False |
| 6000000 | false | false | false | True | True |
The first test would be a value less than 7 digits, and would cause all Boolean conditional expressions to evaluate as true which will set the isOk variable to false (3 times), and we correctly return the expected result of false (or invalid ID). The second test is a number greater than 6,000,000 (but less than the maximum value that would result in an unhandled overflow exception hopefully being thrown by the number() function). In this case the 3rd conditional expression (if (number(sid) > 6000000) would evaluate as true and the function would return false. The 3rd path is buggy. In this pseudo code example, the only possible way to exercise the true outcome of the Boolean condition if (number(sid) <= 1000000 is to use the value of 1,000,000; any other value larger or smaller will cause this Boolean condition to evaluate as false. In this case we expect the function to return true, but it in fact will return false. Finally, any number greater than 1000001 and less than or equal to 6000000 will return a true result indicating a valid student ID.
The article also suggest that structural testing misses other problems. But, when we look at these issues, they actually have nothing to do with structural testing of the function; in other words they are completely out of context of the problem being discussed.
For example, the assert is the requirement is incorrect and should have read 6,999,999 (which I believe is a typo and should be 5,999,999) because of confusion over the word “inclusive.” Inclusive means “including the stated limit or extremes in consideration or account,” but in computing inclusive means “the predicate holds for all elements of an increasing sequence then it holds for their least upper bound.” I disagree with this assumption because I suspect the analyst writing the spec is basing the inclusive range on the common definition, and not a definition based on domain theory.
The article questions what would occur with incorrectly formatted numbers such as 123 456 789 or 123,456,789. So, beside the point that these values are not within the valid range of student id numbers, the answer to the question would actually lie in how the number() function being called handles improperly formatted numbers (e.g throwing a format exception, which again is unhandled in our validate_studentid() function), or how an event handler that sits between the UI and the function might deal with invalid or incorrectly formatted inputs.
The next question concerned resizing of the input window or the screen (assuming desktop resolution) and repainting the window or form and its affect on code coverage of the validate_studentid() function. Well, I am going out on a limb here and I am going to say…”what are you talking about?” I am not quite sure how to phrase this, but let me try…resizing or repainting a window has 0 effect on the structural control flow of the validate_studentid() function. (Of course, I could be wrong, and the length() function number() function might have some code that mysteriously interacts with the repainting libraries and how it determines the length of a string or whether a string is a valid number.)
Bugs in external libraries are part of the business. Hopefully those external libraries are well tested or at least documented especially if our development team wrote them. Personally, I have not encountered any public functions or APIs which use wild ass random numbers such as 5.8 million as boundary values, but that’s not to say it couldn’t happen. And of course, if these external functions throw exceptions (as they should based on what they are probably doing), we should have exception handler code in our function to deal with any exceptions thrown from external libraries or function calls.
Based on incorrect path analysis, and out-of-context questions that have nothing to do with control flow through the validate_studentid() function the article suggests that path testing is not a magic potion, but I am not too sure that anyone actually believes it is. And so, the article suggests that “input combinatorics coverage” might work better. Hmm…now I have been teaching combinatorial testing for over 10 years and have read some interesting papers on the effectiveness of combinatorics on statistical testing and code coverage, and I must say I pretty sure you need more than one input parameter in combinatorial testing!
Finally, I don’t agree that code coverage measures tell us “how well the developers have tested their code.” The code coverage measure only tells us what percentage of the code has been executed in a particular way, and more importantly it tells us how what percentage of code has been untested. We must determine whether we need to investigate that area to reduce risk. Of course, many code coverage tools provide a “heat map” that helps us and developers identify untested code, and that is where we shift from the simple act of measuring coverage to the testing method of code coverage analysis in order to design new tests that effectively exercise previously untested code if that level of coverage is important to reduce overall risk.
My intent here is not to ridicule the authors of the article. In fact, I agree with their summation that testers should not believe high code coverage numbers mean “well tested.” (Again see my blog post from Aug 2007.) Unfortunately, the path to the point was fraught with inaccuracies and tangents that I almost never made it to the end.
There are many books and white papers on this subject in the ACM and IEEE libraries. Books by Boris Beizer, Robert Binder, and others go into great detail on structural testing. McCabe’s papers linked to in this post are an excellent resources.
OK…I feel better now. I need to clean up the blood, take a sedative, and go to sleep.
Adding Variability in Test Case Design
Published Tuesday, October 20, 2009 ![]()
I love autumn! Yes, I am definitely a boy of summer and very much prefer warmer weather; however, there is something special about autumn. This past weekend my daughter, and my 2 friends Dongyi and her husband Yuning and I participated in the Rum Run sailboat fun race with an overnight raft up at Bainbridge Island’s Port Madison. Saturday morning was quite rainy, but the wind was blowing 15 knots with gusts to 25 knots and NOAA weather radio announcing gale force warnings in Puget Sound. Wow…what a ride! But, it was actually the rather relaxing sail back to my marina on Sunday morning that rekindled the beauty of autumn in my mind. The bright reds, golden yellows, and pastel browns of the foliage seemed to blend into a collage framed by the darkness of the waters of Puget Sound and the snow covered peaks of the Olympic mountains. The beauty of autumn reminds me about change. A sloughing of the old, the cleansing brought about by the pure white snows, eventually followed by the new and fresh growth that blossoms in spring.
Just as the earth goes through variable cycles of rejuvenation, we must also continually update our tests, and (more importantly) the test data we use in our test cases to prevent them from becoming stale. Trees shed their leaves in the autumn and new leaves emerge in the spring, but the tree is fundamentally still the same tree. Similarly, a well-designed test case has a unique fundamental purpose and by changing the variables we can grow the value of that test case. Of course, the cycle of change in test data should be dramatically shorter in duration as compared to the seasonal changes of mother earth.
Here is a simple example of how a well-designed test case using variable test data can increase the value of the information each test iteration provides through increased confidence and also potentially reduce overall risk. In my role at Microsoft I am in a unique position to not only conduct controlled studies, but I can also implement ideas into practice on enterprise level software projects. One experiment I started about 2 years ago involved multiple groups of testers (sessions) located around the world divided into 3 separate control groups. Each control group tested the identical web page that would display the stock price if the user input a valid stock ticker symbol into a single textbox on the page and pressed the OK button. The only difference in the control groups was the instructions to perform single positive test case with the specific purpose of “ensure any valid stock ticker symbol displays the current stock price for the publicly traded stock specified by its symbol.” The purpose of the study was to determine if different cultural and experiential backgrounds impacted the test data used in a test based on the instructions for a test case. The study collected demographic information on the participants as well as specific inputs applied to the web page. Information on the oracle used by the students was collected anecdotally. Step one in each test was identical because we were not interested in how the tester launched the browser. (Of course this assumes there are other tests that test the multitude of ways to launch a browser and navigate to a URL. Also, if the browser failed to launch the test case is blocked.)
Group 1 was given the most vague instructions for the test case. The instruction was simply:
- Launch browser and navigate to [url address]
- Enter a valid stock ticker symbol and press the OK button and verify the accuracy of the returned stock price.
The instructions in the test case given to Group 2 were also somewhat vague, but provided a little guidance both on input and oracle.
- Launch browser and navigate to [url address]
- Enter a valid stock ticker symbol (e.g. “MSFT”)
- Press the OK button
- Verify the returned stock price is identical to the current stock price listed on the appropriate exchange
Group 3 had similar instructions to Group 2, but the group was given additional guidance as indicated below.
- Launch browser and navigate to [url address]
- Enter a valid stock ticker symbol from a publicly traded stock listed on any public stock exchange
Listings of valid stock ticker symbols are on stock exchange web sites such as:
http://www.nyse.com
http://www.eoddata.com/Symbols.aspx
http://www.nasdaq.com
http://www.londonstockexchange.com - Press the OK Button
- Verify the returned stock price is identical to the current stock price listed on the appropriate exchange
Results
The results were mostly not surprising, but rather reinforcing. For example, we expected Group 1 to be rather random, but mostly aligned with ticker symbols they were familiar with. Of course, the majority (90%) of stock ticker symbols entered was MSFT and there was no significant difference in cultural background, locale, experience or educational background. (As this study was conducted at Microsoft I am sure there was some bias as to the symbol entered.) What was most interesting was that testers with no formal training (no previous courses in testing, no CS degree, and read less than one discipline specific book) and with more than 2 years of test experience were approximately more likely (25%) to violate the purpose of the test and enter random or completely invalid data as their first action. In other words, instead of executing the required test their initial reaction was to immediately go on a bug hunt.
In group 2 99% of the participants simply entered the stock ticker symbol “MSFT.” But, what was even more surprising was the fact that one the next day, the same people in that group were given the same exact test, and 95% of them simply reentered MSFT. Perhaps this is laziness, perhaps this is related to the superficial nature of the study, or perhaps this is due to individuals taking the path of least resistance. The percentage of people who entered identical stock ticker symbols on consecutive days was not significantly different between group 1 and group 2.
It should be no surprise that group 3 had the greatest distribution of variable test data applied to the web page. Demographics had no impact on any of the people who were in group 3. The majority of people in group 3 (78%) would select the first stock exchange listed (regardless of what link it was) but there was no significant overlap in the selected stock ticker symbols. When asked to repeat the test on the next day 83% of the participants selected a different link and and a different symbol. Of those who selected the same link 97% selected a different stock ticker symbol. On the down side, approximately 4% of the people simply took the path of least resistance and input MSFT as the test data on both days of the experiment.
Conclusion
One of the most common problems I hear about ‘scripted,’ or pre-defined test cases is that they are too prescriptive and not flexible enough to allow the tester to try things. Of course, a well-designed test case is not simply a prescriptive set of steps inputting the same hard coded test data they run over and over. So, in this study we made the assumption that a scripted test case that specified “Enter MSFT in the textbox” would simply result in the tester entering “MSFT” without any thinking on the part of the tester. Hard-coding variable test data is often times the worse possible way to design a test case.
Vaguely written test cases added some level of variability, but also seemed to increase the probability of the tester executing context free tests outside the scope of the purpose of the test. In fact, what we found was some testers (approx 2%) simply went on a bug hunt and never actually input a valid stock ticker symbol at all during the session.
A test case that provided only one example that is representative of the type of test data required for the test case produced the least desirable results in this study. I am not sure this would be the case in practice. However, based on this study if I were to outsource execution of a test case similar to that used by group 2 the only thing I could guarantee is that MSFT would definitely be tested numerous times, and the variability of other test data would be extremely limited regardless of the number of testers executing that test or the number of iterations.
When faced with a virtually infinite number of possibilities for input variables as test data used in either positive or negative tests we need to test as many possibilities as possible given the available resources in order to increase test coverage and reduce overall risk. So, one way increase the coverage of test data while still achieving the specific purpose of the test case is to provide useful resources that help guide the tester while relying on the tester’s creative thinking skills and curiosity to expand the test coverage.
Of course, we can also increase variability of test data and capture the essence of the tester’s creativity using a similar approach in a well-designed automated test case as well. In fact, a similarly designed automated test case enables us to significantly increase the amount of variable test data that is exercised in order to expand test coverage and increase overall confidence.