Archive for October, 2010
In the previous post we discussed how hard-coding an extremely small subset of values out of a large population of all possible values for a given input parameter is rarely a good idea because it precludes any chance of testing with other values. By dividing a large population of possible input values into smaller subsets we can effectively increase our distribution across the range of possible values. And by randomly selecting values from each smaller subset each time a test specifies that subset in a combination we increase the probability of testing with a greater number of possible values.
But, another ‘cause’ of failure of combinatorial testing mentioned in the aforementioned paper was: “Similarly, a number of "not found" faults were 2-way faults that were not detected because a particular combination of data values had not been selected.” This is indeed a difficult problem and there is no definitive solution. But, I have found over the years that professional testers are those individuals who constantly search for alternative ways to solve a difficult problem rather than simply complain about it. While there is no way to guarantee that we select a “particular combination” of data values that might expose an issue, perhaps there is a way to test different combinations with different data values. Increasing variability is a great way to increase coverage and potentially trigger an error caused by unknown ‘particular combinations’ or values for a given variable.
In the last post we modified our model file to use abstract ranges of font sizes and we are using a random number generator to select a value within each specified range to get a better distribution of font sizes across the entire population of possible values. This way we help prevent issues associated with using only a small number of hard-coded values in our tests. So now our model file for the font dialog is similar to:
# Basic Model File for MyFontDialog
Font: Arial, Tahoma, BrushScript, MonotypeCorsive
Style: Bold, Italic, BoldItalic, None
Effects: Strike, Underline, StrikeUnderline, None
Colors: Black, White, Red, Green, Blue, Yellow
Size: small, smallH, nominal, nominalH, large, largeH, xLarge, xLargeH, xxLarge, xxLargeH
Now instead of the over 1,250,000 combinations (assuming we would test each font value), the total number of combinations for the font dialog based on our current model is 3840 (assuming all combinations are valid). Using this model the PICT tool will generate approximately 60 test combinations as the baseline set. But, the baseline set of tests generated by the tool is only about 1.5% of all combinations. Most studies indicate the baseline set of combinations provides pretty good defect detection effectiveness (DDE) and improved code coverage. But, the baseline set of tests may not include hidden or unknown ‘particular combinations’ that might be problematic.
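The 3840 figure follows directly from the sizes of the parameter sets in the model above (4 fonts × 4 styles × 4 effects × 6 colors × 10 size ranges); a quick sketch to verify the arithmetic:

```python
# Sizes of each parameter set from the basic model file above.
params = {
    "Font": 4,      # Arial, Tahoma, BrushScript, MonotypeCorsive
    "Style": 4,     # Bold, Italic, BoldItalic, None
    "Effects": 4,   # Strike, Underline, StrikeUnderline, None
    "Colors": 6,    # Black, White, Red, Green, Blue, Yellow
    "Size": 10,     # the ten abstract size ranges
}

total = 1
for count in params.values():
    total *= count

print(total)  # 3840 exhaustive combinations
```

A baseline set of roughly 60 pairwise tests is therefore only about 60 / 3840 ≈ 1.5% of the exhaustive population.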
If we were to visualize the total number of combinations as the large circle and the baseline set of tests as the smaller red circle we can see that the baseline only covers a relatively small portion of the total number of combinations. Obviously there are a lot of other combinations that are not being tested. If we know of particular combinations that should be tested we can easily include those in our set of n-way tests (and we will see how we can do that in a later post). But, if we don’t know what ‘particular combinations’ might cause an error we need to find a way to more effectively increase the number of combinations tested.
Unfortunately, most combinatorial tools will only generate a single baseline set of test combinations. Even if we randomize the values for any given input parameter we are most likely only to expose single-mode or one-way faults (errors that are caused by a single input parameter value regardless of the other input parameter values).
The question becomes: how can we effectively expand our test coverage to include different combinations beyond our original (or only) baseline set of combinations? If we select different ‘sets’ of combinations we are effectively testing different combinations from the population of all possible combinations.
Fortunately the PICT tool has the ability to generate random sets of combinations from a single model file. By passing a ‘/r:n’ switch argument on the command line when calling PICT, the tool will generate a different set of combinations. The outputs below illustrate 2 different sets of combinations generated by the tool from our basic model file.
The set of tests in the simpleout1.xls file is the baseline set of tests generated by the PICT tool. Most tools are only capable of generating a single baseline set of tests. However, using the PICT tool we passed the /r:42 switch as a command line argument. With a user-defined seed value the PICT tool generated a different baseline set of tests, illustrated in the simpleout2.xls file on the right.
Occasionally there may be duplicate combinations in the various sets of tests. But, if we generate a ‘new’ set by passing different seed values with the /r switch for each new build we can effectively increase our test coverage and the probability of testing (unknown) ‘particular combinations’ by changing the set of combinations generated by the tool.
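To illustrate the underlying idea of seeded regeneration (this sketch is not PICT's actual algorithm, which builds pairwise-covering sets rather than purely random samples), here is a Python simulation that draws a different, reproducible sample of combinations from the full Cartesian product for each seed:

```python
import itertools
import random

# Parameter values from the basic model file.
model = {
    "Font": ["Arial", "Tahoma", "BrushScript", "MonotypeCorsive"],
    "Style": ["Bold", "Italic", "BoldItalic", "None"],
    "Effects": ["Strike", "Underline", "StrikeUnderline", "None"],
    "Colors": ["Black", "White", "Red", "Green", "Blue", "Yellow"],
    "Size": ["small", "smallH", "nominal", "nominalH", "large",
             "largeH", "xLarge", "xLargeH", "xxLarge", "xxLargeH"],
}

def sample_combinations(seed, n=60):
    """Draw a reproducible sample of n combinations from all 3840."""
    all_combos = list(itertools.product(*model.values()))
    rng = random.Random(seed)   # the seed plays the role of PICT's /r:n
    return rng.sample(all_combos, n)

# Different seeds yield different sets; the same seed reproduces its set,
# which is what lets us regenerate and rerun a given set for a given build.
set_a = sample_combinations(42)
set_b = sample_combinations(7)
```

Changing the seed for each new build changes which slice of the 3840-combination population gets exercised, which is the coverage-expansion effect described above.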
How many tests do we need?
A common question that is often asked is, “given a large number of possible tests or values, how many tests or values do we need to test?” There is no simple answer to this question. But, we do know that testing with an extremely small subset of values (or combinations) may not provide us with high levels of confidence.
When dealing with a large population of possible combinatorial tests (or values in a variable), one way to improve confidence is to increase the sample, or the number of different combinations tested. In a previous post I discussed the concept of sampling from a testing point of view. Sampling is often used in scientific research and experimentation. For example, if we assert that all 1,250,000+ combinations should produce an equivalent output (in our case the appropriate changes to the glyphs in our edit control), then we can increase our confidence that the assertion holds true by increasing the number of samples (n-way test combinations).
We would only be 100% confident if we tested all possible values for all possible combinations. But, that may not be feasible in all cases, and the output of combinatorial test tools tend to optimize on a minimum or baseline set of tests. So, one approach to help us increase our confidence in test coverage is to use a statistical sample calculator to help us approximate the number of samples (or different combinations) that we should test to achieve a desired level of confidence. In our demo, we stated there are now 3840 possible combinations (assuming all combinations are valid). Given this population of possible tests the number of samples (different combinations) we would need for a statistical confidence of 99% with a 3% sampling error and a standard deviation of .5 is 1248.
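The 1248 figure can be reproduced with a standard sample-size calculation (Cochran's formula with a finite-population correction); a sketch, assuming z = 2.58 for 99% confidence, a 3% margin of error, and p = 0.5:

```python
def sample_size(population, z=2.58, margin=0.03, p=0.5):
    """Cochran's sample-size formula with finite-population correction."""
    # Required sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Correct for the finite population of possible combinations.
    return round(n0 / (1 + (n0 - 1) / population))

print(sample_size(3840))  # 1248 combinations for 99% confidence, 3% error
```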
(NOTE: In statistical sampling, the smaller the total population, the greater the proportion of that population we must sample. Automating your combinatorial test for a non-trivial feature is a best practice and can help us effectively expand our coverage of n-way combinations.)
Of course, even if we tested 1248 different combinations there is still no guarantee that we will test the ‘particular combinations’ that might trigger an unexpected anomaly. But, no other approach to testing can guarantee we test with these unknown ‘particular combinations’ either. However, systematically increasing the sample size of any given population of possible combinations (or values) is likely to increase our chances of exposing an anomaly not detected by a static output from a basic combinatorial testing tool, or at least increase our overall confidence.
Testing essentially helps provide confidence and reduce risk because we can’t test everything!
In the next post we will discuss how to effectively deal with invalid variable combinations.
This week I am in Seoul, Korea attending the ASTA Seoul International Software Testing Conference (SSTA 2010). It has been several years since I have been to Korea, so being invited to give the opening keynote at this conference was a real honor and an opportunity that I couldn’t pass up. This is a relatively small conference of about 175 people; the attendees were Korean and all the presentations were translated in real time. The speakers, however, came from around the world. I was mostly impressed with the representation from large companies such as LG and Samsung. Since these testers worked mostly on devices a lot of their tests were without a GUI, so they really understood the idea of moving quality upstream, defect prevention, and the importance of low level automation, or automation below the user interface. I was also impressed with their passion for testing, but also their concern over the maturity of the discipline beyond bug finding expeditions, and career growth as a test engineer without having to become a manager.
This week I will continue the saga of posts on combinatorial testing. Sometimes it still surprises me how some people discount this technique, or simply assume that they can come up with a ‘better’ set of tests by randomly selecting ‘interesting’ combinations. I suspect some of the skepticism is a result of misapplication of the technique and/or tools, or based on white papers such as the widely distributed paper Pairwise Testing: A Best Practice That Isn’t. The title is certainly provocative, but unfortunately it is also very misleading. In fact, compared to other approaches to this problem, in the correct context there is a lot of empirical evidence to suggest that pairwise or combinatorial testing is in fact the best approach when used by a competent tester using a powerful toolset.
However, this paper does a good job of pointing out several ways this technique is commonly misapplied, and also illustrates limitations of some tools. Unfortunately the paper fails to offer any other demonstrable solution to the combinatorial testing problem or to the perceived limitations of this technique. So, let’s take a look at the limitations or misapplications of this technique highlighted in the paper and propose potential solutions to help overcome those limitations and more effectively use this technique in the proper context.
(BTW…the “random selection” mentioned in the paper is not a person randomly selecting input combinations; rather, the study used a computer algorithm to randomly select a set of combinations from all possible combinations. The actual study concluded “In this study we found no significant difference in the FDE of n-way and random combinatorial test suites. …the result is not unexpected.” The study is interesting from an academic perspective, but adds little value in the ‘real-world.’)
Selecting the ‘right values.’
The first misapplication of this technique identified in the paper is actually a non sequitur. “Pairwise testing fails when you don’t select the right values to test with.” Now perhaps this is self-evident, but any experienced tester can tell you that you can replace the first 2 words of this sentence with any other testing approach and experience the same results (e.g. “exploratory testing fails when you don’t select the right values to test with”). Also, the inverse of this statement is not true. We can’t just arbitrarily say “pairwise testing succeeds when you select the right values to test with.” As I said previously, combinatorial testing is but one potential solution to a very complex problem; it is not a silver bullet in all situations.
But, this conclusion identified a common mistake in modeling the input variables based on a common misuse of the technique of equivalence (or domain) partitioning. Equivalence partitioning is also a modeling technique: it groups similar elements into sets based on a set of heuristics explained in the renowned book The Art of Software Testing by Glenford Myers. However, this technique is often misused by amateurs who
- fail to adequately identify special or unique values in any given set, and
- simply assume that we can take 1 or 2 elements of any large set and conclude they are representative of the entire set that we identified
The technique of equivalence partitioning depends largely on our ability to adequately model test data into the appropriate sets for the given context. Then we also must realize that our sets are a model, or an assertion of how that data might be handled by the application. When dealing with a large set of input values it is foolhardy to randomly select a limited set of values and hard-code those values into our tests for 2 reasons:
- the number of values selected from a large population of possible values is usually a lot less than required to gain any degree of confidence, and
- we eliminate the possibility of testing with any other values in that population
Increasing the number of possible values
If we artificially constrain our test values to “an extremely small subset of the actual number of possible values” then of course we limit the potential effectiveness of these techniques. So, how can we increase the number of values from a large population of possible values for any given input parameter?
For example, in the case of the simple font dialog used in the example we have a large set (3273) of possible font size values (1 – 1638 and half-sizes from 1.5 – 1637.5). Now certainly we don’t want to test every possible value! But, testing only a small subset of values might not provide the confidence necessary to support our assertion that all values in this range are valid and would result in the expected output state. So, perhaps we can sub-divide our single large set into several smaller abstract subsets, and then randomly select values from each defined subset. Let’s see how this plays out…
For our Font Size input parameter instead of hard-coding values such as:
FontSize: 1, 8, 10, 12, 42, 72, 100, 256, 1024, 1638, 1.5, 1637.5, 11.5
What if we created abstract ranges such as:
FontSize: Small, SmallHalf, Nominal, NominalHalf, Large, LargeHalf, XLarge, XLargeHalf
And then we gave more concrete definitions to our abstract ranges such as:
- Small = 1 – 9
- SmallHalf = 1.5 – 9.5
- Nominal = 10 – 18
- NominalHalf = 10.5 – 18.5
- Large = 19 – 72
- LargeHalf = 19.5 – 72.5
- XLarge = 73 – 1638
- XLargeHalf = 73.5 – 1637.5
Now our output provides an abstract range and the tester (or automated test) has greater creativity over the value selected for that particular input parameter. In our demo there are 6 combination tests (out of 43) that require a nominal size value. Now, in your head randomly pick 6 numbers from 10 through 18; those are the values you use the first time you run this test. Next, ask someone else to randomly select 6 numbers from 10 through 18; those are the values you use in your test on the next build, or as appropriate. As you can see, the smaller the number of values in any given subset the greater the probability of testing with the ‘right’ values in that subset. But, more subsets require more tests (a non-issue if your combinatorial tests are automated using a data-driven automation approach). Conversely, the larger the set of values in any given subset, the smaller the proportion of its values that will ever be tested. If a particular subset seems too large, then you can always sub-divide it into smaller subsets as well to get a better distribution of values from all possible values.
Now in our automated test script we have a method that sets the appropriate ranges based on the abstract value for the Font Size, and another method to randomly select a number from the range we specify similar to the 2 methods below:
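The two methods referenced above appeared as code listings in the original post; here is a minimal Python sketch of equivalent helpers (the names and structure are mine, not the original methods, and the bounds come from the abstract range definitions above):

```python
import random

# Concrete bounds for each abstract font-size range defined above.
FONT_SIZE_RANGES = {
    "small":    (1, 9),      "smallH":   (1.5, 9.5),
    "nominal":  (10, 18),    "nominalH": (10.5, 18.5),
    "large":    (19, 72),    "largeH":   (19.5, 72.5),
    "xLarge":   (73, 1638),  "xLargeH":  (73.5, 1637.5),
}

def get_size_range(abstract_size):
    """Map an abstract range name from the model to its concrete bounds."""
    return FONT_SIZE_RANGES[abstract_size]

def random_font_size(abstract_size):
    """Randomly select a concrete value from the specified range."""
    low, high = get_size_range(abstract_size)
    if isinstance(low, float):
        # Half-size ranges: pick a whole number, then add the half point.
        return random.randint(int(low), int(high)) + 0.5
    return random.randint(low, high)
```

Each run of the automated test then exercises a freshly drawn concrete value, rather than the same hard-coded size build after build.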
Of course, if there were particular values in the font size range that we explicitly wanted to test then we could also easily specify them in our model of input values as well. But, hard-coding only a small subset of values from a large number (population) of possible values is rarely a good idea in any testing approach.
So essentially, to increase your data coverage and your probability of testing with the ‘right’ values (especially when you don’t know what the ‘right’ values are), one possible solution is to create smaller subsets and randomly select values from each subset each time the combinatorial test case is executed.
Also, you are now probably starting to understand that the limitation of any technique or approach is not necessarily the fault of that technique or approach, but due to its misuse or misapplication by novice or untrained testers. And the more we understand about the ‘system’ we are testing, the greater the effectiveness of the technique or approach when used in the appropriate context.
Later this week, I will discuss another potential solution to the other part of this problem…in a large set of possible values how do we increase the probability of testing with the ‘right’ values when we don’t know what the right values are.
Last week I started discussing combinatorial testing. Justin Hunter added some great comments and a link to his blog with several great posts on this subject as well. He also asked me to elaborate on the training I designed inside of Microsoft to teach our testers as well as what I discuss in my workshops at conferences. Rather than try to write one post that covers 8 hours of lab-based content I will attempt to capture the essence of the training in a series of posts, beginning with how to get started.
First, let me say that combinatorial testing isn’t a silver bullet. Similar to other techniques and approaches used in software testing it is susceptible to Beizer’s Pesticide Paradox. In other words, it is effective in finding certain categories of bugs when applied smartly in the correct context, but it is not effective in finding all categories of issues.
Combinatorial testing is most useful when testing complex configuration scenarios and/or situations where there are multiple input parameters with numerous variables per parameter that have some change effect on a single output condition or state. For example, if you need to test your application on multiple versions of Windows, and different browser versions, and different protocols and connection speeds, then this testing technique will help define a baseline set of test environment configurations. Or, if you are testing an API that has several parameters with multiple argument values that can be passed to those API parameters, then this testing technique will also help testers establish a baseline set of tests. Of course, this can also be applied to input controls on a graphical user interface (GUI) that affect a common output state or condition, such as how changes to the settings on a font dialog change the properties of the glyphs in an edit control.
It’s all about modeling
The key to combinatorial testing lies in the tester's ability to identify the interdependent input parameters that act on the single output condition being tested, and to create an abstract description of the input variables or parameter behavior in a model file that is used by a tool to produce a baseline set of combinatorial tests. In other words, the effectiveness of the baseline set of combinatorial tests is primarily based on:
- A tester’s ability to identify the appropriate input parameters or configuration parameters
- A tester’s ability to adequately identify the appropriate variables for each input parameter or configuration settings
- A tester’s ability to describe the parameters and the variables or settings in a model of the feature being tested
- The limitations of the tool used to produce a baseline set of combinatorial tests
Models are abstract representations of ideas or real objects. Creating models is not easy and takes a lot of creativity and critical thinking. For example, let’s refer back to the font dialog that I use in my demos. Then let’s look at the two checkbox controls for Bold and Italic. The simplest way to model these two inputs is to list each input parameter and the 2 check states for each checkbox.
Bold: check, uncheck
Italic: check, uncheck
This is the example I used in the article because it is easy for a novice to understand quickly. However, another way to represent these 2 inputs is to conceptualize not what they are doing individually, but what happens to the output state or condition (the properties of the glyphs in an edit control). In this example, when either or both checkboxes are checked or unchecked the style of the displayed glyphs changes. So, another way to model the style inputs is:
Style: bold, italic, bold/italic, regular
Both models of these input parameters accomplish the same thing. I personally prefer the second abstract model of styles because developing the rules for mutually exclusive variables (I will discuss that in a few weeks) is a bit easier, and it is also easier to modify the model if the types of styles that can be applied increases in the future.
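The equivalence of the two models can be checked mechanically; a small sketch (the mapping function is mine, written to match the behavior described above) comparing the checkbox model with the abstract style model:

```python
import itertools

# Model 1: two independent checkboxes.
bold_states = ["check", "uncheck"]
italic_states = ["check", "uncheck"]

# Model 2: one abstract style parameter.
styles = ["bold", "italic", "bold/italic", "regular"]

def to_style(bold, italic):
    """Map a checkbox combination to the glyph style it produces."""
    if bold == "check" and italic == "check":
        return "bold/italic"
    if bold == "check":
        return "bold"
    if italic == "check":
        return "italic"
    return "regular"

# The 4 checkbox combinations map onto exactly the 4 abstract styles.
combos = list(itertools.product(bold_states, italic_states))
mapped = {to_style(b, i) for b, i in combos}
print(sorted(mapped) == sorted(styles))  # True
```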
Honestly, I don’t think I can teach someone to model something via a blog, or a book. In a class I can demonstrate different ways to model something and then provide feedback as people work through problems. But, modeling is a skill that takes practice. And one thing I have learned in my experience is that the less we know about the feature being modeled, the greater the probability of a less than adequate outcome in our testing.
George Box said, “All models are wrong; some models are useful.” In this situation it is clearly our ability as testers to create a model of configuration or input parameters that is useful and provides value to us; otherwise it is simply wrong. It is not necessarily this technique that is broken; it is more likely that our limited understanding of what is being tested or misapplication of the technique that causes us to produce a less than adequate model. In other words, if your model is wrong, the tool output is certainly going to be wrong!
Fall is now upon us here in Seattle. I really like fall; the bright vibrant colors of the changing leaves, the crisp morning and evening air, leaves starting to blanket the lawn, and harvesting the crops from my garden. Of course along with the harvest comes the work of canning the veggies, mixing up new batches of jam and conserves. But, it is fun work and it fills the house with delicious aromas that remind me of my boyhood and helping my mom in the kitchen canning the bounty of crops from our garden. Now I try to pass the tradition onto my daughter and we have fun trying different combinations of berries when we make our jam. My favorite is still just plain huckleberry, but our strawberry/blackberry mix is also darn good…probably because we pick the berries right from our backyard. But, not all combinations of berries work well in a recipe. The flavor of some berries overpowers or masks the flavor of other berries. Similarly, in software testing not all combinations of input variables that directly impact a single common output work well together, and some combinations result in a bug.
Pair-wise testing, or more generally combinatorial testing, is a functional test technique intended to help testers more effectively expose issues caused by the interaction of 2 or more input variables that directly affect a common output state or condition. In simple situations a tester can often pick out various combinations of inputs based on likely customer settings, historical failure indicators (combinations of inputs that have been problematic in the past), and intuition. However, as the number of input parameters and the number of variables that can be applied to those interdependent input parameters increase in more complex features, the potential number of combinations is overwhelming. Of course we still want to focus on common customer inputs, failure indicators and intuition. But, does guessing at other various combinations of inputs really provide us with sufficient confidence in our test coverage? Or, would a more systematic approach to testing combinations of input variables be more effective and more efficient?
There is a lot of empirical research both in academia and industry to suggest that the answer to this last question is…yes! Over the past few years there has been quite a lot written on the topic of pair-wise testing, and I and a few other people have presented at conferences on the topic. In fact, I recently published an article in Better Software magazine on the topic, and also gave a presentation at the recent VistaCon 2010 software testing conference in Vietnam. I have also posted my slides and the demo files (including the source code for a sample data-driven automated test) from the article and the conference presentation on my website.
In the coming weeks I intend to provide more information and tips to help testers think about how to model input parameters and variables for use in a tool to generate a subset of combinatorial tests, and how to overcome some of the limitations and misuses of this best practice. Until then, if you have specific questions or comments please let me know.