Archive for the ‘Sampling’ tag
Combinatorial Testing: Selecting the ‘right’ values (Part 2)
In the previous post we discussed how hard-coding an extremely small subset of values out of a large population of all possible values for a given input parameter is rarely a good idea because it precludes any chance of testing with other values. By dividing a large population of possible input values into smaller subsets we can effectively increase our distribution across the range of possible values. And by randomly selecting values from each smaller subset each time a test specifies that subset in a combination we increase the probability of testing with a greater number of possible values.
But another ‘cause’ of failure of combinatorial testing mentioned in the aforementioned paper was this: “Similarly, a number of "not found" faults were 2-way faults that were not detected because a particular combination of data values had not been selected.” This is indeed a difficult problem and there is no definitive solution. But, I have found over the years that professional testers are those individuals who constantly search for alternative ways to solve a difficult problem rather than simply complain about it. While there is no way to guarantee that we select a “particular combination” of data values that might expose an issue, perhaps there is a way to test different combinations with different data values. Increasing variability is a great way to increase coverage and potentially trigger an error caused by unknown “particular combinations” or values for a given variable.
In the last post we modified our model file to use abstract ranges of font sizes and we are using a random number generator to select a value within each specified range to get a better distribution of font sizes across the entire population of possible values. This way we help prevent issues associated with using only a small number of hard-coded values in our tests. So now our model file for the font dialog is similar to:
# Basic Model File for MyFontDialog
Font: Arial, Tahoma, BrushScript, MonotypeCorsive
Style: Bold, Italic, BoldItalic, None
Effects: Strike, Underline, StrikeUnderline, None
Colors: Black, White, Red, Green, Blue, Yellow
Size: small, smallH, nominal, nominalH, large, largeH, xLarge, xLargeH, xxLarge, xxLargeH
Now instead of the over 1,250,000 combinations (assuming we would test each font value), the total number of combinations for the font dialog based on our current model is 3840 (4 × 4 × 4 × 6 × 10, assuming all combinations are valid). Using this model the PICT tool will generate approximately 60 test combinations as the baseline set. But, the baseline set of tests generated by the tool is only about 1.5% of all combinations. Most studies indicate that the baseline set of combinations scores well on defect detection effectiveness (DDE) and improves code coverage. But, the baseline set of tests may not include hidden or unknown ‘particular combinations’ that might be problematic.
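The arithmetic behind those numbers is easy to check. The following is a minimal Python sketch, not part of PICT: the helper names `parse_model` and `total_combinations` are made up for illustration, and the parser only handles the simple "Name: a, b, c" lines used in the model above (no constraints, sub-models, or weights).

```python
from math import prod

# The model file from the post, embedded as a string for illustration.
MODEL = """\
# Basic Model File for MyFontDialog
Font: Arial, Tahoma, BrushScript, MonotypeCorsive
Style: Bold, Italic, BoldItalic, None
Effects: Strike, Underline, StrikeUnderline, None
Colors: Black, White, Red, Green, Blue, Yellow
Size: small, smallH, nominal, nominalH, large, largeH, xLarge, xLargeH, xxLarge, xxLargeH
"""

def parse_model(text):
    """Return {parameter: [values]} for simple 'Name: a, b, c' lines."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, values = line.partition(":")
        params[name.strip()] = [v.strip() for v in values.split(",")]
    return params

def total_combinations(params):
    """Exhaustive combination count: the product of the value counts."""
    return prod(len(values) for values in params.values())

params = parse_model(MODEL)
total = total_combinations(params)
baseline = 60  # approximate size of the baseline set quoted in the post

print(total)             # 3840
print(baseline / total)  # fraction of all combinations in the baseline set
```

Adding or removing a value in any parameter changes the total multiplicatively, which is why the exhaustive count grows so quickly while the baseline set stays small.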
If we were to visualize the total number of combinations as the large circle and the baseline set of tests as the smaller red circle we can see that the baseline only covers a relatively small portion of the total number of combinations. Obviously there are a lot of other combinations that are not being tested. If we know of particular combinations that should be tested we can easily include those in our set of n-way tests (and we will see how we can do that in a later post). But, if we don’t know what ‘particular combinations’ might cause an error we need to find a way to more effectively increase the number of combinations tested.
Unfortunately, most combinatorial tools will only generate a single baseline set of test combinations. Even if we randomize the values for any given input parameter we are most likely only to expose single mode or one-way faults (errors that are caused by a single input parameter value regardless of the other input parameter values).
The question becomes how we can effectively expand our test coverage to include different combinations beyond our original (or only) baseline set of combinations. If we select different ‘sets’ of combinations we are effectively testing different combinations in the population of all possible combinations.
Fortunately the PICT tool has the ability to generate random sets of combinations from a single model file. By passing a ‘/r:n’ switch argument (where n is a seed value) in the command line to call PICT the tool will generate a different set of combinations. The outputs below illustrate 2 different sets of combinations generated by the tool from our basic model file.
The set of tests in the simpleout1.xls file is the baseline set of tests generated by the PICT tool. Most tools are only capable of generating a single baseline set of tests. However, using the PICT tool we passed the /r:42 switch as a command line argument. With a user-defined seed value the PICT tool generated a different baseline set of tests, illustrated in the simpleout2.xls file on the right.
Occasionally there may be duplicate combinations in the various sets of tests. But, if we generate a ‘new’ set by passing different seed values with the /r switch for each new build we can effectively increase our test coverage and the probability of testing (unknown) ‘particular combinations’ by changing the set of combinations generated by the tool.
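One way to automate this is sketched below in Python. It is only a sketch under several assumptions: the `pict` executable is on the PATH, the model file name is whatever you pass in, the build number is reused as the seed (my choice for illustration, not something PICT requires), and the output is the usual tab-separated text with a header row of parameter names. All function names are hypothetical.

```python
import subprocess

def pict_command(model_path, seed):
    """Build the command line for one randomized PICT run (/r:N sets the seed)."""
    return ["pict", model_path, f"/r:{seed}"]

def parse_pict_output(text):
    """Split tab-separated output into a header and a set of test rows."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    rows = {tuple(line.split("\t")) for line in lines[1:]}
    return header, rows

def generate_set(model_path, seed):
    """Run PICT once and return the combinations it generated."""
    result = subprocess.run(pict_command(model_path, seed),
                            capture_output=True, text=True, check=True)
    return parse_pict_output(result.stdout)[1]

def cumulative_coverage(sets):
    """Union of every combination seen so far; duplicates across sets collapse."""
    seen = set()
    for rows in sets:
        seen |= rows
    return seen
```

Calling `generate_set("fontdialog.txt", build_number)` for each build and feeding the results to `cumulative_coverage` shows how many distinct combinations have actually been exercised over time, with any duplicates between sets counted only once.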
How many tests do we need?
A common question that is often asked is, “given a large number of possible tests or values, how many tests or values do we need to test?” There is no simple answer to this question. But, we do know that testing with an extremely small subset of values (or combinations) may not provide us with high levels of confidence.
When dealing with a large population of possible combinatorial tests (or values in a variable), one way to improve confidence is to increase the sample, or the number of different combinations tested. In a previous post I discussed the concept of sampling from a testing point of view. Sampling is often used in scientific research and experimentation. For example, if we assert that all 1,250,000+ combinations should produce an equivalent output (in our case the appropriate changes to the glyphs in our edit control), then we can increase our confidence that assertion holds true by increasing the number of samples (n-way test combinations).
We would only be 100% confident if we tested all possible values for all possible combinations. But, that may not be feasible in all cases, and the output of combinatorial test tools tends to optimize on a minimum or baseline set of tests. So, one approach to help us increase our confidence in test coverage is to use a statistical sample calculator to help us approximate the number of samples (or different combinations) that we should test to achieve a desired level of confidence. In our demo, we stated there are now 3840 possible combinations (assuming all combinations are valid). Given this population of possible tests the number of samples (different combinations) we would need for a statistical confidence of 99% with a 3% sampling error and a standard deviation of .5 is 1248.
(NOTE: In statistical sampling, the smaller the total population, the greater the proportion of that population we need to sample. Automating your combinatorial tests for a non-trivial feature is a best practice and can help us effectively expand our coverage of n-way combinations.)
Of course, even if we tested 1248 different combinations there is still no guarantee that we will test the ‘particular combinations’ that might trigger an unexpected anomaly. But, no other approach to testing can guarantee we test with these unknown ‘particular combinations’ either. However, systematically increasing the sample size of any given population of possible combinations (or values) is likely to improve our chances of exposing an anomaly not detected by a static output from a basic combinatorial testing tool, or at least increase our overall confidence.
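For readers who want to see where a number like that comes from, here is a minimal Python sketch using a common textbook formula (Cochran's sample size estimate with a finite population correction). It is an assumption that the calculator works this way; the function name `sample_size` is made up, the ".5" is treated as the assumed proportion p, and the z-score for 99% confidence is rounded to 2.58.

```python
def sample_size(population, z, margin, p=0.5):
    """Estimate how many random samples a finite population needs.

    population: number of possible tests or values
    z:          z-score for the confidence level (1.96 for 95%, about 2.58 for 99%)
    margin:     acceptable sampling error, e.g. 0.03 for 3%
    p:          assumed proportion; 0.5 is the most conservative choice
    """
    # Estimate for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # The finite population correction shrinks the estimate for small populations.
    return n0 / (1 + (n0 - 1) / population)

n = sample_size(3840, z=2.58, margin=0.03)
print(round(n))  # 1248
```

Different calculators round the z-score and the final result differently, so results that differ by a sample or two from this sketch are to be expected.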
Testing essentially helps provide confidence and reduce risk because we can’t test everything!
In the next post we will discuss how to effectively deal with invalid variable combinations.
Testing is Sampling
Originally Published Thursday, July 16, 2009
It seems it is about this time of year that I need to detach a bit from the world to reflect back on the past year and reevaluate my personal and professional goals moving forward. Perhaps I am just getting older or perhaps just a bit wiser (that is synonymous with ‘sapient’ for the C-D crowd), but I find it refreshing to break away this time of year to tend to my gardens, work on my boat, read some novels, and contemplate life’s joys. Now, the major work projects are (almost) finished on my boat, the garden is planted and we are harvesting the early produce, and I reset both personal and professional development objectives for the next year and beyond. So, let me get back to sharing some of my ideas about testing.
Many of you who read this blog also know of my website Testing Mentor where I post a few job aids and random test data generation tools I’ve created. I am a big proponent of random test data using an approach I refer to as probabilistic stochastic test data. In May I was in Dusseldorf, Germany at the Software & Systems Quality Conference to present a talk on my approach. I especially enjoy these SQS conferences (now igniteQ) because the attendees are a mix of industry experts and academia, and I was looking for feedback on my approach. I call my approach probabilistic stochastic test generation because the process is a bit more complex than simple random data generation. Similar to random data generation we cannot absolutely predict a probabilistic system, but we can control the feasibility of specified behaviors. And the adjective stochastic simply means "pertaining to a process involving a randomly determined sequence of observations each of which is considered as a sample of one element from a probability distribution." In a nutshell, my approach involves segregating the population into equivalence partitions, then randomly selecting elements from specified parameterized equivalence partitions (which is how we know the probability of specific behaviors); finally, the data may be mutated until the test data satisfies the defined fitness criteria. By combining equivalence partitioning and basic evolutionary algorithm (EA) concepts it is possible to generate large amounts of random test data that is representative of a virtually infinite population of possible data.
One of the questions that came up during the presentation was how many random samples are required for confidence in any given test case; in other words, how do we determine the number of tests using randomly generated test data? This is not an easy question to answer because the sample size of any given population depends on several factors such as:
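To make the three steps (partition, weighted random selection, mutation until fit) a little more concrete, here is a deliberately tiny Python sketch. It is not the code behind my tools: the character partitions, their weights, and the fitness rule are all hypothetical choices made only for illustration.

```python
import random
import string

# Hypothetical equivalence partitions for characters in a text input.
# Each entry is (members, probability of choosing that partition).
PARTITIONS = {
    "upper": (string.ascii_uppercase, 0.4),
    "lower": (string.ascii_lowercase, 0.4),
    "digit": (string.digits, 0.2),
}

def random_element(rng):
    """Pick a partition by its weight, then a uniformly random member of it."""
    names = list(PARTITIONS)
    weights = [PARTITIONS[name][1] for name in names]
    chosen = rng.choices(names, weights=weights, k=1)[0]
    return rng.choice(PARTITIONS[chosen][0])

def is_fit(candidate):
    """Hypothetical fitness criterion: at least one digit and one letter."""
    return (any(c.isdigit() for c in candidate)
            and any(c.isalpha() for c in candidate))

def generate(length, rng, max_mutations=1000):
    """Build a random string, mutating one position at a time until it is fit."""
    chars = [random_element(rng) for _ in range(length)]
    mutations = 0
    while not is_fit("".join(chars)):
        if mutations == max_mutations:
            raise RuntimeError("no fit candidate within the mutation budget")
        chars[rng.randrange(length)] = random_element(rng)  # point mutation
        mutations += 1
    return "".join(chars)

rng = random.Random()  # pass a recorded seed here to reproduce a failing run
print(generate(8, rng))
```

Because the partition weights are explicit, we know the probability of each kind of character appearing, which is the ‘probabilistic’ part; the mutation loop is the simple evolutionary part.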
- variability of data
- precision of measurement
- population size
- risk factors
- allowable sampling error
- purpose of experiment or test
- probability of selecting "bad" or uninteresting data
Using sampling for equivalence class partition testing
But, the question also brought to mind a parallel discussion regarding how we go about selecting elements from equivalence class partition subsets. I am adamantly opposed to hard-coding test data in a test case (automated or manual), but a colleague challenged me and said that since any element in an equivalence partition is representative of all elements in that partition, why can’t we simply choose a few values from that equivalence subset? I realize this approach is used all the time by many testers, which is perhaps why we sometimes miss problems. But, hard-coding some small subset of values from a relatively large population of possible values is rarely a good idea, and is generally not the most effective approach for robust test design. One problem with hard-coding a variable is that the hard-coded value becomes static, and we know that static test data loses its effectiveness over time in subsequent tests using the same exact test data. Also, hard-coding specific values in a range of values means that we have absolutely 0% probability of including any other values in that range that are not specified. Another problem with hard-coded values stems from the selection criteria used to choose the values from a set of possible values. Typically we select values from a set based on historical failure indicators, customer data, and our own biased judgment or intuition of ‘interesting’ values.
However, the problem is that any equivalence class partition is a hypothesis that all elements are equal. Of course, the only way to validate or affirm that hypothesis is to test the entire population of the given equivalence class partition. Using customer-like values, or values based on failure indicators, and especially values we select based on our intuition are biased samples of the population, and may only represent a small portion of the entire population. Also, the number of values selected from any given equivalence partition set is usually fewer than the number required for some reasonable level of statistical confidence. So, while we definitely want to include values representative of our customers, values derived from historical failure indicators, and even our own intuition, we should also apply scientific sampling methods and include unbiased, randomly sampled values or elements from our set of values or population to help reduce uncertainty and increase confidence.
For example, let’s say that we are testing font size in Microsoft Word. Most font sizes range from 1pt through 1638pt and include half-sized fonts as well within that range. That is a population size of 3275 possible values (1638 whole sizes plus 1637 half sizes). If we suspected that any value in the population had an equal probability of causing an error the standard deviation would be 50%. In this example, we would need a sample size of 343 statistically unbiased randomly selected values from the population to assert a 95% confidence level with a sampling error or precision of ±5%. Even in this situation, the number of values may appear to be quite large if the tests are manually executed, which is perhaps one reason why extremely small subsets of hard-coded values fail to find problems that are exposed by other values within that equivalence partition (all too often after the software is released). Fortunately, statistical sampling is much easier and less costly with automated test cases and probabilistic random test data generation.
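In an automated test, drawing such a sample takes only a few lines. The Python sketch below assumes simple random sampling without replacement over the half-point steps described above; the function names are made up for illustration and the test that consumes the values is left out.

```python
import random

def font_size_population(low=1.0, high=1638.0, step=0.5):
    """Every font size from low through high in half-point steps."""
    count = int(round((high - low) / step)) + 1
    return [low + i * step for i in range(count)]

def sample_font_sizes(n, rng):
    """Draw n distinct sizes, each equally likely to be chosen."""
    return sorted(rng.sample(font_size_population(), n))

rng = random.Random()  # log or fix the seed if a failure must be reproduced
sizes = sample_font_sizes(343, rng)
print(sizes[:5])
```

Every run exercises a different set of 343 sizes, so over several test passes the coverage of the partition keeps growing instead of staying fixed at a handful of hard-coded values.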
Testing is Sampling
Statistical sampling is commonly used for experimentation in natural sciences as well as studies in social sciences (where I first learned it while studying sociology and anthropology). And, if we really stop to think about it, any testing effort is simply a sample of tests from the virtually infinite population of possible tests. Of course, there is always the probability that sampling misses or overlooks something interesting. But, this is true of any approach to testing and is explained by B. Beizer’s Pesticide Paradox. The question we must ask ourselves is this: will statistical sampling of values in equivalence partitions or other test data help improve our confidence when used in conjunction with customer representative data, historical data, and data we intuit based on experience and knowledge? Will scientifically quantified empirical evidence help increase the confidence of the decision makers?
In my opinion anything that helps improve confidence and provides empirical evidence is valuable, and statistical sampling is a tool we should understand and put into our professional testing toolbox. There are several well established formulas for calculating sample size that can help us establish a baseline for a desired confidence level. But, rather than belabor you with formulas, I decided to whip together a Statistical Sample Size Calculator that I posted to CodePlex and also on my Testing Mentor site to help testers determine the minimum number of samples of statistically unbiased randomly generated test data from a given equivalence partition to use in a test case to help establish a statistically reliable level of confidence.
Cockamamie chaos causes confusion; controlled chaos cultivates confidence!