
Evaluating Exploratory Testing

This month’s issue of Testing Experience published my article summarizing the findings of several case studies of exploratory testing, both inside and outside of Microsoft. Although some people consider me a harsh critic of exploratory testing, nothing could be further from the truth. When I started my career as a professional tester, my approach to software testing was primarily exploratory in nature. I focused on executing as many negative tests as I could possibly conceive of in search of the most heinous bugs I could find, and I was good at it. My criticism is not of exploratory testing as an approach; however, I do ‘question’ the claim that exploratory testing is “orders of magnitude more productive.” I am also critical of the argument that we don’t understand exploratory testing if we don’t conform to one notion of the concept (or buy into an ideological doctrine), because I don’t believe that there is only one ‘right’ way to perform or think about exploratory testing.

Of course, I know it is unpopular to question the claims of exploratory testing ‘experts,’ but I just happen to be one of those people who question things that are founded on anecdotal observations without any hard data to substantiate those claims. I certainly don’t have all the information, but I personally like to be able to back up my position with facts (known at the time) and several verifiable/repeatable data points so I can answer questions from a defensible position rather than trying to convince or cajole someone with my subjective opinion. (I know a lot of studies show that many Americans base their decisions on their emotional state at the time. But I learned a long time ago that you should never buy the boat you fall in love with because you will spend more time maintaining her than sailing her.) Also, it’s easier to persuade me that I might be wrong with solid, verifiable information and repeatable data than with emotional rhetoric or personal insults.

I think most people who promote exploratory testing are well intentioned and realize that exploratory testing, in conjunction with other testing approaches, adds value to any testing effort. I also think that many practitioners realize that we must not only hone our intellectual capabilities of critical thinking and logical reasoning, but also constantly build our knowledge and skills of the other approaches, methods, and techniques used in our professional trade.

At Microsoft, I can’t think of any testing group that does not use exploratory testing as part of its overall strategy. We have learned not to rely on exploratory testing as our primary approach because it simply doesn’t scale as project size and complexity increase, and it is easy for testers to focus too much on out of context issues in hopes of finding another bug. As one Principal Test Manager summarized, exploratory testing helps

  • flush out “low hanging fruit” (identify obvious issues very quickly)
  • provide welcomed context switching by getting folks to look at other areas of the product
  • seed new testing ideas or help identify holes (which is great as long as we have a way to preserve those ideas and they are learnable by other testers)

But, of course, it was also noted that greater ‘system knowledge’ and an understanding of other various testing techniques and approaches enriched the overall effectiveness of the testers on the teams. My job as a teacher and mentor of software testing is to take really smart people who already know how to think critically about problems and provide them with the foundational knowledge of alternative techniques, methods, approaches, and the skills that are specific to the profession of software testing that will enable them to decide what approach to use depending on the context.

Similar to other testing approaches, exploratory testing has benefits and limitations: it is more effective in exposing certain categories of issues and less effective at exposing other types of problems. (See post on Pesticide Paradox.) And now we have researched case studies that begin to help us understand how to utilize exploratory testing as part of our overall testing strategy. Of course, further research could be done in this area, but it is very interesting that the independent studies used in the article reached similar findings and conclusions.

Anyway, I look forward to comments or feedback on the article.

Refactoring for Testability

 

Teaching my daughter about bullet seating depth.


One of my hobbies is shooting CMP matches and long range precision shooting. Besides lots of practice perfecting the techniques, a big part of precision shooting depends on the ammunition and studying the ballistic patterns of various loads. All precision shooters custom load their ammunition, and it is not as simple as reading a reloading manual. Slight variations of .001” in the seating depth of a bullet or .1 grain of powder may determine whether the group of shots at a target 600 yards away is 1” MOA or 6” MOA. So, getting the ammunition to match the rifle requires continually analyzing your shots, making slight adjustments to the load, and repeating; in computer jargon we might call that refactoring. Reloading for precision is a process of continual optimization until we find the optimal load. Similarly, one of the things we do in the Engineering Excellence group at Microsoft is to continually analyze our internal processes and practices to see how we can help our business groups constantly improve and optimize towards their target. One of the big things on our plate these days is testability.

In Testing Object-Oriented Systems: Models, Patterns, and Tools, a book I consider one of the most important books on software testing practices, the author Robert Binder defines testability as “The relative ease or difficulty of producing and executing an economically feasible test suite to determine whether the [system under test] SUT (i) conforms to stated requirements and specifications, and (ii) exhibits an acceptably low probability of failure.” This and several other definitions of testability floating around on the web generally agree that testability involves

1.) The ease with which the SUT can be tested
2.) The cost of testing is reasonable

So, as testability increases, so does the ease with which our tests can determine whether the SUT satisfies implicit and explicit requirements and has a low chance of failure, and the cost of testing goes down. This all sounds nice, but unfortunately testability cannot be directly measured; testability is a qualitative measure. Although we can’t accurately measure testability, we can sometimes do small things to improve it and help reduce testing costs, either by reducing the number of tests required to determine whether the SUT satisfies the stated requirements and has a low chance of failure, or by finding ways to test more efficiently through better designs.

In last week’s post I referred to a pseudo code example that was written to illustrate how bugs could linger in code despite a high measure of code coverage. Of course we should realize that pseudo code is generally a far cry from the real implementation of the code. Pseudo code is simply a model, and there are many ways to implement that model. The advantage of a model is that we can often test a model earlier to identify potential issues before a single line of code is written. In this particular pseudo code sample, there were a couple of things that stood out that could likely impact the testability of an implementation of the pseudo code model. So, the neurons in my brain started firing with lots of testing related questions.

So, let’s use that example to discuss potential testability issues. The sample was based on a requirement that stated “Student ID’ are seven digit numbers between one million and 6 million inclusive.” The function is relatively simple in that it takes a string type passed to the sid parameter, and returns a Boolean true or false to the calling function depending on whether the string satisfies the internal Boolean conditions it is being compared against. But this function also calls 2 other functions: the length() function and the number() function. From the function names I would think the length() function provides a numeric value that represents the number of characters in the string passed to the sid parameter. I am also betting the number() function returns a numeric value (it converts the string variable to a numeric type such as an integer). The pseudo code example was

function validate_studentid(string sid) return TRUEFALSE
BEGIN
  STATIC TRUEFALSE isOk;
  isOk = true;

  if ((length(sid) is not 7) then
    isOk = False;

  if (number(sid) <= 1000000 or number(sid) > 6000000 then
     isOk = False;

  return isOk;

END

One of the reasons that we hire testers with a programming background at Microsoft is that they can help the developer identify potential issues, reduce the probability of failure, and improve testability by stepping through the code during peer reviews, or while designing additional tests to cover un-tested or under-tested areas of the code that are exposed by code coverage analysis. So, when I come across a code sample, I generally step through it to

  • See if it will work as intended (basic unit test)
  • See if there are any potential obvious errors in logic
  • Identify tests necessary for branch or conditional coverage (because developers are usually only concerned with block coverage)
  • Identify argument values for negative testing that might expose undesirable results (bugs)

So, in this pseudo code example, once I got to the second conditional clause (if (number (sid)) <= 1000000 or number (sid) > 6000000 then) the little cranks in my brain began to turn. I thought to myself, why are we checking the length of the string? I mean, if the number can only be between 1,000,000 and 6,000,000 then it seems to me that checking the length of the string is simply redundant.

If we remove the first conditional clause (if ((length(sid) is not 7) then), we actually reduce the number of tests from 4 to 3, assuming short-circuit evaluation, since short-circuiting compound Boolean expressions is one of several code optimization techniques. (By the way, the first caveat example in Wikipedia on short-circuiting where a function used as a Boolean conditional also “performs some required operation regardless of whether the first conditional evaluates true or false” is simply poor architectural design and is very, very likely to be problematic.) The 3 tests for condition (and basis path) coverage, which exercise the true and false outcome of every single Boolean conditional expression, are listed in the table below.

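| Test                                  | Conditional 1: number(sid) <= 1000000 | Conditional 2: number(sid) > 6000000 | Expected Result |
|---------------------------------------|---------------------------------------|--------------------------------------|-----------------|
| Any value between 1000000 and 6000000 | false                                 | false                                | true            |
| Any value > 6000000                   | false                                 | true                                 | false           |
| Any value < 1000000                   | true                                  | (short-circuited)                    | false           |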

Of course, even testing several samples from the equivalence partitions may not expose the bug in this code, because the bug is a typical boundary error. (In a previous post I explained the basic fault model that causes many boundary issues. In a nutshell, boundary bugs are generally caused by incorrect relational operators or magic numbers in code.) Without recognizing that we also need to test the boundaries (999999, 1000000, 1000001, and 5999999, 6000000, 6000001), we could easily overlook the error in the pseudo code.
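To make that concrete, here is a minimal sketch in Python that transliterates only the range check from the pseudo code and compares it with the requirement as written. The function names are hypothetical, and using int() as a stand-in for the number() function is an assumption, since the pseudo code never says how number() behaves.

```python
# Hypothetical transliteration of the pseudo code's range check, for illustration only.
# int() stands in for number(); that mapping is an assumption.

def as_written(sid: str) -> bool:
    """Range check exactly as the pseudo code states it (boundary bug included)."""
    n = int(sid)
    return not (n <= 1000000 or n > 6000000)


def per_requirement(sid: str) -> bool:
    """Range check as the requirement reads: 1,000,000 through 6,000,000 inclusive."""
    n = int(sid)
    return 1000000 <= n <= 6000000


BOUNDARY_VALUES = ["999999", "1000000", "1000001", "5999999", "6000000", "6000001"]


def mismatches(values):
    """Return the inputs where the code as written disagrees with the requirement."""
    return [v for v in values if as_written(v) != per_requirement(v)]


if __name__ == "__main__":
    # Samples from the middle of each partition, then the boundary set.
    print(mismatches(["500000", "3000000", "7000000"]))
    print(mismatches(BOUNDARY_VALUES))
```

Samples taken from the middle of each partition produce no disagreement at all; only the boundary set contains the one input (1000000) where the two checks differ.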

Another thing that caught my attention was the lack of exception handling. Some people may not consider including exception handling in pseudo code and take it as a given. But, as a tester, when I don’t see exception handling in pseudo code in a review then I need to start asking questions so I can better design tests to exercise the exception handling control flow paths that directly impact code coverage measures. Another reason this is an important consideration is that results of code coverage analysis indicate that exception handlers are generally under-tested. It seems we are really good at finding unhandled exceptions with our negative tests (which is really good), but we do not seem to be as thorough in testing the logical code paths of exception handlers. This is especially true for predicate statements in which multiple Boolean sub-expressions might trigger an exception. We tend to test one of the conditionals, and the other conditional expressions in that statement are often under-tested.

So, we can surmise the number () function must be converting the string parameter (the sid variable) to a numeric type and returning a type of number because the conditional clause is comparing it to magic numbers (1000000 and 6000000). But if we entered a string that contained non-numeric characters my initial thought was that the number () function would throw an exception that is unhandled by the validate_studentID () function.

Then I thought a bit more, and considered that the number() function might swallow the exception and return a 0 or even a -1. Now, there are some arguments in favor of swallowing exceptions, but in general it is not a good idea. In this case, it is probably a bad idea because one of the primary purposes of a separate function is reusability. If the number() function is reused in some other code, or another part of the code where we need to convert a string to a numeric type regardless of the range (within the range of the data type being converted to), I would suspect we would want to throw an exception, and then rethrow the exception in the calling function. Of course, this is where the rubber hits the road, and a professional tester needs to dig in and start asking some hard questions as to how the developer is going to handle this situation. If the number() function is not going to be reused, then most modern programming languages include a function call that will easily convert the string to a numeric type and do it more efficiently than calling a separate function. And maybe in that case we could swallow the exception in the validate_studentID() function and simply return false as illustrated in the C# code below.

try
{
    // Parse once; int.Parse throws on malformed or out-of-range input.
    int id = int.Parse(sid);
    if (id < minValue || id > maxValue)
    {
        isOk = false;
    }
}
catch (FormatException)
{
    // sid contained non-numeric characters
    isOk = false;
}
catch (OverflowException)
{
    // sid does not fit in a 32-bit integer
    isOk = false;
}
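The same trade-off can be sketched in Python: keep the reusable conversion strict so that it raises, and let the validator make the local decision to treat a conversion failure as an invalid ID. Everything here is hypothetical: to_number() is only a stand-in for the number() function, and MIN_ID and MAX_ID are assumed from the requirement's inclusive range.

```python
# Hypothetical sketch: a reusable converter that raises, and a validator that
# decides locally to treat a conversion failure as "not a valid ID".

MIN_ID = 1000000  # assumed from the requirement's inclusive range
MAX_ID = 6000000


def to_number(text: str) -> int:
    """Reusable conversion: lets ValueError propagate so each caller decides."""
    return int(text)


def validate_student_id(sid: str) -> bool:
    """Return True only for strings that convert to a number in the valid range."""
    try:
        value = to_number(sid)
    except ValueError:
        # Swallowing here is a local decision; to_number() itself stays strict.
        return False
    return MIN_ID <= value <= MAX_ID


if __name__ == "__main__":
    for candidate in ["1000000", "6000000", "6000001", "foo"]:
        print(candidate, validate_student_id(candidate))
```

Python integers do not overflow, so there is no counterpart to the OverflowException handler in the C# version; a very large digit string simply fails the range check.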

With the push to drive quality upstream, reduce costs (especially testing costs), and improve testability, I envision that many testers will be working alongside our development counterparts to help them prevent defects from getting into the product code base, and improve the maintainability of the code. This doesn’t mean that testers will become developers or vice versa; it simply means that testers are (generally) experts in designing tests, and developers are experts in designing solutions that adhere to requirements. Rather than an adversarial relationship, I suspect in the future developers and testers will have a more symbiotic relationship to improve the intrinsic quality of our code bases.

The bottom line of all this is that in teams where testers are designing white box tests for improved code coverage (control flow testing), or where testers are engaged in design reviews or peer reviews of code prior to check in, I hope this gives you some things to think about.

Reconsidering Code Coverage

Tonight on my way to teach a test automation course at the University of Washington I had some free time to catch up on my reading. My manager asked me if I had read this month’s copy of one of the several testing magazines we get and I replied that I had downloaded it but hadn’t had a chance to read it yet. So, he tossed me the hardcopy of the magazine and said, “Enjoy.” Now this should have been a clue because although Alan is a great manager and mentor, I think he secretly likes to see the veins in my neck swell and blood shoot out of my eyes from time to time.

I read a lot of articles, white papers, and books. I like most of what I read, even if I disagree with some of the points being made. Until tonight, I couldn’t remember ever reading an article on software testing that made me angry. I was not angry because of the message of the article. In fact, I think the point the authors are trying to make is valid and I agree with them on their fundamental point. Unfortunately, the article is filled with so many technical inaccuracies that the end message was almost lost.

I spent the last 10 years studying various techniques, methods, and approaches in software testing. I teach more than 500 testers a year on structural testing techniques, and am now working with a team in the Windows division to implement a new tool for just in time code coverage analysis at the component level that allows us to see how our tests exercise code paths in changed code and the dependent modules. I also discuss structural testing in chapter 5 of our book How We Test Software At Microsoft. I don’t really consider myself to be an expert in the subject, but I might know a thing or two about it. So, let’s Reconsider Code Coverage!

In August 2007 I wrote an informative blog post on the potential misuse of the code coverage measure. But code coverage measures are used by some companies as one of many ways to help them reduce risk. And, let me be very clear here, there is no correlation between code coverage and quality, and code coverage measures don’t tell us “how well” the code was tested. The code coverage measure simply measures what code has been executed, and more importantly what code has not been executed. The value of measuring code coverage is not in producing some “magic number,” but that it helps testers investigate untested or under-tested areas of the product and design additional tests (generally using structural testing techniques) to improve coverage and reduce overall risk.

Just because you execute a line of code doesn’t mean a bug doesn’t still exist, but if you don’t execute a line of code you have 0 probability of finding a bug if one exists!

Also it is important to note there are several ways to measure code coverage. Different tools employ different measures and sometimes different tools measure the same type of coverage differently. Also, I discovered that even the same tool can measure the same code differently depending on how it is compiled (debug, retail, etc.) and previously wrote about my study. Some of the basic ways to measure code coverage (not test coverage) include:

  • Function coverage measures the percentage of functions or methods in a class or application that are called at runtime.
  • Statement coverage measures the percentage of executable statements exercised at runtime.
  • Block coverage measures the percentage of each sequence of non-branching statements that are executed at runtime. Block coverage subsumes statement coverage.
  • Decision or branch coverage measures the percentage of both Boolean (not binary) outcomes (true and false) of simple conditional expressions at runtime. If a predicate statement has more than one conditional sub-expression decision (or branch) coverage treats that predicate statement as one conditional clause. Decision coverage subsumes block coverage.
  • Condition coverage measures the percentage of both Boolean outcomes of each conditional sub-expression separated by logical and or logical or in compound predicate statements. Condition coverage subsumes decision coverage.
  • Basis path coverage measures the number of linearly independent paths through a program. Basis path coverage is based on McCabe’s cyclomatic complexity research.
  • Path coverage measures every possible path from the entry to the return statement (or exception) or exit of every method. Unfortunately path testing is usually impossible due to the sheer number of path combinations, and the inability to execute constrained path combinations.
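As a rough illustration of how decision coverage and condition coverage can diverge, the sketch below records which Boolean outcomes a set of inputs exercises for the compound predicate n <= 1000000 or n > 6000000. The helper names are hypothetical, and the percentages are a simplified hand count for this one predicate; real coverage tools may count differently.

```python
# Hypothetical illustration: record which Boolean outcomes a set of inputs exercises
# for the compound predicate `n <= 1000000 or n > 6000000`.

def outcomes(n: int):
    """Return (cond1, cond2, decision); cond2 is None when short-circuited."""
    cond1 = n <= 1000000
    cond2 = None if cond1 else n > 6000000
    decision = cond1 or bool(cond2)
    return cond1, cond2, decision


def coverage(inputs):
    """Return (decision %, condition %) for the outcomes seen across inputs."""
    seen_decision, seen_c1, seen_c2 = set(), set(), set()
    for n in inputs:
        c1, c2, d = outcomes(n)
        seen_decision.add(d)
        seen_c1.add(c1)
        if c2 is not None:
            seen_c2.add(c2)
    # One decision with 2 outcomes; two sub-expressions with 2 outcomes each.
    decision_pct = 100 * len(seen_decision) / 2
    condition_pct = 100 * (len(seen_c1) + len(seen_c2)) / 4
    return decision_pct, condition_pct


if __name__ == "__main__":
    print(coverage([500000, 3000000]))
    print(coverage([500000, 3000000, 7000000]))
```

With only 500000 and 3000000 the predicate as a whole has been both true and false, yet the second sub-expression has never evaluated true; adding 7000000 closes that gap.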

Clearly there are different measures of code coverage, and certain types of measures subsume other measures. So, now that we have a handle on the different types of code coverage measures, let’s look at testing some code. We will use the same pseudo code used in the aforementioned article which is based upon the following requirement.

“Student ID’ are seven digit numbers between one million and 6 million inclusive.”

The authors provided the following pseudo code example for a function to meet this requirement.

function validate_studentid(string sid) return TRUEFALSE
BEGIN
  STATIC TRUEFALSE isOk;
  isOk = true;

  if ((length(sid) is not 7) then
    isOk = False;

  if (number(sid) <= 1000000 or number(sid) > 6000000 then
     isOk = False;

  return isOk;

END

So, set aside two facts: there is no reason to ‘test’ the length of the sid variable before evaluating whether it is within the allowable range (removing this first conditional improves both the performance and the testability of the code), and if the call to the number() function fails to convert the string to a number for a valid Boolean comparison it will throw an unhandled exception. Let’s look at path testing of this simple example by starting with control flow diagrams of each possible path (assuming we avoid the unhandled exception by never passing this function a string of characters such as “foo” rather than a string of digits).

Control flow diagram for validate_studentID() function pseudo-code


(Edited 11/25: After thinking about this a bit more, if the number() function returned a 0 (zero) if the input was incorrectly formatted, then the number() function would not throw an exception, and the control flow path would be identical to the first test in the table below).

Because we are doing path coverage testing and not decision testing, we actually have to separate each Boolean conditional sub-expression in the second compound predicate statement if (number(sid) <= 1000000 or number(sid) > 6000000. The example in the article treated both sub-expressions in the compound predicate statement as a single Boolean expression, which would be synonymous with decision coverage. Path coverage actually treats each sub-expression as if there were 2 single Boolean conditions such as

if (number(sid) <= 1000000) then
  isOk = False;

if (number(sid) > 6000000) then
  isOk = False;

The table below illustrates the tests required for testing control flow through this function for path coverage (again assuming we are going to ignore the unhandled exception in the code that would occur by passing in a string such as “foo.”)

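| Input (sid) | Conditional: length(sid) != 7 | Conditional: number <= 1mil | Conditional: number > 6mil | Expected Result | Actual Result |
|-------------|-------------------------------|-----------------------------|----------------------------|-----------------|---------------|
| 999999      | true                          | true                        | false                      | False           | False         |
| 6500000     | false                         | false                       | true                       | False           | False         |
| 1000000     | false                         | true                        | false                      | True            | False         |
| 6000000     | false                         | false                       | false                      | True            | True          |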

The first test would be a value with fewer than 7 digits, and would cause the first two Boolean conditional expressions to evaluate as true, which sets the isOk variable to false (twice), and we correctly return the expected result of false (or invalid ID). The second test is a number greater than 6,000,000 (but less than the maximum value that would result in an unhandled overflow exception hopefully being thrown by the number() function). In this case the 3rd conditional expression (if (number(sid) > 6000000) would evaluate as true and the function would return false. The 3rd path is buggy. In this pseudo code example, the only seven-digit number that exercises the true outcome of the Boolean condition if (number(sid) <= 1000000 is 1,000,000; any larger value causes this Boolean condition to evaluate as false, and any smaller value has fewer than seven digits. In this case we expect the function to return true, but it in fact will return false. Finally, any number from 1,000,001 through 6,000,000 will return a true result indicating a valid student ID.
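The four rows of the table can be reproduced with a small trace, sketched here in Python. This is a hypothetical transliteration of the pseudo code, not the article's implementation: len() and int() stand in for length() and number(), which is an assumption, and the expected() helper encodes my reading of the requirement.

```python
# Hypothetical transliteration of the article's pseudo code; len() and int()
# stand in for length() and number(), which is an assumption.

def trace(sid: str):
    """Return the three conditional outcomes and the function's result."""
    is_ok = True
    c_len = len(sid) != 7
    if c_len:
        is_ok = False
    n = int(sid)
    c_low = n <= 1000000
    if c_low:
        is_ok = False
    c_high = n > 6000000
    if c_high:
        is_ok = False
    return c_len, c_low, c_high, is_ok


def expected(sid: str) -> bool:
    """Result the requirement calls for: seven digits, 1,000,000 to 6,000,000 inclusive."""
    return len(sid) == 7 and 1000000 <= int(sid) <= 6000000


if __name__ == "__main__":
    for sid in ["999999", "6500000", "1000000", "6000000"]:
        print(sid, trace(sid), "expected:", expected(sid))
```

Only the input 1000000 produces an actual result that differs from the expected one, which is the buggy third path.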

The article also suggests that structural testing misses other problems. But, when we look at these issues, they actually have nothing to do with structural testing of the function; in other words they are completely out of context of the problem being discussed.

For example, the article asserts that the requirement is incorrect and should have read 6,999,999 (which I believe is a typo and should be 5,999,999) because of confusion over the word “inclusive.” Inclusive means “including the stated limit or extremes in consideration or account,” but in computing inclusive means “the predicate holds for all elements of an increasing sequence then it holds for their least upper bound.” I disagree with this assumption because I suspect the analyst writing the spec is basing the inclusive range on the common definition, and not a definition based on domain theory.

The article questions what would occur with incorrectly formatted numbers such as 123 456 789 or 123,456,789. So, besides the point that these values are not within the valid range of student ID numbers, the answer to the question would actually lie in how the number() function being called handles improperly formatted numbers (e.g., throwing a format exception, which again is unhandled in our validate_studentid() function), or how an event handler that sits between the UI and the function might deal with invalid or incorrectly formatted inputs.

The next question concerned resizing of the input window or the screen (assuming desktop resolution) and repainting the window or form and its effect on code coverage of the validate_studentid() function. Well, I am going out on a limb here and I am going to say…”what are you talking about?” I am not quite sure how to phrase this, but let me try…resizing or repainting a window has 0 effect on the structural control flow of the validate_studentid() function. (Of course, I could be wrong, and the length() and number() functions might have some code that mysteriously interacts with the repainting libraries and affects how they determine the length of a string or whether a string is a valid number.)

Bugs in external libraries are part of the business. Hopefully those external libraries are well tested or at least documented especially if our development team wrote them. Personally, I have not encountered any public functions or APIs which use wild ass random numbers such as 5.8 million as boundary values, but that’s not to say it couldn’t happen. And of course, if these external functions throw exceptions (as they should based on what they are probably doing), we should have exception handler code in our function to deal with any exceptions thrown from external libraries or function calls.

Based on incorrect path analysis, and out-of-context questions that have nothing to do with control flow through the validate_studentid() function, the article suggests that path testing is not a magic potion, but I am not too sure that anyone actually believes it is. And so, the article suggests that “input combinatorics coverage” might work better. Hmm…now I have been teaching combinatorial testing for over 10 years and have read some interesting papers on the effectiveness of combinatorics on statistical testing and code coverage, and I must say I am pretty sure you need more than one input parameter in combinatorial testing!

Finally, I don’t agree that code coverage measures tell us “how well the developers have tested their code.” The code coverage measure only tells us what percentage of the code has been executed in a particular way, and more importantly it tells us what percentage of the code has not been executed. We must determine whether we need to investigate that area to reduce risk. Of course, many code coverage tools provide a “heat map” that helps us and developers identify untested code, and that is where we shift from the simple act of measuring coverage to the testing method of code coverage analysis in order to design new tests that effectively exercise previously untested code if that level of coverage is important to reduce overall risk.

heat map

My intent here is not to ridicule the authors of the article. In fact, I agree with their summation that testers should not believe high code coverage numbers mean “well tested.” (Again see my blog post from Aug 2007.) Unfortunately, the path to the point was fraught with so many inaccuracies and tangents that I almost never made it to the end.

There are many books and white papers on this subject in the ACM and IEEE libraries. Books by Boris Beizer, Robert Binder, and others go into great detail on structural testing. McCabe’s papers linked to in this post are excellent resources.

OK…I feel better now. I need to clean up the blood, take a sedative, and go to sleep.

The Pesticide Paradox

Some of you know that I am an organic gardener. I do this not only because I like the fresh taste of vegetables and fruits that are not tainted with chemicals, but my daughter loves to eat things right off the vine or stem as we are harvesting our bounty and I am perhaps overly protective of my daughter. Here in the Pacific Northwest slugs are a particular problem, and I must use several different techniques and approaches to try to ward off or destroy these nasty, slimy creatures throughout the year because no single approach is 100% effective. Yes it is unfortunately true, even with my diligent efforts and myriad of tactics a few slugs still get by into the raised gardens. This sure seems a lot like software testing and how some tests find some bugs and miss others, and how iterative builds seem to allow new bugs to continually creep in. But, this really isn’t a revelation.

In 1983, in his seminal book Software Testing Techniques, Boris Beizer compared the diminishing effects of insecticides on boll weevils destroying cotton fields to the decreasing effectiveness of testing methods in exposing defects in software. This became well known as the Pesticide Paradox. The Pesticide Paradox states “Every method you use to prevent or find bugs leaves a residue of subtler bugs against which those methods are ineffectual.”

Now, unlike insects (or pests), software bugs don’t magically build up an immunity to tests. What happens is that we design and execute tests using one approach that exposes some set of issues. But then the number of issues being exposed by that approach starts to diminish, yet there is still some ‘residue of subtler bugs’ that hasn’t been detected by that set of tests or testing approach. Also, similar to how the ladybugs in my garden that eat aphids and mites have no effect on slugs, not all approaches or techniques we might use in our testing are effective in detecting or preventing all types of bugs.

For the past 10 years, I have been teaching various software testing practices and approaches for solving complex problems at Microsoft. One cool aspect of my job is that I get to experiment a lot in a reasonably controlled environment with diverse groups of people. We often design these studies to better understand the benefits and limitations of various testing approaches and methods in exposing different categories of defects to better understand how each approach can be used more effectively within the appropriate context. Of course, it should be of no surprise to anyone that the pesticide paradox holds true not just in the classroom, but also in practice.

I often explain Beizer’s paradox by stating, “there is no single approach or method used in software testing that is completely effective in exposing all bugs, and some approaches or methods are more effective in exposing different types or categories of bugs.”

At Microsoft there is no “one size fits all” solution, and the Engineering Excellence group doesn’t dictate how to test or what testing methods can or can’t be used. But, through a series of problem solving exercises in our new SDET training program each tester experiences the benefits and limitations of various approaches and techniques. Based on their experiences in the classroom and in ‘real life’ they also learn the most effective strategy for testing is not to rely too heavily on a single approach, and to use a variety of test design principles and patterns throughout the product lifecycle. But, that’s just how we roll.

Software Testing Techniques 2nd Edition

Localization Testing Part IV

Originally Published Thursday, November 12, 2009

The past series of posts has focused on one area of localization testing: the largest category of localization class issues reported by testers performing localization testing, which we categorize as usability/behavioral type issues because they adversely impact the usability of the software or how end users interact with the product. This is the last post in this series, but I do intend to publish a more complete paper covering localization testing in the near future….stay tuned. This final post will discuss issues that affect the layout of controls on a dialog or window and are generally referred to as clipping or truncation.

Clipping

Clipping occurs when the top or bottom portion of a control (including label controls that contain static text) is cut off so that the control or the control’s contents are not displayed completely, as illustrated below. Clipping and truncation are quite common on East Asian language versions because the default font size used in Japanese, Korean, and Chinese language versions is a 9 point font instead of the 8 point font used in English and other language versions. Clipping often occurs because developers fail to size controls adequately for larger fonts (especially common in East Asian language versions), or for display resolutions set to custom font sizes. Clipping also occurs because many localization tools are incapable of displaying a true WYSIWYG or runtime view of dialogs, requiring localizers to ‘guess’ when resizing controls on dialog layouts.

clipping

It is possible to test for potential clipping and truncation problem areas without a localized application. English language versions should function and display properly on all localized language versions of the Windows operating system. So, one way to check for potential clipping or truncation issues is to install the English language version of the application under test on an East Asian language version of the Windows operating system. Another way to test for potential clipping and truncation issues is to change the Windows display appearance or the custom font size via the Display Properties control panel applet.

However, because most current localization tools cannot dynamically resize controls and dialogs, and cannot display dialogs at runtime or present a true WYSIWYG view during the localization process, the localized language versions must also be tested for clipping and truncation caused by improper sizing and layout of controls.

Truncation

Truncation is similar to clipping, but typically occurs when the right side of a control is cut off (or the left side in the bi-directional displays used for Hebrew and Arabic), so that the control or the control’s contents are not displayed completely.

truncation

Other Layout Issues

Because some localization tools may not provide a true ‘WYSIWYG’ display of what a dialog or property sheet will look like at runtime, occasionally resizing may cause several controls to overlap. This is especially true when dialogs contain dynamic controls that are dependent on certain configurations or machine states.

In East Asian cultures it is common for an individual’s surname to precede the given (first) name. (It is also uncommon to have a middle name, so this field should never be required.) Therefore, the controls for name type data may need to be repositioned on dialogs in East Asian language versions. The localization team will reposition the last name label and textbox controls and the given name controls. This means that the logical tab order must be reset. Also, the surname textbox control should have focus when the dialog is displayed instead of the given name field.

The tab order of controls should allow for easy, intuitive navigation of a dialog. Design guidelines suggest a tab order that changes the focus of controls from left to right and top to bottom. Focus should change between each control in a logical order, and dialogs should never have a ‘loss of tab focus’ where no control on the dialog appears to have focus.

Tab order is typically problematic even in English language versions in the early lifecycle of many projects when the user interface is in flux. There is also a high probability of introducing tab order problems any time the controls on a dialog change.
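A tab order check of this kind lends itself to automation. Here is a minimal sketch in Python, assuming the position and tab index of each control have already been read from the form's resources; the `Control` tuple, its field names, and the sample values are hypothetical. It compares the actual tab order with a simple top-to-bottom, left-to-right reading order.

```python
from typing import NamedTuple


class Control(NamedTuple):
    name: str
    left: int       # x position in pixels
    top: int        # y position in pixels
    tab_index: int  # position of the control in the dialog's tab order


def tab_order_mismatches(controls):
    """Compare the actual tab order with a top-to-bottom, left-to-right order.

    Returns a list of (position, expected_name, actual_name) tuples.
    """
    expected = [c.name for c in sorted(controls, key=lambda c: (c.top, c.left))]
    actual = [c.name for c in sorted(controls, key=lambda c: c.tab_index)]
    return [
        (position, want, got)
        for position, (want, got) in enumerate(zip(expected, actual))
        if want != got
    ]


# Hypothetical East Asian dialog: the surname field was moved to the left of
# the given name field, but the tab order was not reset afterwards.
dialog = [
    Control("given_name", left=200, top=20, tab_index=0),
    Control("surname", left=20, top=20, tab_index=1),
    Control("ok_button", left=200, top=60, tab_index=2),
]
print(tab_order_mismatches(dialog))
```

This treats controls as being on the same row only when their top coordinates are equal, so a real check would probably need some tolerance for controls that are a pixel or two out of alignment.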

All localization testing doesn’t have to be manual

In the past much of the localization testing has been repetitive manual testing. Testers would manually step through every menu item and other link to instantiate every dialog and property sheet in the program and inspect it visually and test the behavior of such things as tab order, access keys, etc. for errors. This painstaking process would be repeated multiple times during the project lifecycle on every localized language version. Unfortunately, not only was this boringly repetitive, but because the manual testers were looking at so many dialogs during the workday their eyes simply tired out leading to missed bugs. So, there must be a better way.

We know that each dialog has a 2-dimensional size usually measured in pixels. Once we know the height and width of the dialog or property sheet we can measure the distance from the left most edge of the dialog to the leading edge of the first control. Using control properties such as size and location that are stored in the form’s resource file we can measure the size and position of each control on a dialog or property sheet. Once all controls are identified the distance and position of the controls can then be measured in relation to the dialog or property sheet and other controls.

Using a simple example, let’s consider 1 dimension of a dialog that is 250 pixels wide. The dialog contains a label control that is 15 pixels to the right of the left most edge of the dialog, and that label is 45 pixels in length. The textbox control next to the label starts at position 70, so there are 10 pixels between the right edge of the label control and the left edge of the textbox control. Now, let’s say that textbox control is 150 pixels wide. By adding the left margin, the width of the 2 controls, and the distance between the controls (15 + 45 + 10 + 150 = 220 pixels) and comparing the total to the width of the dialog, we can see whether truncation will occur on this dialog. Here the controls fit with 30 pixels to spare, but if resizing during localization pushed the right edge of the textbox past 250 pixels the control would be truncated. Similarly, we can also evaluate the relative position of controls on a dialog and detect alignment problems both horizontally and vertically more accurately than the human eye.
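The 1-dimensional measurement can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Control` class and `find_layout_issues` function are hypothetical names, and a real tool would read sizes and locations from the resource file and check both dimensions.

```python
from dataclasses import dataclass


@dataclass
class Control:
    """One control on a dialog, measured along the horizontal axis only."""
    name: str
    left: int   # distance in pixels from the dialog's left edge
    width: int  # width in pixels

    @property
    def right(self) -> int:
        return self.left + self.width


def find_layout_issues(dialog_width, controls):
    """Report controls that run past the dialog edge or overlap a neighbor."""
    issues = []
    ordered = sorted(controls, key=lambda c: c.left)
    for control in ordered:
        if control.right > dialog_width:
            issues.append(f"{control.name}: truncated by {control.right - dialog_width}px")
    for first, second in zip(ordered, ordered[1:]):
        if first.right > second.left:
            issues.append(f"{first.name} overlaps {second.name} by {first.right - second.left}px")
    return issues


# The numbers from the example: the textbox ends at pixel 70 + 150 = 220.
example = [Control("label", 15, 45), Control("textbox", 70, 150)]
print(find_layout_issues(250, example))

# A hypothetical localized layout in which the label was widened to 60 pixels
# and the textbox to 190 pixels without moving either control.
localized = [Control("label", 15, 60), Control("textbox", 70, 190)]
print(find_layout_issues(250, localized))
```

Run over every dialog in every language version, a check like this flags candidates for a human to look at rather than replacing visual inspection outright.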

Of course this is not a simple solution, but if you have thousands of dialogs and property sheets, and multiple language versions, investing in an automated solution may be invaluable. In one internal case study, testing efficiency increased, manual testing and overall direct costs were significantly reduced, and the effectiveness/accuracy of reported issues also increased. Perhaps not for everyone, but it is possible!

Localization Testing Part III

Originally Published Tuesday, November 03, 2009

Part 1 provided an overview of localization class issues, and Part II discussed issues with non-translated strings in a localized product and gave some helpful hints to manage that problem during the software development lifecycle. In Part III I will cover various issues with access key mnemonics. An access key mnemonic is the underlined letter on a menu or control that corresponds to a key on the native keyboard layout for a particular language.

Missing & duplicate access key mnemonics

Interestingly enough, most localization tools have built in tools to test for duplicate key mnemonics; however, missing or duplicate access key mnemonics are another significant issue in localization testing, and they affect the English language version as well. Duplicate or missing key mnemonics can adversely affect the usability of software because they impact the ability of the user to easily access or invoke commonly repeated functions using the keyboard. Duplicate or missing key mnemonics can also negatively impact the software’s ability to meet certain accessibility requirements.

 

Although missing or duplicate access key mnemonics are sometimes caused by poorly designed dialogs with an overabundance of controls, there are other factors that can cause duplicate key mnemonics. For example, some controls may dynamically appear in some dialogs in specific machine states. These dynamically generated controls may also come from a file that is different than the file which generated the dialog. Another reason for duplicate key mnemonics could also be dynamically generated key mnemonic assignments, which are especially problematic in situations where a dialog contains a mixture of statically assigned key mnemonics and dynamically generated key mnemonics.

Manual testing for missing or duplicate key mnemonics is especially labor intensive, and finding ways to automate this testing will save countless hours of time sitting in front of a computer checking menus and dialogs. There is also a large probability of missing duplicate key mnemonic assignment problems using manual testing methods because eyes get tired, people get bored, and some keys are grayed out (as in the illustration below) or may not be present in certain machine states. Fortunately, there are several automation tools that detect duplicate key mnemonic problems, and automated detection is more effective than manual test approaches. For example, AutomationElement.AccessKeyProperty in the .NET UI Automation class library is one approach to more efficiently test access key mnemonics from C#.
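To make the idea concrete, here is a small sketch in Python that works on label strings rather than on a running application. It assumes the labels follow the Win32 resource convention in which '&' precedes the access key character and '&&' is a literal ampersand; the function names and the sample menu are made up for illustration.

```python
from collections import defaultdict


def access_key(label):
    """Return the upper-cased access key marked with '&', or None if absent."""
    i = 0
    while i < len(label) - 1:
        if label[i] == "&":
            if label[i + 1] == "&":
                i += 2  # '&&' is an escaped ampersand, not an access key
                continue
            return label[i + 1].upper()
        i += 1
    return None


def check_mnemonics(labels):
    """Return (labels with no access key, {key: labels that share that key})."""
    by_key = defaultdict(list)
    missing = []
    for label in labels:
        key = access_key(label)
        if key is None:
            missing.append(label)
        else:
            by_key[key].append(label)
    duplicates = {key: found for key, found in by_key.items() if len(found) > 1}
    return missing, duplicates


# Hypothetical menu: 'Exit' has no access key, and two items share 'S'.
menu = ["&New", "&Open...", "&Save", "&Send To", "Save &As...", "Exit", "Find && Re&place"]
missing, duplicates = check_mnemonics(menu)
print("missing:", missing)
print("duplicates:", duplicates)
```

A static check like this only sees the strings in one resource file. It will not catch duplicates introduced by dynamically generated controls or by controls that come from a different file, which is where inspecting the live user interface earns its keep.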

Access key mnemonic assignments

Inappropriate key assignments

As a general rule of thumb (heuristic), key mnemonics should be assigned to characters mapped to the default state of the keyboard for each particular language. For the Latin 1 family of languages, access key mnemonics should generally not be assigned to non-ASCII characters, even if that particular character is accessible on the default state keyboard layout for a particular language. Certainly, access key mnemonics should never be assigned to a character glyph that is formed through combining characters used in languages such as Thai and Arabic. Also, punctuation characters should never be assigned as access key mnemonics.

Inappropriate key assignments

Of course, the default keyboard layout for many non-Latin 1 languages only contains characters in the native script for that language, and assigning non-ASCII characters as access key mnemonics may be the only choice. However, Japanese hiragana and katakana glyphs, Korean Hangeul glyphs, and all East Asian ideographic glyphs are invalid character assignments for access key mnemonics. The default keyboard layout for most East Asian languages (Japanese, Korean, Simplified Chinese, and Traditional Chinese) is the standard keyboard layout similar to the US English keyboard. In the above example, there is no way for a Japanese user to access the ‘My Computer’ (マイ コンピユータ) menu item because it is using a katakana character as an access key mnemonic (which violates several guidelines for access key mnemonic assignment). Also, the standard key mnemonic guidelines described below should be used for all East Asian language versions for consistency and backwards compatibility.

Another general guideline to follow for access key mnemonic assignments is to avoid the lower case Latin letters ‘g’, ‘y’, ‘p’, ‘q’, and ‘j’, and to take care with easily confused pairs such as i and l, or q and g, because these letters can be hard to discern and there is a high probability of confusion, especially at high display resolutions. If the number of controls on a single dialog or in a menu list requires usage of inappropriate key mnemonics, then perhaps the problem is the design of the dialog.

East Asian language versions should use the identical key mnemonics as the English language version. The characters assigned as key mnemonics in the East Asian language are capitalized, enclosed within parentheses, and positioned at the end of the translated string. Even when a key mnemonic appears within words or acronyms which are not translated or transliterated into the target language, the key mnemonic should be relocated to the end of the string and enclosed within parentheses for consistency.

The character assigned as the key mnemonic should be capitalized because many East Asian computer users use an English keyboard, and for users whose native language does not frequently employ Latin characters it is much easier to visually match capitalized key mnemonics with the keys on the keyboard, which are also labeled with capital letters.
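The assignment guidelines above can be turned into simple checks. The sketch below is in Python and makes a few assumptions of its own: the function names are hypothetical, the '&' marker inside the parentheses follows the Win32 resource convention, and the ASCII-only rule is applied as it would be for Latin 1 and East Asian language versions (it would be too strict for languages whose keyboards only carry the native script).

```python
import re

# Lower case letters the guideline above says to avoid.
DISCOURAGED_LETTERS = set("gypqj")

# East Asian convention: a capital ASCII letter (or digit) in parentheses
# at the very end of the translated string, e.g. "...(&F)".
EAST_ASIAN_SUFFIX = re.compile(r"\(&([A-Z0-9])\)$")


def access_key_problem(ch):
    """Return a reason the character is a poor access key, or None if it passes."""
    if not ch.isascii():
        return "not an ASCII character"
    if not ch.isalnum():
        return "punctuation or symbol"
    if ch in DISCOURAGED_LETTERS:
        return "discouraged lower case letter"
    return None


def follows_east_asian_convention(label):
    """True if the label ends with a parenthesized, capitalized access key."""
    return EAST_ASIAN_SUFFIX.search(label) is not None


print(access_key_problem("F"))                         # passes the checks
print(access_key_problem("ユ"))                        # katakana is rejected
print(follows_east_asian_convention("ファイル(&F)"))   # follows the convention
print(follows_east_asian_convention("マイ コンピ&ユータ"))
```

Strings that end in an ellipsis or a colon after the parenthesized key are not handled here; how those should be treated is left as an open question for the team's own style guide.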

Accelerator Keys

Accelerator keys are commonly referred to as shortcut keys.  Accelerator keys are keys (such as F1 – F12 and Esc) or key combinations (Ctrl + Shift + B, or Ctrl + C) that allow users to invoke certain functions without navigating the software menus via access keys or using the mouse to click button controls on a toolbar. Here is a good source for common Windows accelerator keys, and here is one for common accelerator keys for Office products.

Shortcut key combinations are common throughout all language versions. Contrary to the Wikipedia entry on the subject, some language versions localize the letter key (not a mnemonic…it is not underlined). For example, in German the Ctrl key is localized as "Strg" and, although it is generally frowned upon to change the ASCII upper case letter assigned to an accelerator key combination, the Spanish versions of Office use Ctrl+G (Guardar) for Save instead of Ctrl+S, and Ctrl+N (Negrita) for Bold instead of Ctrl+B. Also, letter keys used as part of an accelerator key combination are capitalized. East Asian language versions use Ctrl to designate the Control key. Also, accelerator key combinations do not include an ellipsis after the letter.

In Part IV I will discuss common layout issues such as clipping and truncation.

Localization Testing – Part II

Originally Published Friday, October 30, 2009

It should come as no surprise to anyone that localization testing generally focuses on changes in the user interface, although as mentioned in the previous post these are not the only changes necessary to adapt a product to a specific target market. But, the most common category of localization class bug is usability or behavioral type issues that do involve the user interface. Bugs in this category generally include un-localized or un-translated text, key mnemonics, clipped or truncated text labels and other user interface controls, incorrect tab order, and layout issues. Fortunately, the majority of problems in this category do not require a fix in the software’s functional or business layer. Also, the majority of problems in this category do not require any special linguistic skills in order to identify, and in some cases, an automated approach can be even more effective than the human eye (more on that later).

Perhaps the most commonly reported issue in this category is “un-localized” or un-translated textual strings. Unfortunately, in many cases un-translated strings are also an over-reported problem that only serves to flood the defect tracking database with unnecessary bugs. Translating textual strings is a demanding task, and made even more difficult when there are constant changes in the user interface or contextual meaning of messages early in the product life cycle. Over-reporting of un-translated text too early in the product cycle only serves to artificially inflate the bug count, and causes undue pressure and creates extra work for the localization team.

Identifying this type of bug is actually pretty easy. Here’s a simple heuristic: if you are testing a non-English language version in a language you are not familiar with and you can clearly read the textual string in English, it is probably not localized or translated into the target language. The illustration below provides a pretty good example of this general rule of thumb. A tester doesn’t have to read German to realize that the text in the label control under the first radio button is not German.

There are several causes of un-localized text strings to appear in dialogs and other areas of the user interface. For example:

  • Worst-case scenario is that the string is hard-coded into the source files
  • Perhaps localizers did not have enough time to completely process all strings in a particular file
  • Perhaps this is a new string in a file localizers thought was 100% localized
  • Strings displayed in some dialogs come from files other than the file that generates the dialog, and the localization team has not processed that file
  • And, sometimes (usually not often), a string may simply be overlooked during the localization process

Testing for un-localized text is often a manually intensive process of enumerating menus, dialogs, message boxes, forms, and form elements.  But, if the textual strings are located in a separate resource file (as they should be), a quick scan of resource files might more quickly reveal un-translated textual strings. Of course, there is little context in the resource file, and I also hope the localization team is reviewing their own work as well prior to handing it over to test.
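A resource file scan of this kind might look like the following Python sketch. It assumes the English and localized string tables have already been loaded into dictionaries keyed by resource ID; the function name, the IDs, the sample strings, and the "Contoso Notes" product name are all made up for illustration. Strings that are intentionally left in English, such as product names, are passed in as an allow list.

```python
def untranslated_ids(english, localized, do_not_translate=frozenset()):
    """Return resource IDs whose localized text is missing or still English."""
    suspects = []
    for res_id, source_text in english.items():
        target_text = localized.get(res_id)
        if target_text is None:
            suspects.append(res_id)  # not present in the localized file at all
        elif (
            target_text == source_text
            and source_text not in do_not_translate
            and any(ch.isalpha() for ch in source_text)  # ignore "100%" and the like
        ):
            suspects.append(res_id)
    return suspects


# Hypothetical string tables keyed by resource ID.
english = {101: "Open", 102: "Save As", 103: "Contoso Notes", 104: "100%"}
german = {101: "Öffnen", 102: "Save As", 103: "Contoso Notes", 104: "100%"}

print(untranslated_ids(english, german, do_not_translate={"Contoso Notes"}))
```

An identical string is only a suspect, not a confirmed bug: some words are legitimately spelled the same in both languages, so the output is a list to review with the localization team rather than a list of defects to file.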

Also, here are a few suggestions that might help focus localization testing efforts early in the project milestone and reduce the number of ‘known’ or false-positive un-translated text bugs being reported:

  • Ask the localization team to report the percentage of translation completion by file or module for each test build. Early in the development lifecycle only modules that are reported to be 100% complete which appear to have un-translated text should be reported as valid bugs. Of course, sometimes some strings are used in multiple modules, or may be coming from external resources. But, especially early in the development lifecycle reporting a gaggle of un-translated text bugs is simply “make work.” As the life cycle starts winding down…all strings are fair game for bug hunters!
  • Testers should use tools such as Spy++ or Reflector to help identify the module or other resources, and the unique resource ID for the problematic string or resource. This is much better than simply attaching an image of the offending dialog to a defect report. Identifying the module and the specific resource ID number allows the localization team to effect a quick fix instead of having to search for the dialog through repro steps and track down the problem.
  • Also remember that not all textual strings are translated into a specific target language. Registered or trademarked product names are often not translated into different languages. In case of doubt, ask the localization team if a string that appears un-localized is a ‘true’ problem or not.

Unlocalized strings, usually due to hard-coded strings, also tend to occur in menu items. This is especially true of Windows Start menu or sub-menu items hard-coded in the INF or other installation/setup files. For example, the image on the right shows a common problem on European versions of Windows. Many European language versions localize the name of the Program Files folder, and the menu item in the Start menu. But, often when we install an English language version of software on Windows it creates a new "Programs" menu item (and even a new Program Files directory), rather than detecting the default folder to install to. In the example on the left, the string Accessories is a hard-coded folder name. But, there is another issue as well. This illustrates not only a problem with the non-translated string "Accessories," but also shows one full-width Katakana string for ‘Accessories’ and another half-width string.

In part 3 I will discuss another often problematic area in localization….key mnemonics.

Localization Testing: Part 1

Originally Published Tuesday, October 27, 2009

When I first joined Microsoft 15 years ago I was on the Windows 95 International team. Our team was responsible for reducing the delta between the release of the English version and the Japanese version to 90 days, and I am very proud to say that we achieved that goal and Windows 95 took Japan by storm. It was so amazing that even people without computers were lined up outside of sales outlets waiting to purchase a copy of Windows 95. The growth of personal computers in Japan shot through the roof over the next few years. Today the Chinese market is exploding, and eastern European nations are experiencing unprecedented growth as well.  While the demand for the English language versions of our software still remains high, many of our customers are demanding software that is ‘localized’ to accommodate the customers’ national conventions, language, and even locally available hardware. Although much of the Internet’s content is in English, non-English sites on the web are growing, and even ICANN is considering allowing international domain names that contain non-ASCII characters this week in Seoul, Korea.

But, a lot has changed in how we develop software to support international markets. International versions of Windows 95 were developed on a forked code base. Basically, this means the source code contained #ifdefs to instruct the compiler to compile different parts of the source code depending on the language family. From a testing perspective this is a nightmare, because if the underlying code base of a localized version is fundamentally different than the base (US English) version then the testing problem is magnified because there is a lot of functionality that must be retested. Fortunately today, much software being produced is based on a single-worldwide binary model. (I briefly explained the single world wide binary concept at a talk in 1991, and Michael Kaplan talks about the advantages here.) In a nutshell, a single worldwide binary model is a development approach in which any functionality any user anywhere in the world might need is codified in the core source code so we don’t need to modify the core code once it is compiled to include some language/locale specific functionality.  For example, it was impossible to input Japanese text into Notepad on an English version of Windows 95 using an Input Method Editor (IME); I needed the localized Japanese version. But, on the English version of Windows Xp, Vista, or Windows 7 all I have to do is install the appropriate keyboard drivers and font files and expose the IME functionality. In fact, these days I can map my keyboard to over 150 different layouts and install fonts for all defined Unicode characters on any language version of the Windows operating system.

The big advantage of the single worldwide binary development model is that it allows us to differentiate between globalization testing and localization testing.  At Microsoft we define globalization as “the process of designing and implementing a product and/or content (including text and non-text elements) so that it can accommodate any locale market (locale).” And, we define localization as “the process of adapting a product and/or content to meet the language, cultural and political expectations and/or requirements of a specific target market.” This means we can better focus on the specific types of issues that each testing approach is most effective at identifying. For localization testing, this means we can focus on the specific things that change in the software during the “adaptation processes” to localize a product for each specific target market.

The most obvious adaptation process is the ‘localization’ or actually the translation of the user interface textual elements such as menu labels, text in label controls, and other string resources that are commonly exposed to the end user. However, the translation of string resources is not the only thing that occurs during the localization process. The localization processes that are required to adapt a software product to a specific locale may also include other changes such as font files and drivers installed by default, registry keys set differently, drivers to support locale specific hardware devices, etc.

3 Categories of Localization Class Bugs

I am a big fan of developing a bug taxonomic hierarchy as part of my defect tracking database as a best practice because it enables me to analyze bug data more efficiently. If I see a particular category of bug or a type of bug in a category that is being reported a lot, then perhaps we should find ways to prevent or at least minimize the problem from occurring later in the development lifecycle. After years of analyzing different bugs, I classified all localization class bugs into 3 categories: functionality, behavioral/usability, and linguistic quality.

Functionality type bugs exposed in localized software affect the functionality of the software and require a fix in the core source code. Fortunately, with a single worldwide binary development model where the core functionality is separated from the user interface the number of bugs in this category of localization class bugs is relatively small.  Checking that the appropriate registry keys are set and files are installed in a new build is reasonably straightforward and should be built into the build verification test (BVT) suite. Other types of things that should be checked include application and hardware compatibility. It is important to identify these types of problems early because they are quite costly to correct, and can have a pretty large ripple effect on the testing effort.

Behavioral and usability issues primarily impact the usefulness or aesthetic quality of the user interface elements. Many of the problems in this category do not require a fix in the core functional layer of the source code. The types of bugs in this category include layout issues, un-translated text, key mnemonic issues, and other problems that are often fixed in the user interface form design, form class, or form element properties. This category of problems often accounts for more than 90% of all localization class bugs. Fortunately, the majority of problems in this category do not require any special linguistic skills; a careful eye for detail is all that is required to expose even the most discreet bugs in this category.

The final category of localization class bug is linguistic quality. Bugs in this category are primarily mis-translations. Obviously, the ability to read the language being tested is required to identify most problems in this category of errors. We found testers spent a lot of time looking for this type of bug, but later found the majority of linguistic quality type issues reported were resolved as won’t fix. There are many reasons for this, but here is my position on this type of testing….Stop wasting the time of your test team to validate the linguistic quality of the product. If this is a problem then hire a new localizer, hire a team of ‘editors’ to review the work of the localizer, or hire a professional linguistic specialist from the target locale as an auditor. Certainly, if testers see an obvious gaffe in a translation then we must report it; however, testers are generally not linguistic experts (even in their native tongue), and I would never advocate hiring testers simply based on linguistic skills nor as a manager would I want to dedicate my valuable testing resources to validating linguistic quality…that’s usually not where their expertise lies, and it probably shouldn’t be.

What’s Next

Since behavioral/usability category issues are the most prevalent type of localization class bug, this series of posts will focus on localization testing designed to evaluate user interface elements and resources. The next post will expose what is often the single most reported bug in this category.

Adding Variability in Test Case Design

Published Tuesday, October 20, 2009

I love autumn! Yes, I am definitely a boy of summer and very much prefer warmer weather; however, there is something special about autumn. This past weekend my daughter, and my 2 friends Dongyi and her husband Yuning and I participated in the Rum Run sailboat fun race with an overnight raft up at Bainbridge Island’s Port Madison. Saturday morning was quite rainy, but the wind was blowing 15 knots with gusts to 25 knots and NOAA weather radio announcing gale force warnings in Puget Sound. Wow…what a ride! But, it was actually the rather relaxing sail back to my marina on Sunday morning that rekindled the beauty of autumn in my mind. The bright reds, golden yellows, and pastel browns of the foliage seemed to blend into a collage framed by the darkness of the waters of Puget Sound and the snow covered peaks of the Olympic mountains. The beauty of autumn reminds me about change. A sloughing of the old, the cleansing brought about by the pure white snows, eventually followed by the new and fresh growth that blossoms in spring.

Just as the earth goes through variable cycles of rejuvenation, we must also continually update our tests, and (more importantly) the test data we use in our test cases to prevent them from becoming stale. Trees shed their leaves in the autumn and new leaves emerge in the spring, but the tree is fundamentally still the same tree. Similarly, a well-designed test case has a unique fundamental purpose and by changing the variables we can grow the value of that test case. Of course, the cycle of change in test data should be dramatically shorter in duration as compared to the seasonal changes of mother earth.

Here is a simple example of how a well-designed test case using variable test data can increase the value of the information each  test iteration provides through increased confidence and also potentially reduce overall risk. In my role at Microsoft I am in a unique position to not only conduct controlled studies, but I can also implement ideas into practice on enterprise level software projects. One experiment I started about 2 years ago involved multiple groups of testers (sessions) located around the world divided into 3 separate control groups. Each control group tested the identical web page that would display the stock price if the user input a valid stock ticker symbol into a single textbox on the page and pressed the OK button. The only difference in the control groups was the instructions to perform single positive test case with the specific purpose of “ensure any valid stock ticker symbol displays the current stock price for the publicly traded stock specified by its symbol.” The purpose of the study was to determine if different cultural and experiential backgrounds impacted the test data used in a test based on the instructions for a test case. The study collected demographic information on the participants as well as specific inputs applied to the web page. Information on the oracle used by the students was collected anecdotally. Step one in each test was identical because we were not interested in how the tester launched the browser. (Of course this assumes there are other tests that test the multitude of ways to launch a browser and navigate to a URL. Also, if the browser failed to launch the test case is blocked.)

Group 1 was given the vaguest instructions for the test case. The instructions were simply:

  1. Launch browser and navigate to [url address]
  2. Enter a valid stock ticker symbol and press the OK button and verify the accuracy of the returned stock price.

The instructions in the test case given to Group 2 were also somewhat vague, but provided a little guidance both on input and oracle.

  1. Launch browser and navigate to [url address]
  2. Enter a valid stock ticker symbol (e.g. “MSFT”)
  3. Press the OK button
  4. Verify the returned stock price is identical to the current stock price listed on the appropriate exchange

Group 3 had similar instructions to Group 2, but the group was given additional guidance as indicated below.

  1. Launch browser and navigate to [url address]
  2. Enter a valid stock ticker symbol from a publicly traded stock listed on any public stock exchange
    Listings of valid stock ticker symbols are on stock exchange web sites such as:
    http://www.nyse.com
    http://www.eoddata.com/Symbols.aspx
    http://www.nasdaq.com
    http://www.londonstockexchange.com
  3. Press the OK Button
  4. Verify the returned stock price is identical to the current stock price listed on the appropriate exchange

Results

The results were mostly not surprising, but rather reinforcing. For example, we expected the inputs from Group 1 to be rather random, but mostly aligned with ticker symbols the testers were familiar with. Of course, the majority (90%) of the stock ticker symbols entered were MSFT, and there was no significant difference by cultural background, locale, experience or educational background. (As this study was conducted at Microsoft I am sure there was some bias as to the symbol entered.) What was most interesting was that testers with no formal training (no previous courses in testing, no CS degree, and fewer than one discipline-specific book read) and with more than 2 years of test experience were approximately 25% more likely to violate the purpose of the test and enter random or completely invalid data as their first action. In other words, instead of executing the required test, their initial reaction was to immediately go on a bug hunt.

In Group 2, 99% of the participants simply entered the stock ticker symbol “MSFT.” But what was even more surprising was that on the next day the same people in that group were given the exact same test, and 95% of them simply reentered MSFT. Perhaps this is laziness, perhaps it is related to the superficial nature of the study, or perhaps it is due to individuals taking the path of least resistance. The percentage of people who entered identical stock ticker symbols on consecutive days was not significantly different between Group 1 and Group 2.

It should be no surprise that Group 3 had the greatest distribution of variable test data applied to the web page. Demographics had no impact on any of the people who were in Group 3. The majority of people in Group 3 (78%) selected the first stock exchange listed (regardless of what link it was), but there was no significant overlap in the selected stock ticker symbols. When asked to repeat the test on the next day, 83% of the participants selected a different link and a different symbol. Of those who selected the same link, 97% selected a different stock ticker symbol. On the down side, approximately 4% of the people simply took the path of least resistance and input MSFT as the test data on both days of the experiment.

Conclusion

One of the most common problems I hear about ‘scripted,’ or pre-defined, test cases is that they are too prescriptive and not flexible enough to allow the tester to try things. Of course, a well-designed test case is not simply a prescriptive set of steps that inputs the same hard-coded test data over and over. So, in this study we made the assumption that a scripted test case that specified “Enter MSFT in the textbox” would simply result in the tester entering “MSFT” without any thinking on the part of the tester. Hard-coding variable test data is often the worst possible way to design a test case.

Vaguely written test cases added some level of variability, but also seemed to increase the probability of the tester executing context free tests outside the scope of the purpose of the test. In fact, what we found was some testers (approx 2%) simply went on a bug hunt and never actually input a valid stock ticker symbol at all during the session.

A test case that provided only one example that is representative of the type of test data required for the test case produced the least desirable results in this study. I am not sure this would be the case in practice. However, based on this study if I were to outsource execution of a test case similar to that used by group 2 the only thing I could guarantee is that MSFT would definitely be tested numerous times, and the variability of other test data would be extremely limited regardless of the number of testers executing that test or the number of iterations.

When faced with a virtually infinite number of possibilities for input variables as test data used in either positive or negative tests, we need to test as many possibilities as possible given the available resources in order to increase test coverage and reduce overall risk. So, one way to increase the coverage of test data while still achieving the specific purpose of the test case is to provide useful resources that help guide the tester, while relying on the tester’s creative thinking skills and curiosity to expand the test coverage.

Of course, we can also increase the variability of test data and capture the essence of the tester’s creativity using a similar approach in a well-designed automated test case. In fact, a similarly designed automated test case enables us to significantly increase the amount of variable test data that is exercised in order to expand test coverage and increase overall confidence.
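As a rough illustration of that idea, the following C# sketch selects a different valid ticker symbol for each iteration of the same positive test. Everything named here is an assumption made for illustration: the TickerTest class, the short symbol list (a real test would load symbols from exchange listings like those given to Group 3), and the two delegates that stand in for the page under test and the oracle. The seed is passed in so that a failing run can be repeated with the same data.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; none of these names come from the study.
public static class TickerTest
{
    // Hypothetical sample of valid symbols; a real test would read a much
    // larger list from the exchange listings.
    public static readonly string[] ValidSymbols = { "MSFT", "IBM", "GE", "KO", "F" };

    // Runs the same positive test with a randomly selected valid symbol on
    // each iteration and returns the symbols whose displayed price did not
    // match the oracle. Both delegates are placeholders for real automation.
    public static List<string> Run(
        int seed,
        int iterations,
        Func<string, decimal> getDisplayedPrice,
        Func<string, decimal> getOraclePrice)
    {
        Random random = new Random(seed);
        List<string> failures = new List<string>();

        for (int i = 0; i < iterations; i++)
        {
            string symbol = ValidSymbols[random.Next(ValidSymbols.Length)];
            if (getDisplayedPrice(symbol) != getOraclePrice(symbol))
            {
                failures.Add(symbol);
            }
        }

        return failures;
    }
}
```

The purpose of the test case stays fixed (a valid symbol must display the current price) while the data varies from run to run, which is the property Group 3 showed in the manual study.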

Randomizing Static Test Data in Automated Tests

Originally Published Sunday, October 11, 2009

A significant percentage of static test data is stored in tabular comma-delimited or tab-delimited formats, or saved in Excel spreadsheets. Reading comma or tab-delimited static test data into an automated test is pretty straightforward, and there are numerous examples in many programming languages illustrating how to read these types of test data repositories. Reading in rows of data is the foundation of data-driven automation and definitely has its place in any automation project.

I am a big proponent of stochastic (random) test data generation that is customized to the context, but I also know that sometimes static test data is useful for establishing baselines and more exact emulation of ‘real-world’ customer-like inputs. But, if the automated test is simply passing the same variable arguments to the same input parameters in the same order over and over again the value of subsequent iterations of that automated test using that static data set diminishes rather quickly. So how can we more effectively utilize static test data in our automated tests?

One possible solution is to randomly select an argument from a collection of static variables that is passed to the specific input parameter. The advantage of this approach is that it effectively increases the test data permutations in each iteration of the test case. For example, let’s consider 2 input parameters; one for a given name and one for a surname. In a traditional data-driven approach in which the static test data is read in by rows our test data file might be similar to:

Bob,Smith
John,Johnson
Roger,Williams
Steve,Abbot

This static data file would give us 4 sets of test data, but each time the test data is read into the test case each given name is always paired with the same surname.

However, if we read the given names and surnames into 2 collections, and then randomly select a given name and a surname from the appropriate collection to pass to the respective parameter, we effectively have 16 (4 × 4) possible combinations of static test data to work with. An advantage of this approach is that our ‘collections’ of given names and surnames can contain differing numbers of elements (in which case the number of possible combinations of test data is the size of the Cartesian product of the collections, that is, the product of the number of elements in each collection).
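A minimal C# sketch of this selection is shown here. The NamePicker class and its method names are hypothetical and used only for illustration; the names themselves come from the sample data file above.

```csharp
using System;

// Illustrative sketch; the class and method names are hypothetical.
public static class NamePicker
{
    // Number of distinct pairs: the product of the two collection sizes.
    public static int CombinationCount(string[] givenNames, string[] surnames)
    {
        return givenNames.Length * surnames.Length;
    }

    // Selects a given name and a surname independently of each other, so
    // any given name can be paired with any surname.
    public static string[] PickPair(string[] givenNames, string[] surnames, Random random)
    {
        string givenName = givenNames[random.Next(givenNames.Length)];
        string surname = surnames[random.Next(surnames.Length)];
        return new[] { givenName, surname };
    }
}
```

With the 4 given names and 4 surnames from the sample file, CombinationCount returns 16; if one surname were removed it would return 12, and PickPair would work unchanged.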

Of course, there are many ways to accomplish this. For example, one approach is to continue to use a comma or tab-delimited file format and list given names in one row and surnames in a second row. Another approach is to list the given names and surnames in columns in a spreadsheet and read each column into a collection of some sort. The latter is the approach I used in developing my PseudoName test data generator tool. I chose this approach for 2 reasons: first, an Excel spreadsheet is a simple yet powerful file format for storing static test data, and second, lists of test data are sometimes better represented in columns rather than rows.
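The first approach (one collection per row of a delimited file) might be sketched in C# as follows. This is not the PseudoName implementation; the RowCollectionReader name is hypothetical, and the simple Split call assumes that no value contains the delimiter or quoted fields.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Illustrative sketch; the class name is hypothetical.
public static class RowCollectionReader
{
    // Reads each non-empty line as one collection of values, e.g. row 1
    // holds the given names and row 2 holds the surnames.
    // Assumption: values never contain the delimiter or quoted fields.
    public static string[][] ReadCollections(TextReader reader, char delimiter)
    {
        List<string[]> collections = new List<string[]>();
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            if (line.Trim().Length == 0)
            {
                continue; // skip blank lines
            }

            string[] values = line.Split(delimiter);
            for (int i = 0; i < values.Length; i++)
            {
                values[i] = values[i].Trim();
            }

            collections.Add(values);
        }

        return collections.ToArray();
    }
}
```

Because each row becomes its own array, the rows do not need to contain the same number of values. A file can be read by passing a StreamReader opened on it, with ',' or '\t' as the delimiter.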

The following code shows one way to read in test data by columns from an Excel spreadsheet.

// <copyright file="datareader.cs" company="TestingMentor">
// Copyright © 2009 by Bj Rollison. All rights reserved.
// </copyright>

namespace TestingMentor.TestTool
{
  using System;
  using System.Collections.Generic;
  using System.Globalization;
  using System.Threading;
  using Excel = Microsoft.Office.Interop.Excel;

  /// <summary>
  /// This class contains methods for reading test data from Excel spreadsheets
  /// </summary>
  public class TestDataReader
  {
    /// <summary>
    /// This method reads all the data elements in the specified number of
    /// columns in the specified Excel spreadsheet containing the test data
    /// and copies the data into a jagged array (one string array per column)
    /// </summary>
    /// <param name="dataFileName">The filename containing the test data</param>
    /// <param name="columnCount">The number of columns in the Excel
    /// spreadsheet to read</param>
    /// <returns>A jagged array containing the data elements for
    /// each column</returns>
    public static string[][] ExcelColumnReader(string dataFileName, uint columnCount)
    {
      Excel.Application excelApp = null;
      Excel.Workbook excelWorkbook = null;
      Excel.Worksheet excelActiveWorksheet = null;
      string[][] testData = new string[columnCount][];

      // Run the Excel calls under the en-US culture and restore the
      // original culture when finished
      CultureInfo originalCulture = Thread.CurrentThread.CurrentCulture;
      Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");

      try
      {
        excelApp = new Excel.Application();
        excelWorkbook = excelApp.Workbooks.Open(
          dataFileName,
          0,
          false,
          5,
          String.Empty,
          String.Empty,
          false,
          Type.Missing,
          String.Empty,
          true,
          false,
          0,
          true,
          false,
          false);
        excelActiveWorksheet = (Excel.Worksheet)excelWorkbook.ActiveSheet;

        for (int i = 0; i < columnCount; i++)
        {
          // Excel columns are 1-based, so start at column 1
          int columnIndex = i + 1;

          // Row 1 is the column title; test data starts on Row 2
          int rowIndex = 2;
          List<string> columnValues = new List<string>();
          object cellValue;

          // Read down the column until the first empty cell
          while ((cellValue =
            ((Excel.Range)excelActiveWorksheet.Cells[rowIndex, columnIndex]).Value2) != null)
          {
            // Value2 returns a double for numeric cells, so convert the
            // value rather than cast it to string
            columnValues.Add(Convert.ToString(cellValue, CultureInfo.InvariantCulture));
            rowIndex++;
          }

          testData[i] = columnValues.ToArray();
        }
      }
      finally
      {
        // Clean up, even if opening or reading the workbook failed
        if (excelWorkbook != null)
        {
          excelWorkbook.Close(false, Type.Missing, Type.Missing);
        }

        if (excelApp != null)
        {
          excelApp.Quit();
        }

        excelActiveWorksheet = null;
        excelWorkbook = null;
        excelApp = null;

        // Garbage collection is not pretty, but necessary to release Excel proc
        GC.Collect();
        GC.WaitForPendingFinalizers();

        Thread.CurrentThread.CurrentCulture = originalCulture;
      }

      return testData;
    }
  }
}
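Once the columns are in memory, selecting the test data for one iteration is a matter of picking one element at random from each column. The sketch below works on any jagged array shaped like the one ExcelColumnReader returns; the TestDataSelector class is a hypothetical helper and not part of the original tool.

```csharp
using System;

// Illustrative sketch; this helper is hypothetical.
public static class TestDataSelector
{
    // Returns one randomly selected element from each column, in column
    // order, so element 0 feeds the first input parameter and so on.
    public static string[] SelectOnePerColumn(string[][] columns, Random random)
    {
        string[] selection = new string[columns.Length];

        for (int i = 0; i < columns.Length; i++)
        {
            if (columns[i].Length == 0)
            {
                throw new ArgumentException(
                    "Column " + i + " contains no test data.", "columns");
            }

            selection[i] = columns[i][random.Next(columns[i].Length)];
        }

        return selection;
    }
}
```

A test would typically call ExcelColumnReader once, keep the returned array, and call SelectOnePerColumn on each iteration. Creating the Random instance from a logged seed makes it possible to reproduce the exact combination that exposed a failure.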

I must tell you that performance can be an issue, especially if the columns contain a lot of data. For example, reading approximately 700 elements of test data in 3 separate columns took slightly less than 1 second, and reading 1800 elements in 3 columns required just over 4 seconds. Unfortunately, I didn’t compare total byte counts, but it is pretty obvious that the greater the number of test data elements being read, the longer the read operation will take, and you will certainly have to take the read time into consideration in your automated test case.
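One way to account for the read time is to measure it when the data is loaded, and to load the data once per test run rather than once per iteration. The following sketch wraps any loader in a Stopwatch; the TimedLoader class is a hypothetical helper, and the loader is passed in as a delegate so the same code works for an Excel reader or a delimited-file reader.

```csharp
using System;
using System.Diagnostics;

// Illustrative sketch; this helper is hypothetical.
public static class TimedLoader
{
    // Invokes the supplied loader once and reports how long the read took.
    public static string[][] Load(Func<string[][]> readTestData, out TimeSpan elapsed)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        string[][] testData = readTestData();
        stopwatch.Stop();

        elapsed = stopwatch.Elapsed;
        return testData;
    }
}
```

The elapsed time can be written to the test log so that the data load is not mistaken for time spent in the application under test.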

Reading static test data line by line from a data file while looping through a data-driven automated test case is a useful test design approach in some situations. Reading the data by columns is another useful approach, one that allows the test designer to randomize the combinations of static test data values applied to multiple input parameters in multiple iterations of an automated test case.