Archive for November, 2009
Tonight on my way to teach a test automation course at the University of Washington I had some free time to catch up on my reading. My manager asked me if I had read this month’s copy of one of the several testing magazines we get and I replied that I had downloaded it but hadn’t had a chance to read it yet. So, he tossed me the hardcopy of the magazine and said, “Enjoy.” Now this should have been a clue because although Alan is a great manager and mentor, I think he secretly likes to see the veins in my neck swell and blood shoot out of my eyes from time to time.
I read a lot of articles, white papers, and books. I like most of what I read, even when I disagree with some of the points being made. In fact, I can't remember an article on software testing ever making me angry until this one. I was not angry because of the message of the article; the point the authors are trying to make is valid, and I agree with them on their fundamental point. Unfortunately, the article is filled with so many technical inaccuracies that the end message was almost lost.
I have spent the last 10 years studying various techniques, methods, and approaches in software testing. I teach structural testing techniques to more than 500 testers a year, and am now working with a team in the Windows division to implement a new tool for just-in-time code coverage analysis at the component level that allows us to see how our tests exercise code paths in changed code and the dependent modules. I also discuss structural testing in chapter 5 of our book How We Test Software at Microsoft. I don't really consider myself to be an expert in the subject, but I might know a thing or two about it. So, let's Reconsider Code Coverage!
In August 2007 I wrote a blog post on the potential misuse of the code coverage measure. Code coverage measures are used by some companies as one of many ways to help them reduce risk. But, let me be very clear here: there is no correlation between code coverage and quality, and code coverage measures don't tell us "how well" the code was tested. The code coverage measure simply tells us what code has been executed, and more importantly what code has not been executed. The value of measuring code coverage is not in producing some "magic number," but in helping testers investigate untested or under-tested areas of the product and design additional tests (generally using structural testing techniques) to improve coverage and reduce overall risk.
Just because you execute a line of code doesn't mean a bug doesn't still exist, but if you don't execute a line of code you have zero probability of finding a bug if one exists!
Also, it is important to note that there are several ways to measure code coverage. Different tools employ different measures, and sometimes different tools measure the same type of coverage differently. I have even discovered that the same tool can measure the same code differently depending on how it is compiled (debug, retail, etc.), and I previously wrote about that study. Some of the basic ways to measure code coverage (not test coverage) include:
- Function coverage measures the percentage of functions or methods in a class or application that are called at runtime.
- Statement coverage measures the percentage of executable statements exercised at runtime.
- Block coverage measures the percentage of each sequence of non-branching statements that are executed at runtime. Block coverage subsumes statement coverage.
- Decision or branch coverage measures the percentage of both Boolean (not binary) outcomes (true and false) of simple conditional expressions at runtime. If a predicate statement has more than one conditional sub-expression, decision (or branch) coverage treats that predicate statement as a single conditional clause. Decision coverage subsumes block coverage.
- Condition coverage measures the percentage of both Boolean outcomes of each conditional sub-expression separated by a logical AND or a logical OR in compound predicate statements. Condition coverage subsumes decision coverage. (The sketch after this list illustrates the difference between decision and condition coverage.)
- Basis path coverage measures the number of linearly independent paths through a program. Basis path coverage is based on McCabe’s cyclomatic complexity research.
- Path coverage measures every possible path from the entry point to the return statement (or exception or exit) of every method. Unfortunately, path testing is usually impossible due to the sheer number of path combinations, and the inability to execute constrained path combinations.
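To make the difference between decision and condition coverage concrete, here is a small C# illustration (the function and test values are hypothetical, my own example rather than anything from the article):

```csharp
class CoverageExample
{
    // Hypothetical function with one compound predicate containing two
    // conditional sub-expressions.
    static bool IsOutOfRange(int n)
    {
        return n <= 1000000 || n > 6000000;
    }

    // Decision (branch) coverage treats the whole predicate as a single Boolean:
    //   n = 500      -> predicate is true
    //   n = 3000000  -> predicate is false
    // Two tests achieve 100% decision coverage.
    //
    // Condition coverage requires both outcomes of EACH sub-expression:
    //   n = 500      -> (n <= 1000000) true;  (n > 6000000) skipped (short-circuit)
    //   n = 7000000  -> (n <= 1000000) false; (n > 6000000) true
    //   n = 3000000  -> (n <= 1000000) false; (n > 6000000) false
}
```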
Clearly there are different measures of code coverage, and certain types of measures subsume other measures. So, now that we have a handle on the different types of code coverage measures, let's look at testing some code. We will use the same pseudo code used in the aforementioned article, which is based upon the following requirement:
“Student IDs are seven digit numbers between one million and 6 million inclusive.”
The authors provided the following pseudo code example for a function to meet this requirement.
```
function validate_studentid(string sid) return
    STATIC TRUEFALSE isOk;
    isOk = true;
    if (length(sid) is not 7) then
        isOk = False;
    if (number(sid) <= 1000000 or number(sid) > 6000000) then
        isOk = False;
```
A couple of observations before we start. There is no reason to 'test' the length of the sid variable before evaluating whether it is within the allowable range (removing this first conditional improves both the performance and the testability of the code). Also, if the call to the number() function fails to convert the string to a number for a valid Boolean comparison, it will throw an unhandled exception. With that said, let's look at path testing of this simple example by starting with control flow diagrams of each possible path (assuming the call to the number() function does not throw an unhandled exception when passed a string of characters such as "foo" rather than a string of digits).
(Edited 11/25: After thinking about this a bit more, if the number() function returned a 0 (zero) when the input was incorrectly formatted, then the number() function would not throw an exception, and the control flow path would be identical to the first test in the table below.)
Because we are doing path coverage testing and not decision testing, we actually have to separate each Boolean conditional sub-expression in the second compound predicate statement, if (number(sid) <= 1000000 or number(sid) > 6000000). The example in the article treated both sub-expressions in the compound predicate statement as a single Boolean expression, which would be synonymous with decision coverage. Path coverage actually treats each sub-expression as if there were 2 single Boolean conditions, such as:
```
if (number(sid) <= 1000000) then
    isOk = False;
if (number(sid) > 6000000) then
    isOk = False;
```
The table below illustrates the tests required for testing control flow through this function for path coverage (again assuming we are going to ignore the unhandled exception in the code that would occur by passing in a string such as “foo.”)
[Table: the path coverage tests, with a column for the outcome of each conditional expression, including "number <= 1 mill" and "number > 6 mil"]
The first test would be a value less than 7 digits, which would cause the first two Boolean conditional expressions to evaluate as true, setting the isOk variable to false (twice), and we correctly return the expected result of false (or invalid ID). The second test is a number greater than 6,000,000 (but less than the maximum value that would result in an overflow exception hopefully being thrown by the number() function). In this case the 3rd conditional expression (if (number(sid) > 6000000)) would evaluate as true and the function would return false. The 3rd path is buggy. In this pseudo code example, the only possible way to exercise the true outcome of the Boolean condition if (number(sid) <= 1000000) is to use the value of 1,000,000; any other value larger or smaller will cause this Boolean condition to evaluate as false. In this case we expect the function to return true (1,000,000 is a valid ID according to the requirement), but it in fact returns false. Finally, any number from 1,000,001 through 6,000,000 will return a true result indicating a valid student ID.
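Incidentally, here is one hedged C# sketch of how the function might be written to avoid both of the problems noted above (the unhandled conversion exception and the boundary bug); this is my own illustration, not code from the article:

```csharp
// Returns true only for IDs from 1,000,000 through 6,000,000 inclusive.
static bool ValidateStudentId(string sid)
{
    int id;
    // TryParse avoids the unhandled exception thrown when the input is a
    // string of characters such as "foo" rather than a string of digits.
    if (!int.TryParse(sid, out id))
        return false;

    // The separate length check is unnecessary; the range check subsumes it.
    // Note >= (not >) so that the boundary value 1,000,000 is valid.
    return id >= 1000000 && id <= 6000000;
}
```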
The article also suggests that structural testing misses other problems. But when we look at these issues, they actually have nothing to do with structural testing of the function; in other words, they are completely out of the context of the problem being discussed.
For example, the article asserts the requirement is incorrect and should have read 6,999,999 (which I believe is a typo and should be 5,999,999) because of confusion over the word "inclusive." Inclusive means "including the stated limit or extremes in consideration or account," but in computing inclusive means "the predicate holds for all elements of an increasing sequence then it holds for their least upper bound." I disagree with this assumption because I suspect the analyst writing the spec based the inclusive range on the common definition, and not on a definition from domain theory.
The article questions what would occur with incorrectly formatted numbers such as 123 456 789 or 123,456,789. Besides the fact that these values are not within the valid range of student ID numbers, the answer to the question actually lies in how the number() function being called handles improperly formatted numbers (e.g., throwing a format exception, which again is unhandled in our validate_studentid() function), or in how an event handler that sits between the UI and the function might deal with invalid or incorrectly formatted inputs.
The next question concerned resizing of the input window or the screen (assuming desktop resolution), repainting the window or form, and its effect on code coverage of the validate_studentid() function. Well, I am going out on a limb here and I am going to say…"what are you talking about?" I am not quite sure how to phrase this, but let me try…resizing or repainting a window has zero effect on the structural control flow of the validate_studentid() function. (Of course, I could be wrong, and the length() or number() functions might have some code that mysteriously interacts with the repainting libraries and how they determine the length of a string or whether a string is a valid number.)
Bugs in external libraries are part of the business. Hopefully those external libraries are well tested, or at least documented, especially if our development team wrote them. Personally, I have not encountered any public functions or APIs which use wild-ass random numbers such as 5.8 million as boundary values, but that's not to say it couldn't happen. And of course, if these external functions throw exceptions (as they should, based on what they are probably doing), we should have exception handler code in our function to deal with any exceptions thrown from external libraries or function calls.
Based on incorrect path analysis, and out-of-context questions that have nothing to do with control flow through the validate_studentid() function, the article suggests that path testing is not a magic potion, but I am not too sure that anyone actually believes it is. And so, the article suggests that "input combinatorics coverage" might work better. Hmm…now, I have been teaching combinatorial testing for over 10 years and have read some interesting papers on the effectiveness of combinatorics on statistical testing and code coverage, and I must say I am pretty sure you need more than one input parameter for combinatorial testing!
Finally, I don't agree that code coverage measures tell us "how well the developers have tested their code." The code coverage measure only tells us what percentage of the code has been executed in a particular way, and more importantly what percentage of the code has not been executed. We must then determine whether we need to investigate that area to reduce risk. Of course, many code coverage tools provide a "heat map" that helps us and developers identify untested code, and that is where we shift from the simple act of measuring coverage to the testing method of code coverage analysis, in order to design new tests that effectively exercise previously untested code if that level of coverage is important to reduce overall risk.
My intent here is not to ridicule the authors of the article. In fact, I agree with their summation that testers should not believe high code coverage numbers mean "well tested." (Again, see my blog post from August 2007.) Unfortunately, the path to that point was so fraught with inaccuracies and tangents that I almost didn't make it to the end.
There are many books and white papers on this subject in the ACM and IEEE libraries. Books by Boris Beizer, Robert Binder, and others go into great detail on structural testing. McCabe's papers linked to in this post are also excellent resources.
OK…I feel better now. I need to clean up the blood, take a sedative, and go to sleep.
Some of you know that I am an organic gardener. I do this not only because I like the fresh taste of vegetables and fruits that are not tainted with chemicals, but because my daughter loves to eat things right off the vine or stem as we harvest our bounty, and I am perhaps overly protective of my daughter. Here in the Pacific Northwest slugs are a particular problem, and I must use several different techniques and approaches throughout the year to ward off or destroy these nasty, slimy creatures because no single approach is 100% effective. Yes, it is unfortunately true that even with my diligent efforts and myriad tactics a few slugs still get into the raised beds. This sure seems a lot like software testing: some tests find some bugs and miss others, and iterative builds seem to allow new bugs to continually creep in. But, this really isn't a revelation.
In 1983, in his seminal book Software Testing Techniques, Boris Beizer compared the diminishing effects of insecticides on boll weevils destroying cotton fields to the decreasing effectiveness of testing methods in exposing defects in software. This became well known as the Pesticide Paradox. The Pesticide Paradox states “Every method you use to prevent or find bugs leaves a residue of subtler bugs against which those methods are ineffectual.”
Now, unlike insects (or pests), software doesn't magically build up an immunity to our testing methods. What happens is that we design and execute tests using one approach, and that approach exposes some set of issues. But then the number of issues being exposed by that approach starts to diminish, yet there is still a 'residue of subtler bugs' that hasn't been detected by that set of tests or testing approach. Also, just as the ladybugs in my garden that eat aphids and mites have no effect on slugs, not all approaches or techniques we might use in our testing are effective in detecting or preventing all types of bugs.
For the past 10 years, I have been teaching various software testing practices and approaches for solving complex problems at Microsoft. One cool aspect of my job is that I get to experiment a lot in a reasonably controlled environment with diverse groups of people. We often design these studies to better understand the benefits and limitations of various testing approaches and methods in exposing different categories of defects, and how each approach can be used more effectively within the appropriate context. Of course, it should be of no surprise to anyone that the pesticide paradox holds true not just in the classroom, but also in practice.
I often explain Beizer’s paradox by stating, “there is no single approach or method used in software testing that is completely effective in exposing all bugs, and some approaches or methods are more effective in exposing different types or categories of bugs.”
At Microsoft there is no "one size fits all" solution, and the Engineering Excellence group doesn't dictate how to test or what testing methods can or can't be used. But, through a series of problem solving exercises in our new SDET training program, each tester experiences the benefits and limitations of various approaches and techniques. Based on their experiences in the classroom and in 'real life' they also learn that the most effective strategy for testing is not to rely too heavily on a single approach, but to use a variety of test design principles and patterns throughout the product lifecycle. But, that's just how we roll.
[Image: Software Testing Techniques, 2nd Edition]
Originally Published Thursday, November 12, 2009
The past series of posts has focused on localization testing, and specifically on the largest category of localization class issues reported by testers performing localization testing: what we categorize as usability/behavioral type issues, because they adversely impact the usability of the software and how end users interact with the product. This is the last post in this series, but I do intend to publish a more complete paper covering localization testing in the near future…stay tuned. This final post discusses issues that affect the layout of controls on a dialog or window, generally referred to as clipping or truncation.
Clipping occurs when the top or bottom portion of a control (including label controls that contain static text) is cut off and the control or the control's contents do not display completely, as illustrated below. Clipping and truncation are quite common in East Asian language versions because the default font size used in Japanese, Korean, and Chinese language versions is a 9 point font instead of the 8 point font used in English and other language versions. Clipping often occurs because developers fail to size controls adequately for larger fonts (especially common in East Asian language versions), or for display resolutions set to custom font sizes. Clipping also occurs because many localization tools are incapable of displaying a true WYSIWYG or runtime view of dialogs, requiring localizers to 'guess' when resizing controls on dialog layouts.
It is possible to test for potential clipping and truncation problem areas without a localized application. The English language version should function and display properly on all localized language versions of the Windows operating system. So, one way to check for potential clipping or truncation issues is to install the English language version of the application under test on an East Asian language version of the Windows operating system. Another method to test for potential clipping and truncation issues is to change the Windows display appearance or the custom font size via the Display Properties control panel applet.
However, due to most current localization tools' inability to dynamically resize controls and dialogs, and their inability to display dialogs at runtime or present a true WYSIWYG view during the localization process, the localized language versions must also be tested for clipping and truncation caused by improper sizing and layout of controls.
Truncation is similar to clipping, but typically occurs when the right side of a control is cut off (or the left side in the bi-directional displays used for Hebrew and Arabic languages) and the entire control or the control's contents do not display completely.
Other Layout Issues
Because some localization tools may not provide a true ‘WYSIWYG’ display of what a dialog or property sheet will look like at runtime, occasionally resizing may cause several controls to overlap. This is especially true when dialogs contain dynamic controls that are dependent on certain configurations or machine states.
In East Asian cultures it is common for an individual's surname to precede the given (first) name. (It is also uncommon to have a middle name, so this field should never be required.) Therefore, the controls for name type data may need to be repositioned on dialogs in East Asian language versions. The localization team will reposition the surname label and textbox controls and the given name controls. This means that the logical tab order must be reset. Also, the surname textbox control should have focus when the dialog is displayed, instead of the given name field.
The tab order of controls should allow for easy, intuitive navigation of a dialog. Design guidelines suggest a tab order that changes the focus of controls from left to right and top to bottom. Focus should change between each control in a logical order, and dialogs should never have a 'loss of tab focus' state where no control on the dialog appears to have focus.
Tab order is typically problematic even in English language versions in the early lifecycle of many projects when the user interface is in flux. There is also a high probability of introducing tab order problems any time the controls on a dialog change.
All localization testing doesn’t have to be manual
In the past, much localization testing has been repetitive manual testing. Testers would manually step through every menu item and other links to instantiate every dialog and property sheet in the program, inspect each one visually, and test the behavior of such things as tab order, access keys, etc. This painstaking process would be repeated multiple times during the project lifecycle on every localized language version. Unfortunately, not only was this boringly repetitive, but because the manual testers were looking at so many dialogs during the workday their eyes simply tired out, leading to missed bugs. So, there must be a better way.
We know that each dialog has a 2-dimensional size usually measured in pixels. Once we know the height and width of the dialog or property sheet we can measure the distance from the left most edge of the dialog to the leading edge of the first control. Using control properties such as size and location that are stored in the form’s resource file we can measure the size and position of each control on a dialog or property sheet. Once all controls are identified the distance and position of the controls can then be measured in relation to the dialog or property sheet and other controls.
Using a simple example, let's consider 1 dimension of a dialog that is 250 pixels wide. The dialog contains a label control that starts 15 pixels to the right of the left-most edge of the dialog, and that label is 45 pixels in length. The textbox control next to the label starts at position 70, so there are 10 pixels between the right edge of the label control and the left edge of the textbox control. Now, let's say that textbox control is 190 pixels wide. By calculating the widths of the 2 controls plus the distances involved (15 + 45 + 10 + 190 = 260 pixels) we can see that truncation will occur on this 250 pixel wide dialog. Similarly, we can also evaluate the relative position of controls on a dialog and detect alignment issues, both horizontal and vertical, more accurately than the human eye.
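Here is a minimal C# sketch of that calculation; the ControlInfo type is a hypothetical stand-in for the size and location properties read from the form's resource file:

```csharp
using System;

struct ControlInfo
{
    public string Name;
    public int Left;   // pixels from the dialog's left edge
    public int Width;  // pixels
}

class TruncationChecker
{
    // Flags any control whose right edge extends past the dialog's width.
    static void CheckTruncation(int dialogWidth, ControlInfo[] controls)
    {
        foreach (ControlInfo c in controls)
        {
            int rightEdge = c.Left + c.Width;
            if (rightEdge > dialogWidth)
                Console.WriteLine("{0} truncates: right edge {1} exceeds dialog width {2}",
                    c.Name, rightEdge, dialogWidth);
        }
    }

    static void Main()
    {
        // The worked example above: a 250 pixel dialog with a label and a textbox.
        var controls = new[]
        {
            new ControlInfo { Name = "label1",   Left = 15, Width = 45  },
            new ControlInfo { Name = "textBox1", Left = 70, Width = 190 },
        };
        CheckTruncation(250, controls);   // reports textBox1 (70 + 190 = 260 > 250)
    }
}
```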
Of course this is not a simple solution, but if you have thousands of dialogs and property sheets, and multiple language versions, investing in an automated solution may be invaluable. In one internal case study, testing efficiency increased, manual testing and overall direct costs were significantly reduced, and the effectiveness/accuracy of reported issues also increased. Perhaps not for everyone, but it is possible!
Originally Published Tuesday, November 03, 2009
Part I provided an overview of localization class issues, and Part II discussed issues with non-translated strings in a localized product and gave some helpful hints to manage that problem during the software development lifecycle. In Part III I will cover various issues with access key mnemonics. An access key mnemonic is the underlined letter on a menu or control that corresponds to a key on the native keyboard layout for a particular language.
Missing & duplicate access key mnemonics
Interestingly enough, most localization tools have built-in checks for duplicate key mnemonics; however, missing or duplicate access key mnemonics are another significant issue in localization testing, and one that affects the English language version as well. Duplicate or missing key mnemonics can adversely affect the usability of software because they impact the user's ability to easily access or invoke commonly repeated functions using the keyboard. Duplicate or missing key mnemonics can also negatively impact the software's ability to meet certain accessibility requirements.
Although missing or duplicate access key mnemonics are sometimes caused by poorly designed dialogs with an overabundance of controls, there are other factors that can cause duplicate key mnemonics. For example, some controls may dynamically appear in some dialogs in specific machine states. These dynamically generated controls may also come from a file different from the file that generated the dialog. Another cause of duplicate key mnemonics could be dynamically generated key mnemonic assignments, which are especially problematic when a dialog contains a mixture of statically assigned and dynamically generated key mnemonics.
Manual testing for missing or duplicate key mnemonics is especially labor intensive, and finding ways to automate this testing will save countless hours of sitting in front of a computer checking menus and dialogs. There is also a high probability of missing duplicate key mnemonic assignment problems using manual testing methods because eyes get tired, people get bored, and some keys are grayed out (as in the illustration below) or may not be present in certain machine states. Fortunately, there are several automation tools that detect duplicate key mnemonic problems, and automated detection is more effective than manual test approaches. For example, the AutomationElement.AccessKeyProperty in the UIAutomation class library in C# is one way to more efficiently test access key mnemonics.
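As a rough illustration (a sketch under my own assumptions, not production code), a duplicate access key check built on the UIAutomation managed library might look something like this:

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Automation;   // UIAutomationClient / UIAutomationTypes assemblies

class AccessKeyChecker
{
    // Collects the access key assigned to every descendant of the given
    // window and reports any key assigned more than once.
    static void ReportDuplicates(AutomationElement window)
    {
        var seen = new Dictionary<string, string>();
        AutomationElementCollection elements =
            window.FindAll(TreeScope.Descendants, Condition.TrueCondition);

        foreach (AutomationElement e in elements)
        {
            string key = e.Current.AccessKey;  // e.g. "Alt+f"; empty when none assigned
            if (string.IsNullOrEmpty(key)) continue;

            if (seen.ContainsKey(key))
                Console.WriteLine("Duplicate access key {0}: {1} and {2}",
                    key, seen[key], e.Current.Name);
            else
                seen[key] = e.Current.Name;
        }
    }
}
```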
Access key mnemonic assignments
As a general rule of thumb (heuristic), key mnemonics should be assigned to characters mapped to the default state of the keyboard for each particular language. For the Latin 1 family of languages, access key mnemonics should generally not be assigned to non-ASCII characters, even if a particular character is accessible on the default state keyboard layout for a particular language. Certainly, access key mnemonics should never be assigned to a character glyph that is formed through combining characters, as used in languages such as Thai and Arabic. Also, punctuation characters should never be assigned as access key mnemonics.
Of course, the default keyboard layout for many non-Latin 1 languages only contains characters in the native script for that language, and assigning non-ASCII characters as access key mnemonics may be the only choice. However, Japanese hiragana and katakana glyphs, Korean Hangeul glyphs, and all East Asian ideographic glyphs are invalid character assignments for access key mnemonics. The default keyboard layout for most East Asian languages (Japanese, Korean, Simplified Chinese, and Traditional Chinese) is a standard keyboard layout similar to the US English keyboard. In the above example, there is no way for a Japanese user to access the 'My Computer' (マイ コンピュータ) menu item because it is using a katakana character as an access key mnemonic (which violates several guidelines for access key mnemonic assignment). Also, the standard key mnemonic guidelines described below should be used for all East Asian language versions for consistency and backwards compatibility.
Another general guideline for access key mnemonic assignments is to avoid the lower case Latin letters 'g', 'y', 'p', 'q', and 'j' because there is a high probability of confusion (for example, between the letters i and l, or q and g), and these letters can be hard to discern at high resolution desktop settings. If the number of controls on a single dialog or in a menu list requires the use of inappropriate key mnemonics, then perhaps the real problem is the design of the dialog.
East Asian language versions should use the identical key mnemonics as the English language version. The characters assigned as key mnemonics in the East Asian languages are capitalized, enclosed within parentheses, and positioned at the end of the translated string. Even when a key mnemonic appears within words or acronyms which are not translated or transliterated into the target language, the key mnemonic should be relocated to the end of the string and enclosed within parentheses for consistency.
The character assigned as the key mnemonic should be capitalized because many East Asian computer users use an English keyboard, and for users whose native language does not frequently employ Latin characters it is much easier to visually match capitalized key mnemonics with the capital letters printed on the keys of the keyboard.
Accelerator keys are commonly referred to as shortcut keys. Accelerator keys are keys (such as F1 – F12 and Esc) or key combinations (Ctrl+Shift+B, or Ctrl+C) that allow users to invoke certain functions without navigating the software menus via access keys or using the mouse to click button controls on a toolbar. Here is a good source for common Windows accelerator keys, and here is one for common accelerator keys for Office products.
Shortcut key combinations are common throughout all language versions. Contrary to the Wikipedia entry on the subject, some language versions localize the key name (not a mnemonic…it is not underlined). For example, in German the Ctrl key is localized as "Strg." And, despite the fact that it is generally frowned upon to change the ASCII upper case letter assigned to an accelerator key combination, the Spanish versions of Office use Ctrl+G (Guardar) for Save instead of Ctrl+S, and Ctrl+N (Negrita) for Bold instead of Ctrl+B. Also, letter keys used as part of an accelerator key combination are capitalized. East Asian language versions use Ctrl to designate the Control key. Also, accelerator key combinations do include an ellipsis after the letter.
In Part IV I will discuss common layout issues such as clipping and truncation.
Originally Published Friday, October 30, 2009
It should be of no surprise to anyone that localization testing generally focuses on changes in the user interface, although as mentioned in the previous post these are not the only changes necessary to adapt a product to a specific target market. But, the most common category of localization class bugs is usability or behavioral type issues that do involve the user interface. Bugs in this category generally include un-localized or un-translated text, key mnemonic problems, clipped or truncated text labels and other user interface controls, incorrect tab order, and layout issues. Fortunately, the majority of problems in this category do not require a fix in the software's functional or business layer. Also, the majority of problems in this category do not require any special linguistic skills to identify, and in some cases, an automated approach can be even more effective than the human eye (more on that later).
Perhaps the most commonly reported issue in this category is the "un-localized" or un-translated textual string. Unfortunately, in many cases un-translated strings are also an over-reported problem that only serves to flood the defect tracking database with unnecessary bugs. Translating textual strings is a demanding task, made even more difficult when there are constant changes in the user interface or in the contextual meaning of messages early in the product life cycle. Over-reporting of un-translated text too early in the product cycle only serves to artificially inflate the bug count, causes undue pressure, and creates extra work for the localization team.
Identifying this type of bug is actually pretty easy. Here's a simple heuristic: if you are testing a non-English language version in a language you are not familiar with and you can clearly read a textual string in English, it is probably not localized or translated into the target language. The illustration below provides a pretty good example of this general rule of thumb. A tester doesn't have to read German to realize that the text in the label control under the first radio button is not German.
There are several causes of un-localized text strings to appear in dialogs and other areas of the user interface. For example:
- Worst case scenario: the string is hard-coded into the source files
- Perhaps localizers did not have enough time to completely process all strings in a particular file
- Perhaps this is a new string in a file localizers thought was 100% localized
- Strings displayed in some dialogs come from files other than the file that generates the dialog, and the localization team has not processed that file
- And, sometimes (usually not often), a string may simply be overlooked during the localization process
Testing for un-localized text is often a manually intensive process of enumerating menus, dialogs, message boxes, forms, and form elements. But, if the textual strings are located in a separate resource file (as they should be), a quick scan of resource files might more quickly reveal un-translated textual strings (see the sketch below). Of course, there is little context in the resource file, and I also hope the localization team is reviewing their own work prior to handing it over to test.
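For example, assuming the strings live in .resx files (an assumption on my part; the same idea applies to other resource formats), a crude scan might flag localized values that are identical to the English baseline:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

class UntranslatedStringScanner
{
    // Flags resource strings whose localized value is identical to the English
    // baseline -- a quick (if imperfect) heuristic for un-translated text.
    // Trademarked names and other legitimately identical strings will show up
    // as false positives, so the results still need human review.
    static void Scan(string englishResx, string localizedResx)
    {
        Func<string, Dictionary<string, string>> load = path =>
            XDocument.Load(path).Descendants("data")
                .Where(d => d.Element("value") != null)
                .ToDictionary(d => (string)d.Attribute("name"),
                              d => (string)d.Element("value"));

        var english = load(englishResx);
        var localized = load(localizedResx);

        foreach (var pair in localized)
        {
            string englishValue;
            if (english.TryGetValue(pair.Key, out englishValue) &&
                englishValue == pair.Value)
                Console.WriteLine("Possibly un-translated: {0} = \"{1}\"",
                    pair.Key, pair.Value);
        }
    }
}
```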
Also, here are a few suggestions that might help focus localization testing efforts early in the project milestone and reduce the number of ‘known’ or false-positive un-translated text bugs being reported:
- Ask the localization team to report the percentage of translation completion by file or module for each test build. Early in the development lifecycle, un-translated text should be reported as a valid bug only in modules that are reported to be 100% complete. Of course, sometimes some strings are used in multiple modules, or may be coming from external resources. But, especially early in the development lifecycle, reporting a gaggle of un-translated text bugs is simply "make work." As the life cycle starts winding down…all strings are fair game for bug hunters!
- Testers should use tools such as Spy++ or Reflector to help identify the module or other resource, and the unique resource ID for the problematic string or resource. This is much better than simply attaching an image of the offending dialog to a defect report. Identifying the module and the specific resource ID number allows the localization team to effect a quick fix instead of having to search for the dialog through repro steps and track down the problem.
- Also remember that not all textual strings are translated into a specific target language. Registered or trademarked product names are often not translated into different languages. When in doubt, ask the localization team whether a string that appears un-localized is a 'true' problem or not.
Un-localized strings caused by hard-coded strings also tend to occur in menu items. This is especially true for Windows Start menu items or sub-menu items hard-coded in the INF or other installation/setup files. For example, the image on the right shows a common problem on European versions of Windows. Many European language versions localize the name of the Program Files folder and the corresponding menu item in the Start menu. But, oftentimes when we install an English language version of software on Windows it creates a new "Programs" menu item (and even a new Program Files directory) rather than detecting the default folder to install to. In the example on the left, the string Accessories is a hard-coded folder name. But, there is another issue as well: this illustrates not only a problem with the non-translated string "Accessories," but also shows one full-width Katakana string for 'Accessories' and another half-width string.
In Part III I will discuss another often problematic area in localization…key mnemonics.
Originally Published Tuesday, October 27, 2009
When I first joined Microsoft 15 years ago I was on the Windows 95 International team. Our team was responsible for reducing the delta between the release of the English version and the Japanese version to 90 days, and I am very proud to say that we achieved that goal and Windows 95 took Japan by storm. It was so amazing that even people without computers were lined up outside sales outlets waiting to purchase a copy of Windows 95, and the growth of personal computers in Japan shot through the roof over the next few years. Today the Chinese market is exploding, and eastern European nations are experiencing unprecedented growth as well. While demand for the English language versions of our software remains high, many of our customers are demanding software that is 'localized' to accommodate the customer's national conventions, language, and even locally available hardware. Although much of the Internet's content is in English, non-English sites on the web are growing, and this week in Seoul, Korea, ICANN is even considering allowing international domain names that contain non-ASCII characters.
But, a lot has changed in how we develop software to support international markets. International versions of Windows 95 were developed on a forked code base. Basically, this means the source code contained #ifdefs to instruct the compiler to compile different parts of the source code depending on the language family. From a testing perspective this is a nightmare, because if the underlying code base of a localized version is fundamentally different from the base (US English) version, then the testing problem is magnified because a lot of functionality must be retested. Fortunately, much of the software being produced today is based on a single worldwide binary model. (I briefly explained the single worldwide binary concept at a talk in 1991, and Michael Kaplan talks about the advantages here.) In a nutshell, a single worldwide binary model is a development approach in which any functionality any user anywhere in the world might need is codified in the core source code, so we don't need to modify the core code once it is compiled to include language/locale specific functionality. For example, it was impossible to input Japanese text into Notepad on an English version of Windows 95 using an Input Method Editor (IME); I needed the localized Japanese version. But, on the English version of Windows XP, Vista, or Windows 7 all I have to do is install the appropriate keyboard drivers and font files and expose the IME functionality. In fact, these days I can map my keyboard to over 150 different layouts and install fonts for all defined Unicode characters on any language version of the Windows operating system.
The big advantage of the single worldwide binary development model is that it allows us to differentiate between globalization testing and localization testing. At Microsoft we define globalization as "the process of designing and implementing a product and/or content (including text and non-text elements) so that it can accommodate any local market (locale)." And, we define localization as "the process of adapting a product and/or content to meet the language, cultural and political expectations and/or requirements of a specific target market." This means we can better focus on the specific types of issues that each testing approach is most effective at identifying. For localization testing, this means we can focus on the specific things that change in the software during the 'adaptation processes' used to localize a product for each specific target market.
The most obvious adaptation process is the 'localization,' or more accurately the translation, of user interface textual elements such as menu labels, text in label controls, and other string resources that are commonly exposed to the end user. However, the translation of string resources is not the only thing that occurs during the localization process. The processes required to adapt a software product to a specific locale may also include other changes, such as different font files and drivers installed by default, registry keys set differently, drivers to support locale specific hardware devices, etc.
3 Categories of Localization Class Bugs
I am a big fan of developing a bug taxonomic hierarchy as part of my defect tracking database, as a best practice, because it enables me to analyze bug data more efficiently. If I see a particular category of bug, or a type of bug within a category, being reported a lot, then perhaps we should find ways to prevent or at least minimize that problem from occurring later in the development lifecycle. After years of analyzing different bugs, I classified all localization class bugs into 3 categories: functionality, behavioral/usability, and linguistic quality.
Functionality type bugs exposed in localized software affect the functionality of the software and require a fix in the core source code. Fortunately, with a single worldwide binary development model where the core functionality is separated from the user interface, the number of bugs in this category is relatively small. Checking that the appropriate registry keys are set and files are installed in a new build is reasonably straightforward and should be built into the build verification test (BVT) suite (see the sketch below). Other things that should be checked include application and hardware compatibility. It is important to identify these types of problems early because they are quite costly to correct, and can have a pretty large ripple effect on the testing effort.
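As a simple illustration, a BVT-style registry check might look something like the following sketch; the key path and value are hypothetical placeholders, not actual product settings:

```csharp
using System;
using Microsoft.Win32;

class LocalizationBvt
{
    // Verifies that a locale-specific registry value is set as expected.
    static bool CheckRegistryValue(string keyPath, string valueName, object expected)
    {
        using (RegistryKey key = Registry.LocalMachine.OpenSubKey(keyPath))
        {
            object actual = (key == null) ? null : key.GetValue(valueName);
            bool ok = expected.Equals(actual);
            Console.WriteLine("{0}\\{1}: {2}", keyPath, valueName, ok ? "OK" : "FAIL");
            return ok;
        }
    }

    static void Main()
    {
        // Hypothetical example: confirm the build installed the expected default.
        CheckRegistryValue(@"SOFTWARE\Contoso\App", "DefaultUILanguage", "ja-JP");
    }
}
```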
Behavioral and usability issues primarily impact the usefulness or aesthetic quality of the user interface elements. Many of the problems in this category do not require a fix in the core functional layer of the source code. The types of bugs in this category include layout issues, un-translated text, key mnemonic issues, and other problems that are often fixed in the user interface form design, form class, or form element properties. This category often accounts for more than 90% of all localization class bugs. Fortunately, the majority of problems in this category do not require any special linguistic skills; a careful eye for detail is all that is required to expose even the most subtle bugs in this category.
The final category of localization class bugs is linguistic quality. This category is primarily mis-translations. Obviously, the ability to read the language being tested is required to identify most problems in this category. We found testers spent a lot of time looking for this type of bug, but the majority of linguistic quality issues reported were later resolved as won't fix. There are many reasons for this, but here is my position on this type of testing…stop wasting your test team's time validating the linguistic quality of the product. If this is a problem, then hire a new localizer, hire a team of 'editors' to review the work of the localizer, or hire a professional linguistic specialist from the target locale as an auditor. Certainly, if testers see an obvious gaffe in a translation then we must report it; however, testers are generally not linguistic experts (even in their native tongue), and I would never advocate hiring testers simply based on linguistic skills, nor as a manager would I want to dedicate my valuable testing resources to validating linguistic quality…that is usually not where their expertise lies, and it probably shouldn't be.
Since behavioral/usability issues are the most prevalent type of localization class bug, this series of posts will focus on localization testing designed to evaluate user interface elements and resources. The next post will expose the single most reported bug in this category.
I love autumn! Yes, I am definitely a boy of summer and very much prefer warmer weather; however, there is something special about autumn. This past weekend my daughter, and my 2 friends Dongyi and her husband Yuning and I participated in the Rum Run sailboat fun race with an overnight raft up at Bainbridge Island’s Port Madison. Saturday morning was quite rainy, but the wind was blowing 15 knots with gusts to 25 knots and NOAA weather radio announcing gale force warnings in Puget Sound. Wow…what a ride! But, it was actually the rather relaxing sail back to my marina on Sunday morning that rekindled the beauty of autumn in my mind. The bright reds, golden yellows, and pastel browns of the foliage seemed to blend into a collage framed by the darkness of the waters of Puget Sound and the snow covered peaks of the Olympic mountains. The beauty of autumn reminds me about change. A sloughing of the old, the cleansing brought about by the pure white snows, eventually followed by the new and fresh growth that blossoms in spring.
Just as the earth goes through variable cycles of rejuvenation, we must also continually update our tests, and (more importantly) the test data we use in our test cases to prevent them from becoming stale. Trees shed their leaves in the autumn and new leaves emerge in the spring, but the tree is fundamentally still the same tree. Similarly, a well-designed test case has a unique fundamental purpose and by changing the variables we can grow the value of that test case. Of course, the cycle of change in test data should be dramatically shorter in duration as compared to the seasonal changes of mother earth.
Here is a simple example of how a well-designed test case using variable test data can increase the value of the information each test iteration provides, through increased confidence, and also potentially reduce overall risk. In my role at Microsoft I am in a unique position to not only conduct controlled studies, but also to implement ideas into practice on enterprise level software projects. One experiment I started about 2 years ago involved multiple groups of testers (sessions) located around the world, divided into 3 separate control groups. Each control group tested the identical web page, which would display the stock price if the user input a valid stock ticker symbol into a single textbox on the page and pressed the OK button. The only difference between the control groups was the instructions for performing a single positive test case with the specific purpose of "ensure any valid stock ticker symbol displays the current stock price for the publicly traded stock specified by its symbol." The purpose of the study was to determine if different cultural and experiential backgrounds impacted the test data used in a test, based on the instructions for the test case. The study collected demographic information on the participants as well as the specific inputs applied to the web page. Information on the oracle used by the participants was collected anecdotally. Step one in each test was identical because we were not interested in how the tester launched the browser. (Of course this assumes there are other tests that exercise the multitude of ways to launch a browser and navigate to a URL. Also, if the browser failed to launch the test case is blocked.)
Group 1 was given the most vague instructions for the test case. The instruction was simply:
- Launch browser and navigate to [url address]
- Enter a valid stock ticker symbol and press the OK button and verify the accuracy of the returned stock price.
The instructions in the test case given to Group 2 were also somewhat vague, but provided a little guidance both on input and oracle.
- Launch browser and navigate to [url address]
- Enter a valid stock ticker symbol (e.g. “MSFT”)
- Press the OK button
- Verify the returned stock price is identical to the current stock price listed on the appropriate exchange
Group 3 had similar instructions to Group 2, but the group was given additional guidance as indicated below.
- Launch browser and navigate to [url address]
- Enter a valid stock ticker symbol from a publicly traded stock listed on any public stock exchange
Listings of valid stock ticker symbols are on stock exchange web sites such as:
- Press the OK button
- Verify the returned stock price is identical to the current stock price listed on the appropriate exchange
The results were mostly not surprising, but rather reinforcing. For example, we expected Group 1's inputs to be rather random, but mostly aligned with ticker symbols the testers were familiar with. Of course, the majority (90%) of stock ticker symbols entered was MSFT, and there was no significant difference by cultural background, locale, experience, or educational background. (As this study was conducted at Microsoft I am sure there was some bias as to the symbol entered.) What was most interesting was that testers with no formal training (no previous courses in testing, no CS degree, and fewer than one discipline specific book read) and with more than 2 years of test experience were approximately 25% more likely to violate the purpose of the test and enter random or completely invalid data as their first action. In other words, instead of executing the required test, their initial reaction was to immediately go on a bug hunt.
In Group 2, 99% of the participants simply entered the stock ticker symbol "MSFT." But, what was even more surprising was the fact that on the next day, the same people in that group were given the same exact test, and 95% of them simply re-entered MSFT. Perhaps this is laziness, perhaps this is related to the superficial nature of the study, or perhaps this is due to individuals taking the path of least resistance. The percentage of people who entered identical stock ticker symbols on consecutive days was not significantly different between Group 1 and Group 2.
It should be no surprise that Group 3 had the greatest distribution of variable test data applied to the web page. Demographics had no impact on any of the people who were in Group 3. The majority of people in Group 3 (78%) would select the first stock exchange listed (regardless of what link it was), but there was no significant overlap in the selected stock ticker symbols. When asked to repeat the test on the next day, 83% of the participants selected a different link and a different symbol. Of those who selected the same link, 97% selected a different stock ticker symbol. On the down side, approximately 4% of the people simply took the path of least resistance and input MSFT as the test data on both days of the experiment.
One of the most common complaints I hear about 'scripted,' or pre-defined, test cases is that they are too prescriptive and not flexible enough to allow the tester to try things. Of course, a well-designed test case is not simply a prescriptive set of steps inputting the same hard-coded test data run over and over. So, in this study we made the assumption that a scripted test case that specified "Enter MSFT in the textbox" would simply result in the tester entering "MSFT" without any thinking on the part of the tester. Hard-coding variable test data is oftentimes the worst possible way to design a test case.
Vaguely written test cases added some level of variability, but also seemed to increase the probability of the tester executing context-free tests outside the scope of the purpose of the test. In fact, what we found was that some testers (approximately 2%) simply went on a bug hunt and never actually input a valid stock ticker symbol at all during the session.
A test case that provided only one example representative of the type of test data required produced the least desirable results in this study. I am not sure this would be the case in practice. However, based on this study, if I were to outsource execution of a test case similar to that used by Group 2, the only thing I could guarantee is that MSFT would definitely be tested numerous times, and the variability of other test data would be extremely limited regardless of the number of testers executing that test or the number of iterations.
When faced with a virtually infinite number of possibilities for input variables as test data used in either positive or negative tests, we need to test as many possibilities as possible, given the available resources, in order to increase test coverage and reduce overall risk. So, one way to increase the coverage of test data while still achieving the specific purpose of the test case is to provide useful resources that help guide the tester, while relying on the tester's creative thinking skills and curiosity to expand the test coverage.
Of course, we can also increase the variability of test data and capture the essence of the tester's creativity using a similar approach in a well-designed automated test case. In fact, a similarly designed automated test case enables us to significantly increase the amount of variable test data that is exercised in order to expand test coverage and increase overall confidence, as the sketch below suggests.
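Here is a hedged sketch of what the automated analog of the Group 3 test case might look like; the symbols.txt pool file and the omitted browser-driving steps are illustrative assumptions on my part:

```csharp
using System;
using System.IO;

class StockPriceTest
{
    static readonly Random Rand = new Random();

    // Draws a random symbol from a pool of valid ticker symbols
    // (symbols.txt is a hypothetical file, one symbol per line).
    static string PickRandomSymbol(string poolFile)
    {
        string[] symbols = File.ReadAllLines(poolFile);
        return symbols[Rand.Next(symbols.Length)];
    }

    static void Main()
    {
        string symbol = PickRandomSymbol("symbols.txt");

        // Log the selected value so any failure is reproducible.
        Console.WriteLine("Testing with symbol: " + symbol);

        // ...drive the browser, enter the symbol, press OK, and compare the
        // returned price against the exchange's quote (the test oracle).
    }
}
```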
Originally Published Sunday, October 11, 2009
A significant percentage of static test data is stored in tabular comma-delimited or tab-delimited formats, or saved in Excel spreadsheets. Reading comma- or tab-delimited static test data into an automated test is pretty straightforward, and there are numerous examples in many programming languages illustrating how to read these types of test data repositories. Reading in rows of data is the foundation of data-driven automation and definitely has its place in any automation project.
I am a big proponent of stochastic (random) test data generation that is customized to the context, but I also know that sometimes static test data is useful for establishing baselines and more exact emulation of ‘real-world’ customer-like inputs. But, if the automated test is simply passing the same variable arguments to the same input parameters in the same order over and over again the value of subsequent iterations of that automated test using that static data set diminishes rather quickly. So how can we more effectively utilize static test data in our automated tests?
One possible solution is to randomly select an argument from a collection of static variables to pass to the specific input parameter. The advantage of this approach is that it effectively increases the test data permutations in each iteration of the test case. For example, let's consider 2 input parameters: one for a given name and one for a surname. In a traditional data-driven approach in which the static test data is read in by rows, our test data file might be similar to:
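For instance, four rows of given name/surname pairs (the particular values here are purely illustrative):

```
John,Smith
Mary,Jones
Kenji,Tanaka
Anna,Novak
```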
This static data file would give us 4 sets of test data, but each time the test data is read into the test case the given names and surnames are always paired the same way.
However, if we read the given names and surnames into 2 collections, and then randomly select a given name and a surname from the appropriate collection to pass to the respective parameter, we effectively have 16 possible combinations of static test data to work with. An advantage of this approach is that our 'collections' of given names and surnames can contain differing numbers of elements (in which case the number of possible combinations of test data is the Cartesian product of the number of elements in each collection).
Of course there are many ways to accomplish this. For example, one approach is to continue to use a comma- or tab-delimited file format and list given names in one row and surnames in a second row. Another approach is to list the given names and surnames in columns in a spreadsheet and read each column into a collection of some sort. The latter is the approach I used in developing my PseudoName test data generator tool. I chose this approach for 2 reasons: first, an Excel spreadsheet is a simple yet powerful file format for storing static test data, and second, lists of test data are sometimes better represented in columns rather than rows.
The following sketch illustrates one way to read in test data by columns from an Excel spreadsheet.
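This minimal example uses the Excel COM interop assembly; it assumes a reference to Microsoft.Office.Interop.Excel, a workbook whose first worksheet holds one test data element per cell, and it also shows the random pairing described above:

```csharp
using System;
using System.Collections.Generic;
using Excel = Microsoft.Office.Interop.Excel;

class ColumnReader
{
    // Reads each non-empty cell in the given column (1-based) of the first
    // worksheet into a collection of strings.
    public static List<string> ReadColumn(string workbookPath, int column)
    {
        var values = new List<string>();
        var excel = new Excel.Application();
        try
        {
            Excel.Workbook book = excel.Workbooks.Open(workbookPath);
            Excel.Worksheet sheet = (Excel.Worksheet)book.Worksheets[1];
            int row = 1;
            // Walk down the column until the first empty cell.
            while (true)
            {
                object cell = ((Excel.Range)sheet.Cells[row, column]).Value2;
                if (cell == null) break;
                values.Add(cell.ToString());
                row++;
            }
            book.Close(false);
        }
        finally
        {
            excel.Quit();
        }
        return values;
    }

    static void Main()
    {
        // Read given names from column 1 and surnames from column 2, then
        // randomly pair them to produce one of the possible combinations.
        List<string> givenNames = ReadColumn(@"testdata.xlsx", 1);
        List<string> surnames = ReadColumn(@"testdata.xlsx", 2);

        var rand = new Random();
        string name = givenNames[rand.Next(givenNames.Count)] + " " +
                      surnames[rand.Next(surnames.Count)];
        Console.WriteLine(name);
    }
}
```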
I must tell you that performance can be an issue, especially if the columns contain a lot of data. For example, reading approximately 700 elements of test data in 3 separate columns took slightly less than 1 second, and reading 1800 elements in 3 columns required just over 4 seconds. Unfortunately, I didn't compare total byte counts, but it is pretty obvious that the greater the number of test data elements being read, the longer the read operation will take, and you will certainly have to take the read time into consideration in your automated test case.
Reading static test data line by line from a data file while looping through a data-driven automated test case is a useful test design approach in some situations, but this is another useful approach, one that allows the test designer to randomize the combinations of static test data values applied to multiple input parameters across multiple iterations of an automated test case.
Originally Published Thursday, October 01, 2009
Software is knowledge. Software is the intangible product crafted by a team of people who have pooled their intellectual knowledge to help solve a complex problem and add value to those who use that software. So how does a tester contribute to the intellectual knowledge pool?
I guess we could say that finding and reporting bugs during the software development lifecycle (SDLC) is important knowledge because it helps identify many anomalies prior to release. But the mere act of finding and reporting bugs is transient knowledge. Reporting bugs in the system does not add any long-term or persistent value to the intellectual knowledge pool of a software company. Perhaps even worse, finding the same type of issue repeatedly actually stagnates the intellectual knowledge pool because the team is focused on fixing the same problem over and over again. Of course, finding really interesting and important bugs requires a lot of knowledge and creativity. But once a bug is fixed, the value it may have provided to the intellectual knowledge pool evaporates, especially if no shared learning occurs as a result of that fix.
One way software testers can significantly contribute to the intellectual knowledge pool is through defect prevention instead of defect detection. Simply put, if we expand our vision of the role of the tester to include problem solving instead of just problem finding, we can open up new challenges, provide greater overall value to our business, and further advance the discipline of software testing. For example, if we were to identify a particular area or category of defects and determine the root causes for that type of problem, then we could implement various strategies or best practices to prevent those types of issues from being injected into the product design from the outset, or at least develop testing patterns or tools to help the team identify many issues in that category sooner in the SDLC. Understanding why certain categories of problems occur and providing best-practice solutions within the appropriate context is intellectual knowledge that can be shared with existing and new team members, and can persist to help prevent certain types of problems from recurring in the future. This is intellectual capital in the knowledge pool. Testing tools and test patterns that can be shared and taught to others, and that help identify certain types of issues sooner, can reduce testing costs. This, too, is intellectual capital in the knowledge pool.
If I am constantly burdened with finding the same types of problems over and over again, then my contribution to the SDLC and the knowledge pool is essentially limited to the bugs I find, and the value of those bugs often depreciates rapidly. Basically, I am simply identifying problems; I am not contributing to solving the problems.
Of course, I don’t think testers will ever work themselves out of a job, and we will always be in the business of identifying issues during the SDLC. But if I solve one type of problem, then I can move on to face new and more difficult challenges. By solving one problem I get that job off my plate, and I can then move on to the next one. Organizational maturity and professional growth occur through solving increasingly complex problems, not by continually dealing with the same problem.
I think the role of a professional tester is growing beyond simple problem identification, and many of us are exploring the more challenging aspects of problem solving. Finding ways to prevent defects or identify issues earlier, and essentially drive quality upstream, are exciting challenges that will increase the value of our contributions to the intellectual knowledge pool and advance the profession of software testing.
Originally Published Thursday, September 24, 2009
The past 2 weeks have been a bit rough. While in Israel I began to feel a bit congested. By the time I hit Nürnberg, Germany for the 12th International Conference on Quality Engineering in Software Technology I was injecting nose-juice (nasal decongestant) about every 2-3 hours and couldn’t sleep through the night. Fortunately I didn’t speak until Friday, so Monday morning I visited a local Apotheke (pharmacy), described my symptoms, and was presented with some medicinal remedies by the pharmacist. By Wednesday I was much worse, so I tried another pharmacy and was given a different batch of drugs. By Friday morning I was struggling, but managed to present my talk on stochastic test data generation using parameterized equivalence partitions and genetic algorithms (which I will discuss in a future post). Unfortunately, I had to cancel another engagement and reschedule my flight home for Saturday. Once home I went to my doctor and was quickly diagnosed with a bacterial infection in my nasal cavities.
Now, I am not telling you this story to seek your sympathy, but to illustrate a point. I had convinced myself that I simply had a slight cold that I could treat with over-the-counter remedies, and perhaps due to my own stubborn nature I refused advice from my friends in Germany to see a physician. In the end, I realized I was treating the symptoms and ignoring the root cause of the real problem. So, I sometimes wonder if we are too busy treating the symptoms of buggy software by focusing our testing efforts on bug detection, rather than addressing the real problem and thinking more about bug prevention.
In my opinion, one of the most significant ways we can directly impact the quality of the product and the effectiveness of our teams is not by trying to beat the bugs out of the product after the designers and developers have spent days or weeks injecting bugs into it, but by partnering with the PMs and developers earlier in the lifecycle to prevent issues from ever getting into the product to begin with. If we continue to think of testing as an after-the-fact process, then we might never advance our discipline, and perhaps even worse, we might relegate the role of testers to nothing more than bug-finders.
Defect prevention doesn’t negate or eliminate the need for system level testing, but it could certainly change the role of testers throughout any product lifecycle. Rather than perpetuating an adversarial “don’t trust the developer” attitude, I envision testers and developers working in a more symbiotic relationship (Доверяй, но проверяй – Trust, but verify). For example, I think many readers would agree that developers are responsible for unit testing, but I wonder how many testers are proactively engaging their development partners and suggesting ways to improve the effectiveness of their unit tests (without adding significant additional overhead), or participating in code inspections. And how many testers are engaged in design reviews and prototyping with program managers and designers in an effort to prevent the sub-optimal designs that often lead to a tremendous amount of rework?
The ability to move quality upstream through defect prevention requires different skills and capabilities, but also opens up new and greater challenges for software testers.
“Bug prevention is testing’s first goal.” – B. Beizer