I.M. Testy

Treatises on the practice of software testing

Archive for the ‘Test Data’ tag

A Source of “Real-World” Test Data for Globalization Testing

with 4 comments

I am generally not a big fan of static test data. In the proper context static test data can provide some value, but we should be aware of the common problems with files of static test data or (even worse) test data hard-coded in a test case. Some problems with static test data include:

  • Stagnation – static test data may add some initial value, but reusing the same test data over and over diminishes the value of that test. For example, retesting the same name strings in a first name input textbox provides no new information if those ‘static’ names worked in the previous build and the underlying functionality has not changed.
  • Contextual blindness – sometimes we have files of static test data that were identified as “problematic” in one situation (context), so we reuse the “problematic” test data regardless of the context. In 1995 I wrote a white paper on “problematic” double-byte encoded characters (DBCS) explaining why each code point was problematic in a given context. For example, a Japanese character with a trail byte of 0x5C (the same value as the ASCII backslash) might be problematic in a filename on an ANSI-based system that parsed strings byte by byte rather than by wide characters. This is not true on Windows systems where the default encoding is UTF-16. However, some people continue to use obsolete DBCS problem characters, perhaps because they don’t fully understand the underlying contextual differences between ANSI-based encodings and Unicode.
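The 0x5C trail-byte problem can be sketched in a few lines of C#. This is only an illustration, not code from the white paper: the class and method names are hypothetical, the byte array is the Shift-JIS (code page 932) encoding of the filename “ソ.txt” written out by hand, and the lead-byte ranges are the usual ones assumed for that code page.

```csharp
using System;

public static class TrailByteSketch
{
    // "ソ.txt" in Shift-JIS: the katakana ソ is the byte pair 0x83 0x5C,
    // so its trail byte has the same value as the ASCII backslash.
    public static readonly byte[] ShiftJisName = { 0x83, 0x5C, 0x2E, 0x74, 0x78, 0x74 };

    private const byte Backslash = 0x5C;

    // Naive scan: treats every byte as a complete character.
    public static int FindBackslashByteWise(byte[] text)
    {
        return Array.IndexOf(text, Backslash);
    }

    // DBCS-aware scan: skips the trail byte that follows a lead byte.
    public static int FindBackslashDbcsAware(byte[] text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (IsLeadByte(text[i])) { i++; continue; }
            if (text[i] == Backslash) return i;
        }
        return -1;
    }

    // Lead-byte ranges assumed for code page 932.
    public static bool IsLeadByte(byte b)
    {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    // The same name as a UTF-16 string contains no backslash code unit.
    public static int FindBackslashUtf16(string text)
    {
        return text.IndexOf('\\');
    }
}
```

The byte-wise scan reports a path separator at index 1, while the DBCS-aware scan and the UTF-16 scan of "ソ.txt" both report none. That is why the character is only “problematic” in the byte-parsing context.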

Perhaps on the opposite end of the test data spectrum is random test data. Many of you who read this blog or have heard me speak know that I am a big proponent of parameterized random test data generation. Parameterization allows us to better model our test data. I know that parameterized random data can be crafted to be representative of real data, but it is still not “real” data.

But, there may be a happy medium between static test data and random test data. And, best of all, it is abundantly available. One of the best sources of (especially non-English) test data is something most of us already use on a daily basis. The test data source I speak of is social networks.

I have met many wonderful people from around the world both in person and virtually, and stayed in contact with many of them. Last year while keynoting at the first software testing conference in Vietnam (VistaCon 2010) I was privileged to meet my dear friend Thuyen, who helped organize the conference. Since the conference we have stayed in contact via email and Facebook. When she posts on Facebook it is usually in Vietnamese. Since I don’t (yet) read Vietnamese I use Bing Translator to help me figure out the comment.

Last week she had an entry on her Facebook wall that began “Tối nay vô tình nghe trên TV 1 bài hát mà giai điệu…” So, I copied the entry and opened Bing Translator to translate the entry.

[Image: Bing Translator output for the copied Facebook entry]

Many of you will quickly notice the strange anomaly in the translation. I initially thought that this service might be incrementing this numeric value for some reason, but when I changed the number value to 2 the number 2 displayed in the translated string. I tried various other numbers and quickly discovered that 6 incremented to the number 7, 8 decremented to 7, and 9 decremented to the number 3. I didn’t see a clear pattern here, so I thought this might be an issue resulting from parsing a particular sequence of characters.

So, I modified different parts of the string (removed words) to narrow down the problem. I found the string “tình nghe trên TV 1 bài hát mà giai điệu” contained the problematic sequence. Removing any ‘word’ from this string displayed the translated string with a number of 1, with the exception of 1 word. Removing the word “nghe” from the above string resulted in the translation illustrated below.

[Image: Bing Translator output after removing the word “nghe”]

By the way…the Google translation engine doesn’t fare much better. And, the results are different between www.google.com/ig and http://translate.google.com.

But, the purpose of this post is not to illustrate this particular bug, but to give you ideas of how we can use social network feeds in our testing. People around the world use social networks and you can find “real world” strings in various languages that you can use as test data in various contexts. Most of the time this ‘test data’ will probably not reveal a bug, but sometimes it can reveal interesting issues. Best of all, strings taken from social networks are not some manufactured static or random test data. Using strings copied from social networks is about as “real world” as we can get…this is the “data” from our customers.
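The narrowing-down step described above (remove one word, translate again, compare) is easy to automate once there is a check to run. The following is only a sketch under stated assumptions. The class and method names are hypothetical, the `transform` delegate stands in for whatever service is under test (no real translation API is called), and the “digits must survive unchanged” rule is an assumed oracle that a real translator could legitimately violate by spelling numbers out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RealWorldStringChecks
{
    // True when the digits in the output, in order, match the digits in the input.
    public static bool DigitsPreserved(string input, string output)
    {
        string inputDigits = new string(input.Where(c => char.IsDigit(c)).ToArray());
        string outputDigits = new string(output.Where(c => char.IsDigit(c)).ToArray());
        return inputDigits == outputDigits;
    }

    // Removes one space-separated word at a time and returns the words whose
    // removal makes the digit anomaly disappear.
    public static List<string> WordsNeededForFailure(string input, Func<string, string> transform)
    {
        string[] words = input.Split(' ');
        var needed = new List<string>();
        for (int i = 0; i < words.Length; i++)
        {
            string reduced = string.Join(" ", words.Where((word, index) => index != i));
            if (DigitsPreserved(reduced, transform(reduced)))
            {
                needed.Add(words[i]);
            }
        }
        return needed;
    }
}
```

A string copied from a feed goes in as `input`. An empty result means that no single-word removal makes the anomaly go away, so the next step is to remove pairs of words or to try shorter substrings.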

Written by Bj Rollison

January 23rd, 2011 at 10:14 am

Random String Generation…Update!

without comments

Originally Published Tuesday, February 17, 2009

One of the biggest challenges in input testing is the sheer number of potential characters and the virtually infinite number of permutations of those characters in different character positions in a string. Even if we know about the myriad language scripts used throughout the world, manually generating characters from multiple language groups would be excruciatingly inefficient.

Since any modern application should support Unicode characters, we can assert the strings “abcdefg” and “ڄƥ藖꼩昨” are equivalent for most input testing requiring a Unicode string. So, random string test data generation is useful for easily increasing the breadth of test data tested, and also for testing the robustness of the application’s ability to process complex data streams.

Babel 2.0 is a free test tool, and one of the few random string generators that can generate a string of characters across the entire Unicode spectrum. Since its initial release in 2006 it has been widely popular. So, I am happy to announce that an updated Babel 2.0 is released! I know this constitutes a shameless plug…but sometimes it helps to plug tools we’ve made that can benefit other testers or developers.

Unlike many string generators that only produce a string of random ASCII characters, Babel can produce a string of random Unicode characters defined in the Unicode 5.1 specification, including surrogate pair characters (which often expose problems in various text boxes…hint, hint). Additional updates to Babel 2.0 include:

  • Updated to the Unicode 5.1 spec (including new script groups and character code points)
  • Ability to include/exclude combining character code points
  • Ability to include/exclude reserved NetBIOS characters
  • Custom range allows character generation from 0x01 through 0xFFFF
  • Ability to generate strings with a max length of 100,000 characters
  • Improved distribution of characters from the selected language script groups
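As for why surrogate pair characters tend to expose problems in text boxes: a character above U+FFFF occupies two UTF-16 code units, so code that counts or truncates by `char` can split it in half. The sketch below is illustrative only (the class and method names are hypothetical and have nothing to do with Babel’s own API), but the `System.Char` calls it uses are standard .NET.

```csharp
using System;

public static class SurrogatePairSketch
{
    // Counts Unicode code points rather than UTF-16 code units.
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            // A surrogate pair is one code point stored in two chars.
            if (char.IsSurrogatePair(text, i)) i++;
            count++;
        }
        return count;
    }

    // Truncates to at most maxChars code units without leaving a
    // dangling high surrogate at the end of the result.
    public static string SafeTruncate(string text, int maxChars)
    {
        if (text.Length <= maxChars) return text;
        int cut = maxChars;
        if (cut > 0 && char.IsHighSurrogate(text[cut - 1])) cut--;
        return text.Substring(0, cut);
    }
}
```

For example, `"ab" + char.ConvertFromUtf32(0x20000)` has a `Length` of 4 but only 3 code points. A naive `Substring(0, 3)` would end in half of a character, whereas `SafeTruncate` backs off to "ab".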

The following illustration provides a basic flow diagram of how Babel generates random strings. Essentially, one script group is randomly selected from all selected script group nodes, and all code points assigned to that script group are put into a collection. Next, one character is randomly selected from that collection and is appended to a string. This process continues until the string length equals a specified number of characters.

[Illustration: Babel random string generation flow diagram]

Better distribution of character selection across multiple script groups occurs by preventing the same script group from being selected again before at least ½ of the other specified groups are selected. This means that, as long as more than one script group node is selected, the script group that was just used is removed from the random selection process until at least half of the other script groups have been chosen. This provides a more even distribution than simple random generation.
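The selection scheme described above can be sketched roughly as follows. This is not Babel’s source code. The four script group ranges are hypothetical stand-ins for its Unicode data tables, the names are invented, and the “resting” queue is just one simple way to keep a group out of the draw until half of the other groups have been picked.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class ScriptGroupSketch
{
    // Hypothetical script groups as inclusive code point ranges.
    public static readonly int[][] Groups =
    {
        new[] { 0x0041, 0x005A }, // Latin upper case
        new[] { 0x03B1, 0x03C9 }, // Greek lower case
        new[] { 0x0410, 0x044F }, // Cyrillic
        new[] { 0x3041, 0x3096 }, // Hiragana
    };

    public static string Generate(int seed, int length, List<int> pickedGroups = null)
    {
        var random = new Random(seed);
        var result = new StringBuilder();
        // Half of the *other* groups, rounded up (0 when only one group exists).
        int cooldown = Groups.Length / 2;
        var resting = new Queue<int>();

        for (int i = 0; i < length; i++)
        {
            // Only groups that are not resting may be drawn.
            List<int> candidates = Enumerable.Range(0, Groups.Length)
                .Where(g => !resting.Contains(g)).ToList();
            int group = candidates[random.Next(candidates.Count)];

            // Pick one code point from the chosen group and append it.
            int codePoint = random.Next(Groups[group][0], Groups[group][1] + 1);
            result.Append(char.ConvertFromUtf32(codePoint));
            if (pickedGroups != null) pickedGroups.Add(group);

            // The chosen group rests until 'cooldown' other groups are drawn.
            resting.Enqueue(group);
            if (resting.Count > cooldown) resting.Dequeue();
        }
        return result.ToString();
    }
}
```

With four groups the cooldown is two, so no group is drawn again until two different groups have been used. Because the generator is seeded, calling `Generate` twice with the same seed and length returns the same string.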

The download also includes the Babel.DLL (and the dependent UnicodeData.DLL) for test automation. The older methods are deprecated and no longer supported. The new methods have been simplified and now include:

public static string Polyglot (int, int, bool, bool, bool, bool, bool)
Returns a string of random Unicode characters in all Unicode script groups based on a specified seed value.

public static string Polyglot (int, bool, bool, bool, bool, bool, out int)
Generates a random seed value, returns a string of random Unicode characters in all Unicode script groups, and passes the seed value back through the out parameter.

public static string Polyglot (int, int, bool, bool, bool, bool, bool, char, char)
Returns a string of random Unicode characters in all Unicode script groups based on a specified seed value.

public static string Polyglot (int, bool, bool, bool, bool, bool, char, char, out int)
Generates a random seed value, returns a string of random Unicode characters in all Unicode script groups, and passes the seed value back through the out parameter.

Get the new release of Babel 2.0!

Written by Bj Rollison

November 18th, 2009 at 8:10 pm

Posted in Testing Tools


Test Automation: Saving Random Data

with 4 comments

Originally Published Tuesday, May 13, 2008

Now, many of you probably know that I am a big fan of computer generated random test data that represents a reasonable sample data set from the total population of possible test data. (I refer to this as probabilistic stochastic test data.) So, why would I argue against preserving randomly generated test data?

I just returned from STAREast, where for the second time in a month I heard someone suggest storing randomly generated test data in a file. Many people will cite the inability to recreate random test data as a drawback to using randomly generated test data in a test. So, the reason these people suggested storing the random data in a file is so they can easily repeat a test with the same data should some randomly generated test data expose an anomaly. I absolutely concur that if we generate random test data, and that test data exposes a problem, we need a way to recreate the data. But, isn’t there a better way than to save random test data in a file?

Saving randomly generated test data to a file creates a test artifact. Depending on how much random data is generated, this file could become quite large. Also, saving data to a file impacts the performance of an automated test and certainly slows down manual execution of tests. Then consider that the tests that generate random test data are executed numerous times throughout the lifecycle, and it doesn’t take long until we have countless test artifacts simply storing more static test data that quickly loses its value (especially if no problems were detected). Of course, we can easily delete the files after the test if no anomaly was detected, but I suspect that most testers will delete those files upon the completion of the test if no problems were detected.

So, the question is how can we reproduce computer generated probabilistic stochastic test data if we don’t save that randomly generated data to a file?

Planting Seeds

In computing, a seed is simply an integer value that is used by a random generator as the starting value. If we pass a seed value as an argument to a given random generator then we will consistently get the same sequence of random values each and every time. Essentially, a seed allows us to replicate computer generated probabilistic stochastic test data anytime, as long as we use the same seed and the same random generator algorithm. So, instead of saving each and every piece of randomly generated test data used in any given test, we can simply log the seed value used by that test in the test results log file.
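To see the replay property in isolation, here is a minimal sketch (the class and method names are hypothetical) that draws values from two separately constructed `System.Random` objects. It relies on the same-algorithm condition holding, that is, on both runs using the same .NET implementation of `Random`.

```csharp
using System;
using System.Linq;

public static class SeedReplaySketch
{
    // Returns the first 'count' values System.Random produces for a seed.
    public static int[] Sequence(int seed, int count)
    {
        var random = new Random(seed);
        var values = new int[count];
        for (int i = 0; i < count; i++)
        {
            values[i] = random.Next();
        }
        return values;
    }

    // True when two independent generators built from the same seed agree.
    public static bool Replays(int seed, int count)
    {
        return Sequence(seed, count).SequenceEqual(Sequence(seed, count));
    }
}
```

A test log therefore only needs the single seed integer, and `Sequence(loggedSeed, n)` recreates the first n values on demand.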

But, if we use the same seed all the time, then we are simply generating the same data over and over again. And, manually inputting a seed for each test that generates probabilistic stochastic test data is not an ideal situation, especially for automated tests. So, to solve that problem we can randomly generate a seed value that is then passed to the random generator algorithm!  Again, logging the randomly generated seed allows us to accurately reproduce the probabilistic stochastic test data at any later time.

The example below illustrates a simple method in C# that will either generate a random seed or return a user specified seed value.

// Both example methods belong in the same class and need "using System;"
public static int GetSeedValue(string seedValue)
{
    // Check whether a user specified seed value was passed as an argument
    // to the seedValue parameter (Console.ReadLine can also return null)
    if (string.IsNullOrWhiteSpace(seedValue))
    {
        // Create a new random object
        Random randomObject = new Random();
        // Generate a non-negative random integer less than 2,147,483,647
        return randomObject.Next();
    }
    else
    {
        // Convert the seedValue to an integer value
        // NOTE: This example method does not include exception handling
        return int.Parse(seedValue);
    }
}

The following example illustrates how to use this method to get a random seed value to generate random strings and numbers that increase the breadth of test data coverage in each subsequent iteration of a test.

static void Main(string[] args)
{
    // These variables declare the range of characters used for the
    // string test data. In this case the strings are composed of upper
    // case ASCII characters 'A' through 'Z'
    char minChar = '\u0041';
    char maxChar = '\u005A';

    // This reads the user specified seed value from the console window.
    // If no seed value is specified an empty string is passed to the
    // GetSeedValue method, which will cause it to generate a random
    // seed value.
    string mySeed = Console.ReadLine();

    // Declare a seed variable and initialize it to either the user
    // specified seed or to a computer generated random seed value
    int seed = GetSeedValue(mySeed);

    // The seed value should be permanently recorded in the logged
    // results for this test
    Console.WriteLine("The seed value for this test is {0}\n", seed);

    // Create a new random object based on the seed
    Random randomGeneratorObject = new Random(seed);

    // Generate 10 random strings
    for (int count = 0; count < 10; count++)
    {
        // Declare and initialize a string variable for our test data
        string testString = string.Empty;
        // Pick a random string length between 1 and 10 characters once,
        // before the loop, so the loop bound does not change mid-string
        int stringLength = randomGeneratorObject.Next(1, 11);
        for (int length = 0; length < stringLength; length++)
        {
            // Generate a random character within the defined range and
            // concatenate it to the testString variable until the
            // random string length has been reached
            testString += Convert.ToChar(randomGeneratorObject.Next(
                minChar, maxChar + 1)).ToString();
        }

        // Write the test string to the console window
        Console.WriteLine("Test String {0}: {1}", count + 1, testString);
    }

    Console.WriteLine("\nRandom numbers");
    // Generate 5 random numbers
    for (int numberCount = 0; numberCount < 5; numberCount++)
    {
        Console.WriteLine("{0} ", randomGeneratorObject.Next());
    }
}

Running the program and entering an integer value between 0 and 2,147,483,647 at the console will generate 10 random length strings composed of random upper case characters between ‘A’ and ‘Z’, followed by 5 random numbers. If no seed is entered, the GetSeedValue method generates a random seed value for use in the test. Of course, entering the same integer value will produce the same strings and numbers each and every time.

Using probabilistic stochastic test data is valuable because it efficiently increases the breadth of data coverage, and significantly augments ‘typical’ static test data, user-generated test data, or static test data derived from historical failure indicators. But, instead of storing randomly generated test data in a file, it is a best practice to simply record the seed value of each test. With a seed value we can easily recreate the computer generated random test data should any of the random data used in a test expose an anomaly.

Written by Bj Rollison

November 18th, 2009 at 6:31 pm

Posted in Test Automation


Babel – A ‘New’ Random Unicode String Generator Test Tool

without comments

Originally Published Thursday, September 20, 2007

For some time I have wanted to add surrogate pair character support to a tool I developed called GString, and this week I managed to find some time to do that work and more! As I developed the methods for surrogate pair support I rewrote (refactored in developer parlance) some of the previous methods to reduce complexity. And wouldn’t you know it…the simple act of refactoring exposed some otherwise hard to find defects (and one pretty obvious one). I discovered these defects because I had to approach the problem space from a different perspective, and that perspective (working primarily with int types instead of char types) exposed the problems.

So, I decided to retire the GString code base, and I ported what I could into a new tool named Babel (and this is my shameless plug for that tool). I know it is not ‘customer friendly’ when someone goes and renames a tool, especially when it comes with a library for test automation, because now the ‘customer’ has to change their references in order to use the functionality in the new DLL. However, the name Babel seems more fitting for the purpose of this tool, which is to generate random characters across the Unicode spectrum of language scripts; and besides, Java also has a class called GString and I didn’t want to cause any confusion. :-)

The obvious bug fixed in Babel is a problem that occurred when generating characters in the ASCII only range. For some bizarre reason I neglected to exclude Japanese half-width katakana characters (and for an even more bizarre reason I failed to find it, which is a really good reason why unit testing only goes so far and we really need a second set of eyes for sufficient testing). One of the not so obvious defects was the exclusion of a range of code points between U+1A20 and U+1AFF instead of U+1B80 and U+1CFF. (This was a classic boundary bug! But without a formal code review it is unlikely this one would ever have been found.) The other not so obvious defect that has been fixed involved the program’s inability to exclude some valid Unicode code points that have not been assigned a character if the user selected to exclude unassigned code points (again a similar problem to that described above).
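A boundary defect like the range mix-up above is the kind of thing a small table-driven check can catch. The following is a sketch only. The names are hypothetical and it is not how GString or Babel store their ranges; it simply probes the values on and just outside each edge of an inclusive range.

```csharp
using System;

public static class RangeBoundarySketch
{
    // True when codePoint lies inside the inclusive range [first, last].
    public static bool InRange(int codePoint, int first, int last)
    {
        return codePoint >= first && codePoint <= last;
    }

    // Values just outside and exactly on each edge of an inclusive range.
    public static int[] BoundaryValues(int first, int last)
    {
        return new[] { first - 1, first, last, last + 1 };
    }

    // Checks an implemented range against the intended one at the boundaries.
    public static bool MatchesAtBoundaries(int implFirst, int implLast, int first, int last)
    {
        foreach (int codePoint in BoundaryValues(first, last))
        {
            if (InRange(codePoint, implFirst, implLast) != InRange(codePoint, first, last))
            {
                return false;
            }
        }
        return true;
    }
}
```

Checking an implementation that uses U+1A20 through U+1AFF against the intended U+1B80 through U+1CFF fails at the very first in-range boundary value, U+1B80.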

The good news is these are now fixed, and the new Babel tool also includes support for Unicode surrogate pair characters in the range of U+10000 through U+10FFFF as an option. Also, I included a feature to save the output to a text file rather than having to copy and paste. The installation package includes a desktop tool, a DLL for test automation, and the user’s guide, and can be found at Testing Mentor.

If you encounter any problem using the tool, or if you have any feedback please let me know. Enjoy!

Written by Bj Rollison

November 13th, 2009 at 9:27 pm

Random Test Data Generation

with 2 comments

Originally Published Wednesday, May 30, 20

I am not a big fan of static test data, and this month’s issue of Software Testing and Performance magazine includes an article I wrote outlining one approach for generating random string data (although the basic concepts can be used for generating other types of random data).

Unfortunately, it appears that some of the numbers got a little screwed up and the printer did not superscript the exponents correctly so the numbers in the third paragraph are probably looking pretty strange. So, to clarify, the paragraph should read:

Using only the characters ‘A’ – ‘Z’, the total number of possible character combinations for a filename with an 8-letter base filename and a 3-letter extension is $26^{8} + 26^{3}$, or 208,827,082,152. If we were assigned to test long filenames on a Windows platform using only ASCII characters (see Table 1), the number of possibilities increases because there are 86 possible characters we can use in a valid filename or extension, and with a maximum base filename length of 251 characters and a 3-character extension the number is $86^{251} + 86^{3}$. Trust me, that is one big number.

(NOTE: There have been several assertions regarding the above formula for determining the number of tests, so here is the explanation. Essentially, the Windows platform file system treats the base filename and the file extension as 2 separate components, and there is no interaction or dependency between these two components. (For example, we cannot save a file as CON.txt, but we can save a file as myFile.CON.) Since there are no dependencies between the base filename component and the extension component, they are treated as 2 independent parameters, which mathematically results in $26^{8} + 26^{3}$, or 208,827,082,152 tests if we elect to test all possible combinations of the base filename component with a nominal valid extension, and then test all possible extension component combinations with a nominal valid base filename. One could argue we could combine the 17,576 unique 3-character extension combinations with various combinations of the 8-character base filename component to reduce the overall number of tests by 17,576; however, I choose not to use that approach and instead test each parameter independently. If we mistakenly assumed a dependency or inter-relationship between the base filename and extension components of a filename on the Windows platform, testing all combinations ($26^{8} \times 26^{3}$, or simply $26^{11}$) would result in approximately 3,670,135,659,905,624 redundant tests (if we could do exhaustive testing). This is where in-depth knowledge of the ‘system’ really pays off.)
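The arithmetic in the note is easy to check with arbitrary-precision integers. The sketch below uses `System.Numerics.BigInteger`; the class and method names are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Numerics;

public static class FilenameCombinations
{
    // Base filename and extension tested as two independent parameters.
    public static BigInteger Independent(int alphabetSize, int baseLength, int extensionLength)
    {
        return BigInteger.Pow(alphabetSize, baseLength)
             + BigInteger.Pow(alphabetSize, extensionLength);
    }

    // Every base filename combined with every extension.
    public static BigInteger Combined(int alphabetSize, int baseLength, int extensionLength)
    {
        return BigInteger.Pow(alphabetSize, baseLength + extensionLength);
    }

    // Extra tests incurred by wrongly assuming the two parts interact.
    public static BigInteger Redundant(int alphabetSize, int baseLength, int extensionLength)
    {
        return Combined(alphabetSize, baseLength, extensionLength)
             - Independent(alphabetSize, baseLength, extensionLength);
    }
}
```

For the 8.3 case with 26 letters, `Independent(26, 8, 3)` is 208,827,082,152, `Combined(26, 8, 3)` is 3,670,344,486,987,776, and `Redundant(26, 8, 3)` is 3,670,135,659,905,624.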

Of course, the filename length and extension length are variable. Also, 251 characters assumes a base filename component length from the root directory (it does not take into account the MAX_PATH constant). So, the total number of combinations using only ASCII characters is much greater, because the number of base filename components with a ‘default’ 3-letter extension from the root directory is actually $86^{251} + 86^{250} + 86^{249} + 86^{248} + 86^{247} + \dots + 86^{1}$. Then, of course, vary the length of extensions, and the total number of combinations increases even further. But, all this is only to provide some sense of the magnitude of the testing problem.
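The sum of powers of 86 over every base filename length from 1 to 251 is a finite geometric series, so it can be written in closed form. This is a worked restatement of the sum above, not a figure taken from the article:

```latex
\sum_{k=1}^{251} 86^{k} \;=\; \frac{86\,(86^{251} - 1)}{86 - 1} \;=\; \frac{86^{252} - 86}{85}
```

Since $86/85$ is only slightly more than 1, the total is only about 1.2% larger than its largest term, $86^{251}$. Nearly all of the combinations come from the longest filenames.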

Also, the equivalence class table (Table 2) is simplified and does not include reserved device names. For example, Windows will/should prevent a user from saving a file named LPT1, or COM6, or CON, etc. (The behavior for saving filenames with strings composed of reserved device names is different on Windows XP and Windows Vista…Vista finally got it right!)

Unfortunately, I did not get a chance to read the edited copy before print, but I think the basic idea comes through. I hope you find value in using intelligent random test data in your testing, and I would be interested in hearing your feedback.

Written by Bj Rollison

November 12th, 2009 at 7:24 pm

More on Generating Strings with Random Unicode Characters

without comments

Originally Published Sunday, December 24, 2006

Well, those of you living outside the Pacific Northwest are probably unaware of the recent wind storm, with winds gusting to 60+ miles per hour, that left more than 1 million people on the eastern side of the state without power. The damage was pretty extensive, and since I live in a fairly remote area I was without power for more than 7 days and without the Internet for almost 9 days. I do have a generator, but it hadn’t been used in almost 4 years. Sure, I started it every 6 months for about 15 minutes each time, but after the first full day of operation the generator started doing weird things. So, during the past week I have become pretty good at fixing generators (mine and my neighbors’), tracing electrical systems, troubleshooting furnace problems, splitting a lot of firewood, cutting up fallen trees, and repairing fences.

After the sun set (which is quite early) I had little else to do (other than making sure nobody stole my generator), so between stoking the fire I started developing a DLL for Unicode string generation in automated tests based on the GString utility. While checking the data tables I created for the GString utility against the Unicode Handbook I noticed some holes (OK…defects). Some of the boundaries for code ranges that are not assigned to any Unicode script group were incorrect. (That will teach me to use a web page with the listing put together by a web developer rather than using the Unicode handbook.) But, I also found a problem that prevented unassigned code points from being generated even if the Only use assigned code points check box was unchecked.

So, the (hopefully final) update to GString is complete, including the GString.DLL! Along with the massive overhaul of the Unicode data tables, the new GString package available from my personal website also includes a new DLL for anyone needing to generate strings of random Unicode characters in test automation. The GString zip file also includes detailed documentation on the utility and the DLL usage. Let me know if you have any questions about the tool or using random string generation in your testing.

Well, now back to (mostly) normal life.

Written by Bj Rollison

November 12th, 2009 at 10:38 am