It has been a very long time since my last blog post; too long. I have been extremely busy this past year, juggling a lot of projects, and in some cases I tried to juggle too many and dropped a few. But I have learned quite a bit during my transition from "academia" back into the product groups here at Microsoft, and a lot about what it means to be a great test lead shipping world-class software. Despite the bumps, I love my new career direction on the Windows Phone team, and I finally feel things are coming under control. So, it is time to once again share some of the things I've learned, and continue to learn, in my journey as a software tester.
Let’s start with a discussion of a problem I came across the other day while doing some testing around posts and feeds (uploads and downloads to social networks such as Twitter, Facebook, etc.). Over the years I have frequently mentioned testing with Unicode surrogate code points in strings, and using the Babel string generation tool to help increase test coverage by producing variable test data composed of characters from across the Unicode spectrum.
Surrogate pairs are often problematic for string parsing algorithms. Unlike “typical” 16-bit Unicode characters in the Basic Multilingual Plane (BMP, or Plane 0), surrogate pairs are composed of two 16-bit Unicode code points that are mapped together to represent a single character (glyph). (See definition D75, Section 3.8, Surrogates, in The Unicode Standard, Version 6.1.) Surrogate code points typically cause problems because many string parsing algorithms assume that one character/glyph is one code point value, an assumption that can lead to character (data) corruption or string buffer miscounts (which can sometimes lead to buffer overflow errors).
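A minimal sketch of the mismatch is easy to demonstrate. The example below uses U+1D11E (MUSICAL SYMBOL G CLEF), a character outside the BMP, and Python's `utf-16-le` codec to count 16-bit code units the way a UTF-16-based API would:

```python
# One visible character (glyph) can occupy two 16-bit UTF-16 code units.
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so UTF-16 encodes
# it as the surrogate pair D834 DD1E.

clef = "\U0001D11E"          # one character, one code point

# Count of Unicode code points -- what the user perceives as one character
print(len(clef))             # 1

# Count of 16-bit UTF-16 code units -- what a UTF-16-based API sees
utf16_units = len(clef.encode("utf-16-le")) // 2
print(utf16_units)           # 2

# A parser that assumes "1 glyph == 1 code unit" miscounts any string
# containing such characters.
```

Any code path that mixes the two counts, treating one as if it were the other, produces exactly the kind of buffer miscount described above.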
As an example of string buffer miscounting, let’s take a look at Twitter. It is generally well known that Twitter has a limit of 140 characters. But when a string of 140 characters contains surrogate pairs, it seems that Twitter doesn’t know how to count them correctly and displays a message stating, “Your Tweet was over 140 characters. You’ll have to be more clever.”
Well, Twitter…I was being clever! I was clever enough to expose an error path caused by a mismatch between the code that counts character glyphs and the code that decides there are more than 140 16-bit character code points.
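The mismatch can be reproduced in a few lines. Assuming the length check counts 16-bit code units (which is what the error message suggests), a tweet of exactly 140 surrogate-pair characters looks twice as long to that check:

```python
# A tweet that is exactly 140 characters (code points) long, but twice
# that many 16-bit UTF-16 code units, because every character is encoded
# as a surrogate pair.

tweet = "\U0001D11E" * 140   # 140 glyphs of U+1D11E (G clef)

code_points = len(tweet)
utf16_units = len(tweet.encode("utf-16-le")) // 2

print(code_points)  # 140 -- what the glyph-counting code path sees
print(utf16_units)  # 280 -- what a UTF-16 code-unit length check sees

# A limit enforced against utf16_units rejects this "140-character"
# tweet as over the limit.
```

The character and code-unit counts here are real; whether Twitter's rejected-tweet path actually counts UTF-16 code units is my inference from its behavior, not something I can see in its source.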
Although there is a counting mismatch at least Twitter preserved the character glyphs for surrogate code points in this string.
Unfortunately, TweetDeck is what I refer to as a “globalization stupid” application. TweetDeck doesn’t have a problem with character count mismatches, because it breaks horribly as soon as surrogate code points are used in a string.
There is some really wicked character parsing when the string is pasted into TweetDeck. TweetDeck “solves” the character count problem by stripping from the string any character that is not an ASCII character. (Note: the “W” character is a full-width Latin W, U+FF37, not the Latin W, U+0057.)
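TweetDeck's actual code isn't public, but a naive filter that produces the symptom described above, silently dropping every non-ASCII code point, might look like this hypothetical sketch:

```python
# Hypothetical illustration of the destructive behavior described above:
# a naive filter that discards every non-ASCII code point. This models
# the observed symptom, not TweetDeck's actual implementation.

def ascii_only(s: str) -> str:
    return "".join(ch for ch in s if ord(ch) < 128)

# U+FF37 (FULLWIDTH LATIN CAPITAL LETTER W) looks like "W" but is not
# ASCII, so it is silently discarded along with the surrogate-pair
# character U+1D11E.
sample = "W\uFF37\U0001D11E!"
print(ascii_only(sample))    # "W!" -- both non-ASCII characters are gone
```

The worst part of a filter like this is that nothing fails loudly: the customer's text is simply corrupted on the way through.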
I find it hard to believe that a modern application would limit the range of characters its customers can use, especially an application targeted at users of the World Wide Web.