Testing with Surrogate Code Points

It has been a very long time since my last blog post; too long. I have been extremely busy this past year and have been doing a lot of juggling. In some cases I tried juggling too many balls and dropped a few. But I have learned quite a bit during my transition from “academia” back into the product groups here at Microsoft, and I have learned a lot about what it means to be a great test lead shipping world-class software. Despite the bumps, I love my new career direction on the Windows Phone team, and I finally feel things are coming under control. So, it is time once again to share some of the things I’ve learned and continue to learn in my journey as a software tester.

Let’s start with a discussion of a problem I came across the other day while doing some testing around posts and feeds (uploads and downloads to social networks such as Twitter, Facebook, etc.). Over the years I have frequently mentioned testing with Unicode surrogate code points in strings, and using the Babel string generation tool to help increase test coverage by producing variable test data composed of characters from across the Unicode spectrum.
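To make the idea of variable test data concrete, here is a minimal Python sketch of a random string generator that mixes BMP characters with characters that need surrogate pairs in UTF-16. This is not the Babel tool or its algorithm; the function names, the code point ranges, and the `supplementary_ratio` parameter are all hypothetical and chosen only for illustration.

```python
import random

# Illustrative ranges only (assumed for this sketch); a real tool would
# cover far more of the Unicode repertoire.
BMP_RANGES = [(0x0041, 0x005A), (0x0430, 0x044F), (0x4E00, 0x4FFF)]
SUPPLEMENTARY_RANGES = [(0x1F600, 0x1F64F), (0x20000, 0x200FF)]


def random_char(rng, ranges):
    """Pick one character from a randomly chosen inclusive range."""
    low, high = rng.choice(ranges)
    return chr(rng.randint(low, high))


def random_test_string(length, supplementary_ratio=0.25, seed=None):
    """Build a string of `length` code points.

    Roughly `supplementary_ratio` of the characters come from outside the
    BMP, so they need a surrogate pair when encoded as UTF-16.
    """
    rng = random.Random(seed)  # a fixed seed makes a failing string reproducible
    chars = []
    for _ in range(length):
        if rng.random() < supplementary_ratio:
            chars.append(random_char(rng, SUPPLEMENTARY_RANGES))
        else:
            chars.append(random_char(rng, BMP_RANGES))
    return "".join(chars)
```

Recording the seed alongside each generated string is one way to replay the exact input that exposed a failure.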

Surrogate pairs are often problematic in string parsing algorithms. Unlike “typical” Unicode characters in the Basic Multilingual Plane (BMP or Plane 0), which fit in a single 16-bit UTF-16 code unit, a surrogate pair is composed of two 16-bit surrogate code units (a high surrogate followed by a low surrogate) that together represent a single character (glyph). (See definition D75 in Section 3.8, Surrogates, in The Unicode Standard Version 6.1.) Surrogate pairs typically cause problems because many string parsing algorithms assume that 1 character/glyph is 1 16-bit code value, which can lead to character (data) corruption or string buffer miscounts (which can sometimes lead to buffer overflow errors).
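The mapping from one character to two 16-bit values is simple arithmetic, and a short Python sketch may help show it. The function names here are my own for illustration; the arithmetic is the standard UTF-16 encoding of a code point above U+FFFF.

```python
from typing import List, Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_units(text: str) -> List[int]:
    """Return the 16-bit code units of `text` when encoded as UTF-16."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


high, low = to_surrogate_pair(0x1F600)     # U+1F600, an emoticon
print(f"{high:04X} {low:04X}")             # D83D DE00
print(len("\U0001F600"), len(utf16_units("\U0001F600")))  # 1 2
```

Any code that walks a UTF-16 buffer one 16-bit value at a time sees two "characters" here, which is where the miscounts and the split-pair corruption come from.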

As an example of string buffer miscounting, let’s take a look at Twitter. It is generally well known that Twitter has a character limit of 140 characters. But when a string of 140 characters contains surrogate pairs, it seems that Twitter doesn’t know how to count them correctly and displays a message stating, “Your Tweet was over 140 characters. You’ll have to be more clever.”

Well Twitter…I was being clever! I was clever enough to expose an error path caused by a mismatch between the code that counts character glyphs and the code that realizes there are more than 140 16-bit character code points.

Although there is a counting mismatch at least Twitter preserved the character glyphs for surrogate code points in this string.
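The kind of mismatch described above can be reproduced in a few lines of Python. This is only a sketch of two counting strategies disagreeing, not Twitter's actual code; the function names are hypothetical, and counting code points is used here as a rough stand-in for counting glyphs.

```python
LIMIT = 140  # the character limit quoted above


def code_point_count(text: str) -> int:
    """Count characters as code points (Python's native notion of length)."""
    return len(text)


def utf16_unit_count(text: str) -> int:
    """Count 16-bit code units, as a UTF-16 based length check would."""
    return len(text.encode("utf-16-le")) // 2


def limit_checks_disagree(text: str, limit: int = LIMIT) -> bool:
    """True when the text fits by code points but not by UTF-16 code units."""
    return code_point_count(text) <= limit < utf16_unit_count(text)


# 130 ASCII letters plus 10 characters from outside the BMP: 140 characters
# to the user, but 150 16-bit values to a code-unit counter.
tweet = "a" * 130 + "\U00020000" * 10
print(code_point_count(tweet), utf16_unit_count(tweet))  # 140 150
print(limit_checks_disagree(tweet))                      # True
```

A boundary test at exactly the limit, with a handful of supplementary characters mixed in, is a cheap way to find out which counter each layer of an application uses.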

Unfortunately, TweetDeck is what I refer to as a globalization stupid application. TweetDeck doesn’t have a problem with character count mismatches because it breaks horribly when surrogate code points are used in a string.

There is some really wicked character parsing when the string is pasted into TweetDeck. TweetDeck solves the character count problem by stripping every character that is not an ASCII character from the string. (Note: the “W” character is a full-width Latin character U+FF37, not the Latin W U+0057.)
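The effect can be simulated to show what a test oracle for this kind of data loss might look like. The sketch below is an assumption about the observable behavior only, not TweetDeck's implementation, and both function names are hypothetical.

```python
from typing import List


def strip_non_ascii(text: str) -> str:
    """Simulate an application that silently drops every non-ASCII character."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def lost_characters(original: str, displayed: str) -> List[str]:
    """List the code points from `original` that are missing in `displayed`."""
    return [f"U+{ord(c):04X}" for c in original if c not in displayed]


sample = "\uFF37eb \U00020000 test"   # full-width W, then a non-BMP character
shown = strip_non_ascii(sample)
print(repr(shown))                     # 'eb  test'
print(lost_characters(sample, shown))  # ['U+FF37', 'U+20000']
```

Comparing what was submitted against what the application shows back, code point by code point, flags the loss even when the remaining text still looks plausible.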

I find it hard to believe that a modern application would limit the range of characters it allows customers to use, especially an application targeted towards users of the World Wide Web.

2 Comments

  1. Johan Hoberg wrote:

    Interesting article!

    It will be very interesting to hear more on your thoughts about testing in the context of Windows Phone.

    Monday, May 21, 2012 at 11:50 PM | Permalink
  2. Aruna wrote:

    All the best for your transition from academia to product career and congratulations!

    In real-time it is very rare for customers to enter such characters into twitter, what are your thoughts on that? Nevertheless you really know how to break the system, good find!

    [Bj's Reply] Hi Aruna, thank you. It is great to hear from you again.

    The probability of customers using these characters on Twitter or other social messaging services is going to increase in the near future. For example, emoji/emoticons (http://www.unicode.org/charts/PDF/U1F600.pdf) are quite popular in text messages, and the Unicode 6.1 spec has defined a lot of new emoji/emoticon characters that are encoded as surrogate pairs in UTF-16 (see http://www.unicode.org/charts/PDF/U1F300.pdf). Many devices are also starting to incorporate these characters into their font sets.

    Also, many of the characters that require surrogate pairs are Chinese characters, and the Chinese market is expanding at a pretty wild pace.

    Wednesday, June 6, 2012 at 6:22 AM | Permalink
