Decoding the Secrets in Unicode Strings

At the end of each week, one of the last things I do is open my junk mail folder in Outlook and check whether any email was moved there inadvertently. Then I delete all the spam that lets me know that I’ve won the lottery in Ethiopia, or that my long lost relative in Chechnya left me 19 bazillion Euros, or the countless discount drug offerings. So, as I was going through my Friday evening spam mail deletion ritual, I noticed a subject line that was a bit unusual. Before you jump to any incorrect conclusions, it wasn’t about appendage enlargement or free internet dating services. The email title was in Arabic, but included a “box” character at the beginning of the string.

Now, I don’t read Arabic, but I am pretty good at noticing globalization bugs when they are staring me right in the face. The “box” character (actually a glyph) in a Unicode string means one of two things: either the Unicode code point is unassigned (no character is associated with that code point value), or the system doesn’t have a font that maps a glyph (the character we see) to that particular code point. So, curiosity got the better of me, and I decided to investigate a bit. The first thing I did was to right click on the email subject line, copy it, and paste it into Notepad, where I noticed that the “box” glyph did not appear.
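The first of those two possibilities can be checked in code. Here is a minimal sketch in Python, assuming the Unicode database bundled with the interpreter is recent enough for the data in question; the helper name `is_unassigned` is my own for illustration. It reports whether a code point has the general category `Cn`, which covers unassigned code points (and noncharacters). U+0378 is used as an example of a code point that is unassigned at the time of writing.

```python
import unicodedata

def is_unassigned(ch: str) -> bool:
    """Return True if Python's bundled Unicode database has no character
    assigned to this code point (general category 'Cn')."""
    return unicodedata.category(ch) == "Cn"

# 'A' is an assigned letter; U+0378 is assumed to still be unassigned
# in the Unicode version that ships with this Python build.
for ch in ("A", "\u0378"):
    status = "unassigned" if is_unassigned(ch) else "assigned"
    print(f"U+{ord(ch):04X} is {status}")
```

If a code point turns out to be assigned and still shows up as a box, the font (or the way the text is being rendered) is the more likely suspect.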

A few years ago I developed a utility for decoding Unicode strings, aptly called “String Decoder,” and also wrote a post that discusses the tool. So, I launched String Decoder, copied the Arabic string from Notepad, and pasted it into the tool.

The first thing I noticed when reading through the list of Unicode code point values was the value U+FEFF. Now, I happen to know that this particular value is the byte order mark (BOM). This seemed pretty unusual, and I asked myself how a BOM character could get inserted in a string. So, I looked up the character in the Unicode charts and discovered that, in the Arabic Presentation Forms-B block, it is a special character for a zero width no-break space, a use that has been deprecated. Ah, so the Unicode BOM code point value appearing in the string is not so magical after all!
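For readers who don’t have the String Decoder tool handy, the same kind of listing can be approximated in a few lines of Python. This is only a rough sketch and not how String Decoder itself works; the function name `decode_string` and the sample text are assumptions for illustration (the sample is U+FEFF followed by a short Arabic word, not the actual subject line from the email).

```python
import unicodedata

def decode_string(text: str) -> list[tuple[str, str, str]]:
    """List each character as (code point, Unicode name, general category)."""
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"
        # Some code points have no name in the database; use a placeholder.
        name = unicodedata.name(ch, "<no name>")
        category = unicodedata.category(ch)
        rows.append((code_point, name, category))
    return rows

# Hypothetical sample: U+FEFF followed by five Arabic letters.
sample = "\ufeff\u0645\u0631\u062d\u0628\u0627"
for code_point, name, category in decode_string(sample):
    print(code_point, category, name)
```

The first row reports U+FEFF as ZERO WIDTH NO-BREAK SPACE with category `Cf` (a format character), which is consistent with what the Unicode charts say about it.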

Interestingly enough, the U+FEFF character only displays as a “box” glyph in the subject line in the Junk E-mail folder. When I copied the email message from the Junk folder to my Inbox (or another folder), the code point U+FEFF was treated as a zero width non-breaking space character, so no box glyph appeared. This is due to the fact that when an email gets shunted into the Junk E-mail folder “links and other functionality have been disabled.” In other words, it is displayed as plain text.

I previously wrote about using “real world” test data for globalization testing, and this is another example of how “real-world” data can be useful in testing text inputs and outputs to evaluate how unexpected character code points in a string are parsed or handled. I think this also bolsters the argument for including some amount of test data randomization in globalization testing, using tools such as the Babel tool, to potentially expose other unexpected characters or sequences of mixed Unicode characters.
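As a small illustration of that idea, here is a hedged Python sketch that seeds an ordinary test string with U+FEFF at random positions. It is not the Babel tool and makes no claim about how Babel generates data; the helper `inject_code_point` and its parameters are hypothetical names of my own.

```python
import random

ZWNBSP = "\ufeff"  # U+FEFF, the code point found in the spam subject line

def inject_code_point(text, code_point=ZWNBSP, count=1, seed=None):
    """Insert `code_point` into `text` at `count` random positions.

    Passing a seed makes the mutation repeatable, so a failing test
    input can be regenerated later.
    """
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(count):
        # randint is inclusive, so the start and end of the string are
        # both valid insertion points.
        position = rng.randint(0, len(chars))
        chars.insert(position, code_point)
    return "".join(chars)

original = "subject line"
mutated = inject_code_point(original, count=3, seed=42)
print([f"U+{ord(ch):04X}" for ch in mutated])
```

Feeding strings like `mutated` into a text field or parser is one way to see whether the unexpected code point is stripped, preserved, or rendered as a box.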

2 Comments

  1. Hi Bj,
    I think you have made a very relevant point in mentioning test data randomization. I do use Babel for some of my testing. Quite helpful.

    Regards,
    Anuj

    [Bj's Reply] Hi Anuj, Great to hear from you again. Thank you for reiterating my point about random test data, but I am also finding some interesting anomalies in ‘real-world’ data lately in some unusual situations, and sometimes understanding that data and the context might help us identify patterns to pursue to identify other problems.

    Thursday, September 8, 2011 at 6:23 PM | Permalink
  2. Hi Bj,
    I too am a big fan of “Real World” data, but you rightly point out that the basic problem is identifying patterns around the context, which is rather difficult but not impossible, I guess. In the past I have used the classes of localized test data (approach described in the URL below) with a fair degree of success, but this approach will have a gap from real-world data.

    http://anujmagazine.blogspot.com/2010/05/uncovering-myths-about-globalization.html

    Regards,
    Anuj

    Wednesday, September 21, 2011 at 1:27 AM | Permalink
