Testing non-English and non-ASCII (and/or Unicode) support in a product requires tests and test plans that exercise the edge cases in the software. This means using collections of characters and formats known to cause problems and which are engineered to demonstrate that the product is working correctly. This document contains a number of useful small data sets (with reasons for each) for use when performing this kind of testing.
Before testing can begin, the test matrix needs to be planned. A typical test matrix will include platforms and versions for all of the components that are necessary to the application and which will be supported in production. Each time there can be two or more items of a particular "type", the matrix expands by one dimension.
For example, if you are testing an application that runs on Windows, AIX, HP-UX, Solaris, and Linux, then you have one dimension with five entries. You then need to enumerate versions (do you mean Solaris 5.7, 5.8, 5.9? some combination?). If the application uses a database, then that is a second dimension. A browser might be a third. And so on.
Internationalization testing will typically add these dimensions to the test matrix:
LANG environment variable (or via the GUI login). Databases and other server type components may have their own locale model which must be considered separately. For example, Oracle has NLS_LANG.
Notice that each of these "matrix expanding items" can be separately applied to each matrix component. In other words, a matrix entry might be "Solaris 5.9 in the ja_JP.PCK locale" or "French Windows 2000 SP4". And in the same matrix you might have an "Oracle 9.1.2, AL32UTF8 encoding, AMERICAN_AMERICA regional settings, America/New_York time zone".
For "single-box" testing (in which all of the components are hosted on the same system), this may mean merely a bigger matrix. For client-server or client-server-resource configurations this may be much more complex. You can't test every locale and encoding with every component in every combination. You'll have to prune the matrix to make it more manageable.
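One common way to prune a matrix is to require only that every pair of values from two different dimensions appears together at least once. The following Python sketch illustrates that idea with a simple greedy selection. The dimension values are hypothetical and chosen only for illustration, and pairwise coverage is one possible pruning strategy, not one prescribed by this document.

```python
from itertools import product

# Hypothetical dimension values, for illustration only.
platforms = ["Solaris 5.9", "Windows 2000", "AIX 5.2"]
locales = ["en_US", "fr_FR", "ja_JP"]
db_encodings = ["AL32UTF8", "WE8ISO8859P15", "JA16SJIS"]

full_matrix = list(product(platforms, locales, db_encodings))


def pairs_of(config):
    """All (dimension, value, dimension, value) pairs present in one configuration."""
    return {(i, config[i], j, config[j])
            for i in range(len(config))
            for j in range(i + 1, len(config))}


def prune_pairwise(dimensions):
    """Greedily pick configurations until every cross-dimension pair is covered."""
    candidates = list(product(*dimensions))
    uncovered = set()
    for config in candidates:
        uncovered |= pairs_of(config)
    chosen = []
    while uncovered:
        # Take the configuration that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen


pruned = prune_pairwise([platforms, locales, db_encodings])
print(len(full_matrix), "full configurations;", len(pruned), "after pairwise pruning")
for config in pruned:
    print(config)
```

A pruned matrix like this still needs a human review: important markets may justify adding back specific combinations that the greedy selection dropped.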
Non-English configurations prove that the underlying code is locale-sensitive and can perform normally in different language environments. Generally a mix of European and Far East Asian locales can provide basic assurance that the code will handle these cases, although full platform certification is justified for important markets.
Non-ASCII data proves that the code can correctly process data that is not restricted to ASCII characters. Non-ASCII data includes handling of different character encodings (sometimes erroneously referred to as charsets) as well as specific collections of Unicode data. The edge cases for testing a system's processing capabilities are almost always outside the ASCII repertoire of characters. You should use non-ASCII data in all of your mainline testing, regardless of system configuration. That is: this kind of testing does not require a non-English configuration.
Ultimately, using non-English configurations and non-ASCII data should come as early in the development, regression, and test cycles as possible, since internationalization problems will turn up sooner. "Storing the pain" until late in the cycle is a recipe for missing a release or compromising on quality for international customers.
There are different types of internationalization testing as well. Development teams are sometimes confused by the different ways in which a system can be tested and have coverage gaps as a result. The different types of testing are not blanket processes (although they can be). Generally each of the considerations listed below should be considered for each new product feature to see if it is applicable and test cases written with these issues in mind:
Building internationalized test cases is not much different than building normal test cases.
In testing, you need to find ways to stress the system's weak spots. International testing focuses on those things that change in international operation. This includes doing the following:
In the sections below you will see tests and data that involve a wide array of Unicode characters. It is important not to focus exclusively on a specific locale, script, or writing system when performing testing. Code must execute correctly in any configuration and process and display data correctly. Although it is tempting to ignore the bidi and complex script examples, these are actually some of the most important to test in any configuration.
In writing your test cases, it is generally best to include the international configuration information in every test, instead of creating segregated sets of tests for international configurations. If you automate your regression suite or smoke tests, you should execute the tests on many (or every) language/locale/encoding configuration to get broad coverage.
What does the specification look like? Here is the specification for one project before international testing was added:
| Platform | Oracle 10g R1 | RAC 10g R1 |
|---|---|---|
| Solaris 2.8 | X | N/A |
| Solaris 2.9 | X | N/A |
| HP-UX 11.11 (11i) | X | N/A |
| HP-UX 11i v2 on Itanium | X | N/A |
| HP-UX 11i v2 on PA-RISC | N/A | N/A |
| AIX 5.2 (5L) | X | X |
| AIX 5.3 | N/A | N/A |
| Tru64 5.1b | X | X |
| Windows 2000 Server | X | N/A |
| Windows 2003 Server (32 bit only) | X | N/A |
| RHAS 2.1 | N/A | N/A |
| RHAS 3.0 | X | N/A |
That's a total of eleven configurations tested covering just one encoding/locale combination (U.S. English with US-ASCII). Here's how the same matrix might look with international testing integrated:
| Platform | Oracle 10g R1 | RAC 10g R1 |
|---|---|---|
| Solaris 2.8 | ja PCK, AL32UTF8 | N/A |
| Solaris 2.9 | ja.EUC, JA16EUC | N/A |
| HP-UX 11.11 (11i) | en-US, 8859P1 | N/A |
| HP-UX 11i v2 on Itanium | en-US, AL32UTF8 | N/A |
| HP-UX 11i v2 on PA-RISC | N/A | N/A |
| AIX 5.2 (5L) | ja | en |
| AIX 5.3 | N/A | N/A |
| Tru64 5.1b | en | en |
| Windows 2000 Server | en, fr | N/A |
| Windows 2003 Server (32 bit only) | en | N/A |
| RHAS 2.1 | N/A | N/A |
| RHAS 3.0 | en, ja | N/A |
That's a total of thirteen configurations to test covering six different encoding/locale combinations (three languages, four encodings). With only two additional total configurations you get six times the international coverage.
You will also have to develop more specific configuration documentation, data sets, and other materials suitable for testing the specific configurations you choose to expand your matrix with. Here are some examples of configuration descriptions:
CONFIGURATION A:
OS: Windows 2000
Language: French
Locale: French/France
Native Encoding: (code page 1252)
Install With: Oracle 9i with NLS_LANG=FRENCH_FRANCE.WE8ISO8859P15 - test data contains all values 0x80->0xFF
Test Goal: encoding handling
Test Goal: date, time, timestamp, timestamp with time zone database types
Test Goal: number types

CONFIGURATION B:
OS: Solaris 5.9
Language: Japanese
Locale: ja_JP.EUCJP
Native Encoding: EUC-JP
Install With: Oracle 10g with NLS_LANG=JAPANESE_JAPAN.JA16SJIS
Test Goal: Japanese platform certification

CONFIGURATION C:
OS: Windows XP SP2
Language: U.S. English
Locale: English/United States
Native Encoding: (code page 1252)
Install With: Oracle 8.1.7 with NLS_LANG=AMERICAN_AMERICA.AL32UTF8
Test Goal: Unicode handling, encoding handling

CONFIGURATION D:
Machine 1 -- OS: Solaris 5.8
Language: English
Locale: en_US.UTF-8
Native Encoding: UTF-8
Time Zone: America/New_York (GMT-05:00)
Machine 2 -- OS: Solaris 5.8
Language: English
Locale: en_US.UTF-8
Native Encoding: UTF-8
Time Zone: America/Los_Angeles (GMT-08:00)
Test Goal: Cross time zone handling
The basic kinds of non-ASCII data include:
There are several ways to assemble data sets that use non-ASCII values.
First, you'll want to create specific tests for non-ASCII handling.
If you are testing with a specific encoding (such as Latin-1, Shift-JIS, etc.), then you need to create data that represents the full range of characters in that encoding. This includes provoking "state shifts" in stateful encodings.
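To make the idea of a "state shift" concrete, here is a small Python sketch. It uses ISO-2022-JP as the stateful encoding (an assumption of this example; the text does not name one) and relies on Python's built-in `iso2022_jp` codec. Mixing ASCII and Japanese in one value forces the encoder to emit escape sequences in both directions.

```python
# ISO-2022-JP is stateful: escape sequences switch between ASCII and JIS X 0208.
sample = "abc\u65e5\u672c\u8a9eabc"  # "abc" + Japanese for "Japanese language" + "abc"
encoded = sample.encode("iso2022_jp")

ESC_TO_JIS = b"\x1b$B"    # escape sequence that shifts into JIS X 0208
ESC_TO_ASCII = b"\x1b(B"  # escape sequence that shifts back to ASCII

print(encoded)
print("shifts into JIS X 0208:", encoded.count(ESC_TO_JIS))
print("shifts back to ASCII:  ", encoded.count(ESC_TO_ASCII))

# Whatever the product does with the bytes, they must decode to the original text.
round_tripped = encoded.decode("iso2022_jp")
print("round trip ok:", round_tripped == sample)
```

Test data built this way is useful for finding code that truncates or splits byte strings without tracking the shift state.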
You'll also want to figure out size limitations in particular and test these as well. This kind of testing focuses on putting non-ASCII values into the system and then testing them immediately after the operation is complete, that is, specifically testing non-ASCII and encoding handling. For example, if you are testing a product that writes to an Oracle database, you know that a varchar2(30) field can hold 30 bytes. Writing different multibyte values can test whether truncation works correctly.
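The byte-limit case can be sketched in Python as follows. The helper name `truncate_to_bytes` is hypothetical, and UTF-8 is assumed as the storage encoding; the point is only to show what "correct" truncation looks like so that test values can be checked against it.

```python
def truncate_to_bytes(text: str, limit: int, encoding: str = "utf-8") -> str:
    """Return the longest prefix of text whose encoded form fits in limit bytes."""
    encoded = text.encode(encoding)
    if len(encoded) <= limit:
        return text
    # errors="ignore" drops a partial multibyte sequence left at the cut point,
    # so the result never ends in half a character.
    return encoded[:limit].decode(encoding, errors="ignore")


LIMIT = 30  # for example, a 30-byte database column

ascii_value = "a" * 31                 # 31 bytes
kana_value = "\u3042" * 11             # 11 hiragana, 3 bytes each: 33 bytes
mixed_value = "a" + "\u3042" * 10      # 31 bytes; the cut lands inside a character

for value in (ascii_value, kana_value, mixed_value):
    kept = truncate_to_bytes(value, LIMIT)
    print(len(value), "chars ->", len(kept), "chars,",
          len(kept.encode("utf-8")), "bytes")
```

The mixed value is the interesting one: a naive cut at byte 30 leaves two bytes of a three-byte character, which is exactly the kind of damage this testing should catch.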
But you'll also want to create non-ASCII data for complete system testing. That is, create non-ASCII data to pass through the whole system during regular testing so that any encoding handling problems in the application will surface as testing proceeds. The most common data used for this is pseudo-translated data. In pseudo-translated data, the values are not selected for their character properties. They are chosen instead to ensure that poor encoding handling, font problems, hard coded strings, and so on are visible plainly in the application.
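A pseudo-translation step can be very small. The Python sketch below is one possible approach, not a standard algorithm: the function name, the accent mapping, the bracket markers, and the 30% padding are all assumptions made for illustration. It does not protect format placeholders such as `%s` or `{0}`, which a real tool would need to do.

```python
# Hypothetical mapping from ASCII vowels to accented look-alikes.
ACCENTED = str.maketrans({
    "a": "\u00e0", "e": "\u00e9", "i": "\u00ee", "o": "\u00f5", "u": "\u00fc",
    "A": "\u00c0", "E": "\u00c9", "I": "\u00ce", "O": "\u00d5", "U": "\u00dc",
})


def pseudo_translate(message: str, expansion: float = 0.3) -> str:
    """Return a readable but obviously non-English version of message."""
    body = message.translate(ACCENTED)
    # Padding simulates the text expansion typical of real translations.
    padding = "~" * max(1, round(len(message) * expansion))
    # The brackets show truncation; the CJK characters expose font and encoding problems.
    return "[\u6587" + body + padding + "\u5b57]"


print(pseudo_translate("Save file"))
```

Any string that still appears in plain English after this step is hard coded, and any string missing its closing bracket has been truncated.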
It is also important to have values that are used to ensure that a system can handle specific kinds of display issues, such as bidi text, complex scripts, and the like.
As with encoding testing, you face different goals at different points in the development cycle. True product certification requires localized product (the real shrink-wrapped local version, not the English version configured to run in that locale). Many products today produce global binaries, in which the real local product is not different in any way from the U.S. English product. But this is a relatively recent development in software localization and distinguishing which is which can be difficult. In addition, even though the binaries are identical, data distributed with the systems may be different.
The classic example is the difference in Windows between the Administrator account in English and the Administrateur account in French: the account information is the same, but the name varies between language versions. Setting the locale in English Windows won't expose code that wrongly assumes the account name is the same everywhere.
Before starting you should install the necessary support for performing testing onto your computer systems. This includes:
Fonts: You should install Unicode fonts that cover the complete repertoire. This includes the Arial Unicode MS font included with MS Office and James Kass's Code2000 and Code2001 fonts.
Keyboards: These allow you to enter non-ASCII data. On Windows install the Japanese IME, the French keyboard and the Chinese-Taiwan Unicode keyboard. This last item allows you to enter Unicode code points by typing their hex value.
Testing basic non-ASCII support requires some knowledge of proper behavior in a system.
The main thing you are looking for is data degradation. This includes several different symptoms that you should learn to recognize. The basic problems are:

Tofu is sometimes what you see when characters have been turned into the mojibake: that's because unassigned characters or unusual characters often won't be in your current font. Be sure to distinguish between expected tofu and garbage.

Two different kinds of mojibake in the same screen. Here Japanese Shift-JIS characters are presented in the OEM (DOS) code page on the left and the ANSI (Windows) code page on the right. Notice that some Japanese-looking characters may display, but there is some evidence of "junk". The Japanese-looking characters are meaningless character sequences, like typing "xyzzydke" on your keyboard in English.
Here is a picture of some tofu to get us started:

Consider some input that looks like this: 文字化け€àáßèëœ
If we perform a conversion from Unicode to a legacy encoding then several of the above items may occur. Proper conversions of this type result in individual characters being converted to individual question marks. The question mark is a substitution character. Basically it means: this character has no mapping in the target encoding.
Latin-1 (ISO8859-1) ?????àáßèë?
ISO8859-15 (Euro support) ????€àáßèëœ
Shift-JIS (a Japanese character set) 文字化け???????
Mojibake reading UTF-8 as Latin-1: æ–‡å—化ã‘€à áßèëœ
Mojibake reading SJIS as Latin-1: “ú–{Œê•¶Žš‰»‚¯
It goes both ways: reading Latin-1 as SJIS: ????、珮゚齏ス
It goes both ways: reading UTF-8 as SJIS: 譁�蟄怜喧縺鯛ぎテ�テ。テ淌ィテォナ�
The last example shows how it can be hard to tell. Most of the characters are at least Japanese-looking. If you don't speak Japanese, how can you tell? You should keep track of the characters being used for testing (their count and appearance) so that you can be sure that what you have isn't mojibake.
It is worth noting that you can see some funky garbage in the page (the Unicode replacement character U+FFFD) and that there are sometimes runs of characters with a lot of punctuation. But don't rely on that. Use known sets of characters to ensure that your testing gets what you sent. Mojibake errors sometimes appear only at the first or last character on a line or in a control (where a single multibyte value is getting munged).
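The substitution and mojibake effects described above are easy to reproduce. The following Python sketch uses the standard codecs to convert the sample input to three legacy encodings and to misread UTF-8 bytes as Latin-1. The helper `looks_degraded` is a hypothetical name for a simple check a test harness might apply; it is a heuristic, not a complete detector.

```python
# The sample input from the text, written with escapes so the code points are unambiguous:
# four Japanese characters, the euro sign, five Latin-1 letters, and the oe ligature.
sample = "\u6587\u5b57\u5316\u3051\u20ac\u00e0\u00e1\u00df\u00e8\u00eb\u0153"

# Proper conversion: each unmappable character becomes one question mark.
for codec in ("latin-1", "iso8859_15", "shift_jis"):
    converted = sample.encode(codec, errors="replace").decode(codec)
    print(f"{codec:12} {converted}")

# Mojibake: the bytes are intact but are decoded with the wrong encoding.
mojibake = sample.encode("utf-8").decode("latin-1")
print("UTF-8 read as Latin-1:", repr(mojibake))

# Because Latin-1 maps every byte, this particular damage is reversible.
repaired = mojibake.encode("latin-1").decode("utf-8")


def looks_degraded(expected: str, actual: str) -> bool:
    """Flag a changed length, U+FFFD, or question marks that were never sent."""
    return (len(actual) != len(expected)
            or "\ufffd" in actual
            or actual.count("?") > expected.count("?"))


print(looks_degraded(sample, sample))
print(looks_degraded(sample, mojibake))
```

Comparing against the exact data you sent, as `looks_degraded` does, is more reliable than judging the output by eye.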
Test standard Western European characters (the ISO 8859-1 repertoire). This includes accented Latin-script letters in the Unicode range U+00A0 through U+00FF.
Test for Euro and Windows-1252 support. Latin-1 doesn't include the Euro symbol (U+20AC). Microsoft also includes 26 other characters in the C1 control character range, including slightly less common French characters such as the single guillemet quotes and the oe ligatures.
| Byte | Unicode | Character name |
|---|---|---|
| 0x80 | U+20AC | EURO SIGN |
| 0x82 | U+201A | SINGLE LOW-9 QUOTATION MARK |
| 0x83 | U+0192 | LATIN SMALL LETTER F WITH HOOK |
| 0x84 | U+201E | DOUBLE LOW-9 QUOTATION MARK |
| 0x85 | U+2026 | HORIZONTAL ELLIPSIS |
| 0x86 | U+2020 | DAGGER |
| 0x87 | U+2021 | DOUBLE DAGGER |
| 0x88 | U+02C6 | MODIFIER LETTER CIRCUMFLEX ACCENT |
| 0x89 | U+2030 | PER MILLE SIGN |
| 0x8A | U+0160 | LATIN CAPITAL LETTER S WITH CARON |
| 0x8B | U+2039 | SINGLE LEFT-POINTING ANGLE QUOTATION MARK |
| 0x8C | U+0152 | LATIN CAPITAL LIGATURE OE |
| 0x8E | U+017D | LATIN CAPITAL LETTER Z WITH CARON |
| 0x91 | U+2018 | LEFT SINGLE QUOTATION MARK |
| 0x92 | U+2019 | RIGHT SINGLE QUOTATION MARK |
| 0x93 | U+201C | LEFT DOUBLE QUOTATION MARK |
| 0x94 | U+201D | RIGHT DOUBLE QUOTATION MARK |
| 0x95 | U+2022 | BULLET |
| 0x96 | U+2013 | EN DASH |
| 0x97 | U+2014 | EM DASH |
| 0x98 | U+02DC | SMALL TILDE |
| 0x99 | U+2122 | TRADE MARK SIGN |
| 0x9A | U+0161 | LATIN SMALL LETTER S WITH CARON |
| 0x9B | U+203A | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK |
| 0x9C | U+0153 | LATIN SMALL LIGATURE OE |
| 0x9E | U+017E | LATIN SMALL LETTER Z WITH CARON |
| 0x9F | U+0178 | LATIN CAPITAL LETTER Y WITH DIAERESIS |
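A list like the one above can be regenerated, and turned into test data, directly from a codec. This Python sketch assumes Python's built-in `cp1252` codec as the reference mapping; the function name is hypothetical.

```python
import unicodedata


def cp1252_c1_mappings():
    """Yield (byte, character) for each byte 0x80-0x9F that Windows-1252 defines."""
    for byte in range(0x80, 0xA0):
        try:
            char = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # Python's cp1252 codec treats this byte as undefined
        yield byte, char


mappings = list(cp1252_c1_mappings())
for byte, char in mappings:
    print(f"0x{byte:02X}  U+{ord(char):04X}  {unicodedata.name(char)}")

# None of these characters exists in ISO 8859-1, so a Latin-1 conversion loses all of them.
test_string = "".join(char for _, char in mappings)
lost_in_latin1 = test_string.encode("latin-1", errors="replace")
print(lost_in_latin1)
```

Passing `test_string` through the system under test is a quick way to find a component that silently assumes Latin-1.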
Test Far East Asian characters primarily using Japanese. Japanese writing consists of four scripts used together:
There are various ranges of characters to test when working with Japanese. First there are "width" distinctions in the text. There are characters whose underlying representation in legacy (non-Unicode) encodings is two bytes long. These are called zenkaku or "wide" characters. The opposite of these are characters whose underlying representation takes one byte. These are called hankaku or "narrow" characters.
Zenkaku characters include kana of both types. There is also a set of compatibility characters that represent the ASCII range. For example: ＡＢＣＤＥＦＧ as well as かたかなひらがな感じ①② etc.
Hankaku characters usually refer to single-byte katakana characters (although ASCII can be referred to as hankaku). For example: ｶﾀｶﾅ. Notice that the characters actually appear to be narrow!
One common use for hankaku characters is in menu items to save space.
In EUC-JP (an encoding commonly used on Unix), the narrow katakana characters are not single-byte: each one takes two bytes (a 0x8E prefix followed by the character byte). Be aware of configurations that filter out hankaku characters or convert them to their wide equivalent.
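The width distinction is visible in Unicode character properties, which makes it easy to check what a system did to hankaku input. The following Python sketch uses only the standard `unicodedata` module; NFKC normalization stands in here for "a configuration that converts narrow characters to their wide equivalent", which is an assumption about how such a conversion might be implemented.

```python
import unicodedata

hankaku = "\uff76\uff80\uff76\uff85"  # "katakana" written in halfwidth katakana
zenkaku = "\u30ab\u30bf\u30ab\u30ca"  # the same word in fullwidth katakana


def widths(text):
    """Return the East Asian Width property of each character."""
    return [unicodedata.east_asian_width(ch) for ch in text]


print(widths(hankaku))     # halfwidth characters report "H"
print(widths(zenkaku))     # wide characters report "W"
print(widths("A\uff21"))   # ASCII A is "Na" (narrow), fullwidth A is "F"

# In Shift-JIS the narrow forms take one byte each and the wide forms take two.
print(len(hankaku.encode("shift_jis")), len(zenkaku.encode("shift_jis")))

# Compatibility normalization folds the narrow forms into the wide ones.
folded = unicodedata.normalize("NFKC", hankaku)
print(folded == zenkaku)
```

If data that went in as `hankaku` comes back equal to `zenkaku`, some layer has applied a width conversion, and you need to decide whether that is acceptable for the product.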
Japanese also features a few combining marks called "kuten". Some characters are "pre-composed" (that is, a single Unicode character incorporates the kuten mark) and others are distinct. The kuten must be handled correctly. Here are examples of both:
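The difference between the precomposed and combining forms can also be checked in code. This Python sketch uses the hiragana letter "ga" as an illustrative case (the choice of letter is an assumption of the example) and shows that the combining mark U+3099 composes under NFC while the spacing mark U+309B does not.

```python
import unicodedata

precomposed = "\u304c"       # HIRAGANA LETTER GA, one code point
combining = "\u304b\u3099"   # KA followed by the combining voiced sound mark
spacing = "\u304b\u309b"     # KA followed by the spacing voiced sound mark

for label, text in (("precomposed", precomposed),
                    ("combining", combining),
                    ("spacing", spacing)):
    names = [unicodedata.name(ch) for ch in text]
    print(f"{label:12} {len(text)} code point(s): {names}")

# NFC composes the combining sequence into the precomposed letter...
composed = unicodedata.normalize("NFC", combining)
# ...but leaves the spacing mark alone, so that sequence stays two code points.
spacing_nfc = unicodedata.normalize("NFC", spacing)
print(composed == precomposed, spacing_nfc == precomposed)
```

A system that compares or searches text should treat the first two forms as equal; whether it does is a good test case.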
If you process data using legacy encodings, then you should look for famous tests of trailing byte issues, such as a trailing byte of 0x5C (backslash). There is a list of Unicode code points that in Shift-JIS have a trailing byte of 0x5C later in this document. A sequence that tests this can be easily typed. The Japanese word for "source code" is "so-su" written in katakana. Type "so-su" and you've typed a sequence with a 0x5C trailing byte (like this: ソース).
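The trailing-byte problem can be demonstrated in a few lines. The Python sketch below uses the built-in `shift_jis` codec; the helper name `chars_with_trail_byte` is hypothetical. It shows the bytes behind "so-su" and what happens when byte-oriented code treats 0x5C as a backslash.

```python
def chars_with_trail_byte(text, trail=0x5C, encoding="shift_jis"):
    """Return the characters whose two-byte encoding ends in the given byte."""
    hits = []
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) == 2 and encoded[1] == trail:
            hits.append(ch)
    return hits


word = "\u30bd\u30fc\u30b9"  # "so-su" in katakana
encoded_word = word.encode("shift_jis")
print(encoded_word.hex(" "))          # the second byte is 5c
print(chars_with_trail_byte(word))

# Byte-oriented code that splits on backslash cuts the first character in half.
pieces = encoded_word.split(b"\\")
print(pieces)
```

The same helper can be run over a larger character list to build a data set of every character that triggers the problem.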
For more information on keyboarding, see Learn to Type Japanese (and other languages).
On Far East Asian operating systems, the code position of the backslash (0x5C, Unicode U+005C) is often used in the local legacy variant of ASCII (or, more properly, ISO 646) to represent the local currency symbol. When Japanese, Chinese, or Korean users type a "\" they expect to see their local currency symbol (the yen sign U+00A5 "¥", the yuan sign, or the won sign U+20A9 "₩" respectively). Unicode has separate code points for these characters and it is important to differentiate the various applications of each character. For example, in Windows path names the currency symbol should be shown in these locales, but in data you want to show the real Unicode value.
Certain characters are mapped differently by different vendors. In Japanese, for example, there are several characters that Microsoft Windows (code page 932) maps differently to/from Unicode than other systems (such as Oracle JA16SJIS or Solaris PCK versions of Shift-JIS) do. Here is a table of the mappings:
| JIS X 0208 | Not Windows | Windows |
|---|---|---|
| 0x2141 ('~') | U+301C (Wave Dash) | U+FF5E (Fullwidth Tilde) |
| 0x2142 ('∥') | U+2016 (Double Vertical Line) | U+2225 (Parallel To) |
| 0x215d ('-') | U+2212 (Minus Sign) | U+FF0D (Fullwidth Hyphen-Minus) |
| 0x224c ('¬') | U+00AC (Not Sign) | U+FFE2 (Fullwidth Not Sign) |
These four characters can result in a Unicode program losing data. For example, if you convert the same byte sequence from Shift-JIS to Windows Unicode for the "wave dash" character, you'll get U+FF5E. If you store U+FF5E in an Oracle database, you'll get a question mark (bad conversion) because Oracle doesn't map U+FF5E to wave dash.
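The data loss described above can be reproduced without a database. In this Python sketch the built-in `cp932` codec plays the role of the Windows mapping and the built-in `shift_jis` codec plays the role of a non-Windows mapping. That pairing is an assumption of the example: it does not exercise Oracle or Solaris themselves. The byte pairs are the Shift-JIS forms of the four JIS codes in the table.

```python
# Shift-JIS byte pairs for JIS 0x2141, 0x2142, 0x215D, and 0x224C.
PAIRS = [b"\x81\x60", b"\x81\x61", b"\x81\x7c", b"\x81\xca"]

for raw in PAIRS:
    generic = raw.decode("shift_jis")  # stand-in for a non-Windows mapping
    windows = raw.decode("cp932")      # Windows code page 932
    print(raw.hex(), f"U+{ord(generic):04X}", f"U+{ord(windows):04X}")

# The wave dash as Windows sees it, then written through the other mapping.
wave_via_windows = b"\x81\x60".decode("cp932")
wave_via_generic = b"\x81\x60".decode("shift_jis")
stored = wave_via_windows.encode("shift_jis", errors="replace")
print(stored)  # the character did not survive the mismatch
```

A test that sends these four byte pairs through every conversion boundary in the product will show quickly whether two components disagree about the mapping.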
Bidirectional languages are those languages customarily written from right-to-left, such as Arabic and Hebrew.
Complex scripts are those whose characters change shape or are composed contextually. Complex scripts include Arabic, but also the Indic scripts (Hindi, Gujarati, Kannada, Bengali, Gurmukhi, etc.), related scripts such as Thai, and a few oddities such as Vietnamese. Note that this last language is written in the Latin script!
Although software doesn't always provide full support for bidi display (such as reversing the screen layout or putting the scroll bar on the left side), support for these types of scripts is still important for internationalized products. The product should support correct Unicode Bidirectional Algorithm (UBA) display of bidi text and the bidi control characters (such as the RLM and LRM characters). Complex scripts should be displayed correctly and not broken by poor coding choices.
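When building bidi test strings it helps to know how each character is classified, because the bidirectional algorithm works from those classes. The Python sketch below only reports the Unicode bidi class of each run of characters using the standard `unicodedata` module. It does not implement the UBA itself, and the function name `bidi_runs` is hypothetical.

```python
import unicodedata
from itertools import groupby


def bidi_runs(text):
    """Group consecutive characters that share a Unicode bidi class."""
    return [(cls, "".join(chars))
            for cls, chars in groupby(text, key=unicodedata.bidirectional)]


mixed = "abc\u05d0\u05d1\u05d2\u05d3abc"         # Latin, Hebrew, Latin
with_digits = "\u05d0\u05d1 012 \u05d0\u05d1"    # Hebrew, digits, Hebrew

for sample in (mixed, with_digits):
    print([cls for cls, _ in bidi_runs(sample)])

# The bidi control characters are strong characters with no visible glyph.
LRM, RLM = "\u200e", "\u200f"
print(unicodedata.bidirectional(LRM), unicodedata.bidirectional(RLM))
```

Strings that mix strong (`L`, `R`, `AL`), weak (`EN`, `AN`), and neutral (`WS`) classes are the ones most likely to expose display problems.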
A good source for text to use in tests is the W3C I18N GEO page located here: GEO Tests. In many of the pages there are graphics of what the text should look like, followed by content you can copy (see especially the bidi and whitespace tests).
Here are some examples of complex script texts that you can use to validate display and text handling:

A good source for texts in multiple languages (some of the above are taken from there) is: BBC World Service
Unicode offers an array of additional complexities that must be tested. These include:
U+FFFF. In UTF-16 these are represented by a pair of surrogate code units. Support for these characters really isn't optional, especially given that there are more than 30,000 of these characters associated with the Chinese writing system and required for full support of GB18030. You may not be able to display these characters in a Windows context, but code should handle them correctly and not damage or destroy the values. Ideally supplemental characters are treated as a single character visually, even if the internal representation uses UTF-16. These types of characters need to be tested even though fonts may be difficult to come by. The display of tofu is reasonable in this case, as long as it is the right amount of tofu (one block, not two, for each supplemental character).
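The arithmetic behind surrogate pairs, and the resulting sizes in different encodings, can be sketched in Python as follows. The function name `surrogate_pair` is hypothetical. The CESU-8 form is produced here by encoding each surrogate separately with the `surrogatepass` error handler, which is a convenient way to build test bytes rather than a dedicated CESU-8 codec.

```python
def surrogate_pair(code_point):
    """Return the UTF-16 (high, low) surrogates for a supplementary code point."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


ch = "\U00010000"  # the first supplementary character
high, low = surrogate_pair(ord(ch))
print(f"U+{ord(ch):04X} -> U+{high:04X} U+{low:04X}")

utf16 = ch.encode("utf-16-be")   # two 16-bit code units
utf8 = ch.encode("utf-8")        # one four-byte sequence
cesu8 = (chr(high) + chr(low)).encode("utf-8", errors="surrogatepass")  # two three-byte sequences

print(len(utf16), "bytes in UTF-16")
print(len(utf8), "bytes in UTF-8")
print(len(cesu8), "bytes in CESU-8")
```

The length differences matter for size limits: a field that holds the UTF-8 form may be too small for the CESU-8 form of the same character.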
To fully test whether a system "supports Unicode", you need to test the different kinds of rendering and processing that can occur. The things you need to test include selection of text; cursor movement in the text; line and word breaking; storage and retrieval; upper- and lowercasing of text; comparison of text; and so forth. The following dataset is designed to exercise these capabilities. Some of the tests cases described later will refer to a specific test in this table.
| Test | # | Characters | UTF-16 Character Values | Comments |
|---|---|---|---|---|
| Supplemental Characters | 1 | 𐀀𐀁 | U+D800 U+DC00 U+D800 U+DC01 | The first two supplemental characters as surrogate pairs. In UTF-16 this is four 16-bit code units ("characters"). In UTF-8 this is two four-byte characters (corresponds to Oracle encoding "AL32UTF8"). In CESU-8 this is four three-byte characters (corresponds to Oracle encoding "UTF8"). |
| Combining Marks and Accents | 2 | àéîōũ | a U+0300 e U+0301 i U+0302 o U+0304 u U+0303 | Combining marks for vowels (this example is not realistic) |
| | 3 | você nós mãe avô irmã criança | voc U+00EA n U+00F3 s m U+00E3 e av U+00F4 irm U+00E3 crian U+00E7 a | Portuguese (DOS 860 test) (words for: you, we, mother, grandfather, sister, child) |
| | 4 | €ŒœŠš™©‰ƒ | U+20AC U+0152 U+0153 U+0160 U+0161 U+2122 U+00A9 U+2030 U+0192 | Windows-1252 test (only one of these is a Latin-1 character: the copyright sign) |
| | 5 | ışık bir İyi Günler | U+0131 U+015F U+0131 U+006B U+0062 U+0069 U+0072 U+0130 U+0079 U+0069 | Turkish (dotted and dotless letter "i") ("light", "one", "good day"). NB: Dotted lowercase i uppercases to U+0130 (Capital I with dot above), while uppercase I lowercases to dotless lowercase i (U+0131) |
| | 6 | がざばだぱか゛さ゛た゛は゜ | U+309B U+309C | dakuten and handakuten marks (see note 1): both precomposed and combining forms |
| | 7 | אִ͏ַ | U+05D0 U+05B4 U+034F U+05B7 | Combining Grapheme Joiner character (the sequence is from the Unicode CGJ FAQ) |
| | 8 | 𣎴≠q̌ | U+D84C U+DFB4 U+2260 U+0071 U+030C | Supplemental, plus combining marks (from CharMod Section 6.1) |
| Bidi with Latin | 9 | abcאבגדabc | abc U+05D0 U+05D1 U+05D2 U+05D3 abc | left-right-left |
| | 10 | אבגדabcאבגד | U+05D0 U+05D1 U+05D2 U+05D3 abc U+05D0 U+05D1 U+05D2 U+05D3 | right-left-right |
| | 11 | אבגד 012 אבגד | U+05D0 U+05D1 U+05D2 U+05D3 012 U+05D0 U+05D1 U+05D2 U+05D3 | right-weak-right |
| | 12 | 012 אבגד 012 | 012 U+05D0 U+05D1 U+05D2 U+05D3 012 | weak-right-weak |
| Bidi with Asian | 13 | אבגדソースאבגד | U+05D0 U+05D1 U+05D2 U+05D3 U+30BD U+30FC U+30B9 U+05D0 U+05D1 U+05D2 U+05D3 | right-left-right |
| | 14 | ソースאבגדそーす | U+30BD U+30FC U+30B9 U+05D0 U+05D1 U+05D2 U+05D3 U+305D U+30FC U+3059 | left-right-left |
| | 15 | | | |
| Complex Scripts | 16 | สวัสดี | U+0E2A U+0E27 U+0E31 U+0E2A U+0E14 U+0E35 | Thai (greeting) |
| | 17 | டாஹ்கோ | U+0B9F U+0BBE U+0BB9 U+0BCD U+0B95 U+0BCB | Tamil (from CharMod Appendix B) |
| | 18 | بِسْمِ اللّهِ الرَّحْمـَنِ الرَّحِيمِ | U+0628 U+0650 U+0633 U+0652 U+0645 U+0650 U+0627 U+0644 U+0644 U+0651 U+0647 U+0650 U+0627 U+0644 U+0631 U+0651 U+064E U+062D U+0652 U+0645 U+0640 U+064E U+0646 U+0650 U+0627 U+0644 U+0631 U+0651 U+064E U+062D U+0650 U+064A U+0645 U+0650 | Arabic (first line of the Qur'an). This text has the vowels in it and demonstrates the full complexity of Arabic text. |
| Numeric Shaping | 19 | 01234 | | |
| | 20 | عدد مارس ١٩٩٨ | U+0639 U+062F U+062F U+0645 U+0627 U+0631 U+0633 U+0661 U+0669 U+0669 U+0668 | Arabic (from CharMod) "1998" at the end (Yes: the end. Remember that Arabic is read right-to-left) |
| Bidi Controls and Mirroring | 21 | | | |
| | 22 | | | |
| Private Use | 23 | | U+E000 U+E001 U+E002 U+E003 U+E004 | |
| Common Scripts and Encodings | 24 | Слава Жанна Ювеналий Ярополк | | Cyrillic (Russian, Ukrainian, Serbian, and others). Russian names. |
| | 25 | | | Greek |
| | 26 | | | Latin-1 |
| | 27 | | | Latin-2 |
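Two of the processing checks named above, comparison and case conversion, can be tried directly on data sets 2 and 5. The following Python sketch uses only the standard library. Note that Python's `str.upper` and `str.lower` are not locale-aware, so the Turkish results below show what a locale-insensitive implementation produces, not the behavior a Turkish user expects.

```python
import unicodedata

# Data set 2: base letters followed by combining marks.
decomposed = "a\u0300e\u0301i\u0302o\u0304u\u0303"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), "code points decomposed,", len(composed), "composed")
print("equal without normalization:", decomposed == composed)
print("equal after NFC:",
      unicodedata.normalize("NFC", decomposed) == unicodedata.normalize("NFC", composed))

# Data set 5: Turkish dotted and dotless i under locale-insensitive casing.
upper_of_dotted_i = "i".upper()        # "I", where Turkish expects U+0130
lower_of_capital_dotted = "\u0130".lower()  # "i" plus U+0307 COMBINING DOT ABOVE
upper_of_dotless = "\u0131".upper()    # "I"

for label, value in (("upper(i)", upper_of_dotted_i),
                     ("lower(U+0130)", lower_of_capital_dotted),
                     ("upper(U+0131)", upper_of_dotless)):
    print(label, [f"U+{ord(ch):04X}" for ch in value])
```

A product that claims Turkish support should be tested to confirm it does not behave like the second half of this sketch.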
Use data sets: 1-18
Display the text. Select the text using the mouse from both the left and right sides. The text should be selected one glyph at a time (it shouldn't jump around or allow partial character selection, you shouldn't see any tofu jump into view, etc.). Bidi text should select in logical order (that is, for texts such as #11, you should be able to have two separate regions selected at the same time). For an example of discontiguous visual selection see: [CharMod] Section 3.3.1.
[CharMod] Character Model for the World Wide Web 1.0: Fundamentals
Roman Czyborra's Alphabet Soup shows various encodings and offers text downloads with Unicode code points enumerated.
Omniglot gives examples of different writing systems and explains them.
Note 1: kuten, dakuten, handakuten: Japanese tone modifying marks. These are usually precomposed and are only used with kana (either katakana or hiragana). Here we present the unprecomposed (combining) forms as a test. For a more thorough explanation, including how to type these naturally in the IME, see Biography.ms.