International Testing Basics

Testing non-English and non-ASCII (and/or Unicode) support in a product requires tests and test plans that exercise the edge cases in the software. This means using collections of characters and formats known to cause problems and which are engineered to demonstrate that the product is working correctly. This document contains a number of useful small data sets (with reasons for each) for use when performing this kind of testing.

Matrix Planning

Before testing can begin, the test matrix needs to be planned. A typical test matrix will include platforms and versions for all of the components that are necessary to the application and which will be supported in production. Each time there can be two or more items of a particular "type", the matrix expands by one dimension.

For example, if you are testing an application that runs on Windows, AIX, HP-UX, Solaris, and Linux, then you have one dimension with five entries. You then need to enumerate versions (do you mean Solaris 5.7, 5.8, 5.9? some combination?). If the application uses a database, then that is a second dimension. A browser might be a third. And so on.

Internationalization testing will typically add these dimensions to the test matrix:

Locale
The specific regional option settings for the operating system or component. On Windows this is set via the Regional Options control panel. On the UNIXen this is set via the LANG environment variable (or via the GUI login). Databases and other server type components may have their own locale model which must be considered separately. For example, Oracle has NLS_LANG.
Character Encoding
The character encoding applies to every string or text resource: text files, databases, URLs, XML documents, and so on all have an encoding. The default encoding for a configuration is usually determined by the runtime locale, but it is rarely the only encoding that matters in a given configuration. Testing encoding support means more than just using that encoding: you have to have data in that encoding that explores the boundaries of the encoding. In other words, if you are using ISO 8859-1 (Latin-1) to test your software, use accented letters, not just the ASCII letters. (A short sketch after this list shows how the locale and default encoding surface at runtime.)
Time Zone
Although time zone differences are not necessarily "internationalization" issues, they do vary based on geography. Not checking for time zone support can be a source of runtime errors, so plan this into the test matrix as appropriate. For more information see: It's About Time.
Localization
Some components, such as browsers, operating systems, databases, and so forth, provide more than just locale settings. They are provided with localized (translated) user interfaces. Localized versions of a product may behave differently or expose bad assumptions in code that interfaces with them. For example, when Windows 95 shipped, many programs hardcoded the path "Program Files" in their installation, not realizing that this directory had a different name in various localized versions of Windows. Some products provide global binaries (that is, the same code and configuration is shipped globally), but what this means may vary by product.
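As a minimal illustration of how the settings above change runtime behavior, here is a hedged Java sketch (the locales chosen are illustrative, not from the original): the same number and date format differently per locale, while the default locale, charset, and time zone are picked up from the environment (LANG, Regional Options, and so on).

    import java.nio.charset.Charset;
    import java.text.DateFormat;
    import java.text.NumberFormat;
    import java.util.Date;
    import java.util.Locale;
    import java.util.TimeZone;

    public class LocaleProbe {
        public static void main(String[] args) {
            // Report what the runtime picked up from the environment
            System.out.println("Default locale:    " + Locale.getDefault());
            System.out.println("Default charset:   " + Charset.defaultCharset());
            System.out.println("Default time zone: " + TimeZone.getDefault().getID());

            // The same values render differently depending on locale
            double amount = 1234567.89;
            Date now = new Date();
            for (Locale loc : new Locale[] { Locale.US, Locale.FRANCE, Locale.JAPAN }) {
                String num = NumberFormat.getNumberInstance(loc).format(amount);
                String date = DateFormat.getDateInstance(DateFormat.LONG, loc).format(now);
                System.out.println(loc + ": " + num + " | " + date);
            }
        }
    }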

Notice that each of these "matrix expanding items" can be separately applied to each matrix component. In other words, a matrix entry might be "Solaris 5.9 in the ja_JP.PCK locale" or "French Windows 2000 SP4". And in the same matrix you might have an "Oracle 9.1.2, AL32UTF8 encoding, AMERICAN_AMERICA regional settings, America/New_York time zone".

For "single-box" testing (in which all of the components are hosted on the same system), this may mean merely a bigger matrix. For client-server or client-server-resource configurations this may be much more complex. You can't test every locale and encoding with every component in every combination. You'll have to prune the matrix to make it more manageable.

Non-English configurations prove that the underlying code is locale-sensitive and can perform normally in different language environments. Generally a mix of European and Far East Asian locales can provide basic assurance that the code will handle these cases, although full platform certification is justified for important markets.

Non-ASCII data proves that the code can correctly process data that is not restricted to ASCII characters. Non-ASCII data includes handling of different character encodings (sometimes erroneously referred to as charsets) as well as specific collections of Unicode data. The edge cases for testing a system's processing capabilities are almost always outside the ASCII repertoire of characters. You should use non-ASCII data in all of your mainline testing, regardless of system configuration. That is: this kind of testing does not require a non-English configuration.

Ultimately, using non-English configurations and non-ASCII data should come as early in the development, regression, and test cycles as possible, since internationalization problems will turn up sooner. "Storing the pain" until late in the cycle is a recipe for missing a release or compromising on quality for international customers.

Types of Testing

There are different types of internationalization testing as well. Development teams are sometimes confused by the different ways in which a system can be tested and have coverage gaps as a result. The different types of testing are not blanket processes (although they can be). Generally, each of the considerations listed below should be evaluated for each new product feature to see whether it applies, and test cases should be written with these issues in mind:

  1. Localizability Testing. This is the easiest to understand and perform. If the product team has externalized the strings and messages in a product, pseudo-translate the strings into a "fake French" or "fake Japanese" and run the product in the resulting locale. Any purely English strings are bugs. In addition, any functionality problems will generally be the result of having translated a string that must remain in English.
  2. NLS or Enabling Testing. Test the product (with an English user interface and messages) on a non-English configured system or systems. The tests performed under this type of scenario include checking whether numbers, dates, times, lists, names, currencies, and so forth are correctly formatted or displayed for the given locale. Input of these types of locale-affected data should produce the correct results for the locale. Processing should be correct. And so forth.
  3. Encoding and Character Handling. Test the product for support of non-ASCII values. This includes: input and output of text files or textual data; processing and storage of data internally; display of non-ASCII values; support for encoding conversions; and so forth.
  4. Cross-Time Zone Testing. Test the product for support of multiple time zones, calendars, and other date-and-time related behavior.
  5. Localization Testing. Check that the translation of the product is appropriate and correct and that the product still functions correctly following the localization (translation) process.

Building Test Cases

Building internationalized test cases is not much different than building normal test cases.

In testing, you need to find ways to stress the system's weak spots. International testing focuses on those things that change in international operation.

In the sections below you will see tests and data that involve a wide array of Unicode characters. It is important not to focus exclusively on a specific locale, script, or writing system when performing testing. Code must execute correctly in any configuration and must process and display data correctly. Although it is tempting to ignore the bidi and complex script examples, these are actually some of the most important to test in any configuration.

In writing your test cases, it is generally best to include the international configuration information in every test, instead of creating segregated sets of tests for international configurations. If you automate your regression suite or smoke tests, you should execute the tests on many (or every) language/locale/encoding configuration to get broad coverage.
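One way to fold the international configurations into an automated run is to loop the same assertions over several locale and time zone settings rather than maintaining separate suites. A minimal hedged sketch (Java assumed; the runSmokeTests entry point and the particular locales are hypothetical):

    import java.util.Locale;
    import java.util.TimeZone;

    public class MatrixRunner {
        public static void main(String[] args) {
            Locale[] locales = { Locale.US, Locale.FRANCE, Locale.JAPAN };
            String[] zones = { "America/New_York", "Asia/Tokyo" };
            for (Locale loc : locales) {
                for (String zone : zones) {
                    Locale.setDefault(loc);
                    TimeZone.setDefault(TimeZone.getTimeZone(zone));
                    runSmokeTests(loc, zone); // placeholder for the existing regression entry point
                }
            }
        }

        private static void runSmokeTests(Locale loc, String zone) {
            System.out.println("Running smoke tests under " + loc + " / " + zone);
            // ... invoke the real test suite here ...
        }
    }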

What does the specification look like? Here is a specification for one project prior to international testing:

Platform                            | Oracle 10g R1 | RAC 10g R1
Solaris 2.8                         | X             | N/A
Solaris 2.9                         | X             | N/A
HP-UX 11.11 (11i)                   | X             | N/A
HP-UX 11i v2 on Itanium             | X             | N/A
HP-UX 11i v2 on PA-RISC             | N/A           | N/A
AIX 5.2 (5L)                        | X             | X
AIX 5.3                             | N/A           | N/A
Tru64 5.1b                          | X             | X
Windows 2000 Server                 | X             | N/A
Windows 2003 Server (32 bit only)   | X             | N/A
RHAS 2.1                            | N/A           | N/A
RHAS 3.0                            | X             | N/A

That's a total of eleven configurations tested covering just one encoding/locale combination (U.S. English with US-ASCII). Here's how the same matrix might look with international testing integrated:

Platform                            | Oracle 10g R1      | RAC 10g R1
Solaris 2.8                         | ja PCK, AL32UTF8   | N/A
Solaris 2.9                         | ja.EUC, JA16EUC    | N/A
HP-UX 11.11 (11i)                   | en-US, 8859P1      | N/A
HP-UX 11i v2 on Itanium             | en-US, AL32UTF8    | N/A
HP-UX 11i v2 on PA-RISC             | N/A                | N/A
AIX 5.2 (5L)                        | ja                 | en
AIX 5.3                             | N/A                | N/A
Tru64 5.1b                          | en                 | en
Windows 2000 Server                 | en, fr             | N/A
Windows 2003 Server (32 bit only)   | en                 | N/A
RHAS 2.1                            | N/A                | N/A
RHAS 3.0                            | en, ja             | N/A

That's a total of thirteen configurations to test covering six different encoding/locale combinations (three languages, four encodings). With only two additional total configurations you get six times the international coverage.

You will also have to develop more specific configuration documentation, data sets, and other materials suitable for testing the specific configurations you choose to expand your matrix with. Here are some examples of configuration descriptions:

CONFIGURATION A:
OS: Windows 2000
Language: French
Locale: French/France
Native Encoding: (code page 1252)

Install With:
Oracle 9i with NLS_LANG=FRENCH_FRANCE.WE8ISO8859P15
- test data contains all values 0x80->0xFF

Test Goal: encoding handling
Test Goal: date, time, timestamp, timestamp with time zone database types 
Test Goal: number types 

CONFIGURATION B:
OS: Solaris 5.9
Language: Japanese
Locale: ja_JP.EUCJP
Native Encoding: EUC-JP

Install With: 
Oracle 10g with NLS_LANG=JAPANESE_JAPAN.JA16SJIS

Test Goal: Japanese platform certification

CONFIGURATION C:
OS: Windows XP SP2
Language: U.S. English
Locale: English/United States
Native Encoding: (code page 1252)

Install With:
Oracle 8.1.7 with NLS_LANG=AMERICAN_AMERICA.AL32UTF8

Test Goal: Unicode handling, encoding handling

CONFIGURATION D:
Machine 1
--
OS: Solaris 5.8
Language: English
Locale: en_US.UTF-8
Native Encoding: UTF-8
Time Zone: America/New_York (GMT-05:00)

Machine 2:
--
OS: Solaris 5.8
Language: English
Locale: en_US.UTF-8
Native Encoding: UTF-8
Time Zone: America/Los_Angeles (GMT-08:00)

Test Goal: Cross time zone handling
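As a hedged illustration of what Configuration D should verify (Java assumed): the same instant formatted under each machine's zone yields different wall-clock text, but data written on Machine 1 and read on Machine 2 must still refer to the same point in time.

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    public class TimeZoneCheck {
        public static void main(String[] args) {
            Date instant = new Date(); // one absolute point in time
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss zzz");

            fmt.setTimeZone(TimeZone.getTimeZone("America/New_York"));
            System.out.println("Machine 1 view: " + fmt.format(instant));

            fmt.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"));
            System.out.println("Machine 2 view: " + fmt.format(instant));
            // The wall-clock strings differ by three hours, but both represent the same instant.
        }
    }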

Basic Tests Using Non-ASCII Data

The basic kinds of non-ASCII data include:

  1. Basic non-ASCII collections (Western European, Far East Asian repertoires)
  2. Combining Marks
  3. Supplemental characters and surrogate pairs
  4. Complex script and bidi handling
  5. Encoding transformations

There are several ways to assemble data sets that use non-ASCII values.

Specific Encoding Handling Tests

First, you'll want to create specific tests for non-ASCII handling.

If you are testing with a specific encoding (such as Latin-1, Shift-JIS, etc.), then you need to create data that represents the full range of characters in that encoding. This includes provoking "state shifts" in stateful encodings.

You'll also want to identify size limitations and test these as well. This kind of testing focuses on putting non-ASCII values into the system and then checking them immediately after the operation is complete; that is, it specifically tests non-ASCII and encoding handling. For example, if you are testing a product that writes to an Oracle database, you know that a varchar2(30) field can hold 30 bytes. Writing different multibyte values can test whether truncation works correctly.
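A hedged Java sketch of the kind of byte-limit check this implies (the 30-byte limit mirrors the varchar2(30) example above; the helper name is illustrative): measure the encoded length and truncate at a character boundary so multibyte values are never split.

    import java.nio.charset.Charset;

    public class ByteLimitCheck {
        // Truncate a string so its encoded form fits in maxBytes without splitting a character.
        static String truncateToBytes(String s, Charset cs, int maxBytes) {
            String candidate = s;
            while (candidate.getBytes(cs).length > maxBytes) {
                // Step back one code point at a time so surrogate pairs are not split.
                candidate = candidate.substring(0,
                        candidate.offsetByCodePoints(candidate.length(), -1));
            }
            return candidate;
        }

        public static void main(String[] args) {
            String japanese = "ソースコードソースコードソース"; // each character is 3 bytes in UTF-8
            Charset utf8 = Charset.forName("UTF-8");
            System.out.println("Chars: " + japanese.length()
                    + ", UTF-8 bytes: " + japanese.getBytes(utf8).length);
            String fitted = truncateToBytes(japanese, utf8, 30);
            System.out.println("Fits in 30 bytes: \"" + fitted + "\" ("
                    + fitted.getBytes(utf8).length + " bytes)");
        }
    }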

General Non-ASCII Testing

But you'll also want to create non-ASCII data for complete system testing. That is, create non-ASCII data to pass through the whole system during regular testing so that any encoding handling problems in the application will surface as testing proceeds. The most common data used for this is pseudo-translated data. In pseudo-translated data, the values are not selected for their character properties. They are chosen instead to ensure that poor encoding handling, font problems, hard-coded strings, and so on are plainly visible in the application.
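A minimal pseudo-translation sketch, assuming Java; the bracket markers and character substitutions below are illustrative rather than any standard scheme. Accented look-alikes expose encoding and font problems, while the markers make hard-coded or truncated strings stand out.

    public class PseudoTranslate {
        public static String pseudo(String s) {
            StringBuilder out = new StringBuilder("[!!");
            for (char c : s.toCharArray()) {
                switch (c) {
                    case 'a': out.append('\u00E0'); break; // à
                    case 'e': out.append('\u00E9'); break; // é
                    case 'i': out.append('\u00EE'); break; // î
                    case 'o': out.append('\u00F6'); break; // ö
                    case 'u': out.append('\u00FC'); break; // ü
                    default:  out.append(c);
                }
            }
            // Trailing padding makes strings longer, as real translations usually are.
            return out.append(" !!]").toString();
        }

        public static void main(String[] args) {
            System.out.println(pseudo("Please enter your user name"));
            // -> [!!Pléàsé éntér yöür üsér nàmé !!]
            // Any string still shown in plain English at runtime was hard coded.
        }
    }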

It is also important to have values that are used to ensure that a system can handle specific kinds of display issues, such as bidi text, complex scripts, and the like.

Basic Tests using Non-English Configurations

As with encoding testing, you face different goals at different points in the development cycle. True product certification requires a localized product (the real shrink-wrapped local version, not the English version configured to run in that locale). Many products today produce global binaries, in which the real local product is not different in any way from the U.S. English product. But this is a relatively recent development in software localization, and distinguishing which is which can be difficult. In addition, even though the binaries are identical, data distributed with the systems may be different.

The classic example is the difference in Windows between the Administrator account in English and the Administrateur account in French: the account information is the same, but the name varies between language versions. Setting the locale on English Windows won't uncover, as a defect, code that assumes the name of the account is static.

Configuration

Before starting you should install the necessary support for performing testing onto your computer systems. This includes:

Fonts: You should install Unicode fonts that cover the complete repertoire. This includes the Arial Unicode MS font included with MS Office and James Kass's Code2000 and Code2001 fonts.

Keyboards: These allow you to enter non-ASCII data. On Windows install the Japanese IME, the French keyboard and the Chinese-Taiwan Unicode keyboard. This last item allows you to enter Unicode code points by typing their hex value.

Some Terminology

Testing basic non-ASCII support requires some knowledge of proper behavior in a system.

The main thing you are looking for is data degradation. This includes several different symptoms that you should learn to recognize. The basic problems are:

tofu
characters displayed as hollow boxes (that look like squares of tofu, get it?) or solid black blocks. This indicates that the font doesn't have a glyph (picture of the character) available. Important: This is not a processing bug.
question marks
these are displayed when a character encoding conversion has taken place and the target encoding doesn't encode the character. For example, trying to convert Japanese characters to a European encoding. This is sometimes a processing bug.
mojibake
literally "screen garbage", this where junk appears where the (correct) non-ASCII characters should appear. This is always a bug.
Tofu characters in Notepad

Tofu is sometimes what you see when characters have been turned into mojibake: that's because unassigned characters or unusual characters often won't be in your current font. Be sure to distinguish between expected tofu and garbage.

Mojibake example
Two different kinds of mojibake in the same screen. Here Japanese Shift-JIS characters are presented in the OEM (DOS) code page on the left and the ANSI (Windows) code page on the right. Notice that some Japanese-looking characters may display, but there is some evidence of "junk". The Japanese-looking characters are meaningless character sequences, like typing "xyzzydke" on your keyboard in English.

Examples:

Here is a picture of some tofu to get us started:

Tofu on the Unicode website

Consider if we had some input that looked like this: 文字化け€àáßèëœ

If we perform a conversion from Unicode to a legacy encoding, then several of the above items may occur. In a proper conversion of this type, each unmappable character is converted to an individual question mark. The question mark is a substitution character. Basically it means: this character has no mapping in the target encoding.

Latin-1 (ISO8859-1) ?????àáßèë?

ISO8859-15 (Euro support) ????€àáßèëœ

Shift-JIS (a Japanese character set) 文字化け???????

Mojibake reading UTF-8 as Latin-1: 文字化け€à áßèëœ

Mojibake reading SJIS as Latin-1: “ú–{Œê•¶Žš‰»‚¯

It goes both ways: reading Latin-1 as SJIS: ????、珮゚齏ス

It goes both ways: reading UTF-8 as SJIS: 譁�蟄怜喧縺鯛ぎテ�テ。テ淌ィテォナ�

The last example shows how it can be hard to tell. Most of the characters are at least Japanese-looking. If you don't speak Japanese, how can you tell? You should keep track of the characters being used for testing (their count and appearance) so that you can be sure that what you have isn't mojibake.

It is worth noting that you can see some funky garbage in the examples above (the Unicode replacement character U+FFFD) and that there are sometimes runs of characters with a lot of punctuation. But don't rely on that. Use known sets of characters to ensure that your testing gets back what you sent. Common mojibake errors sometimes occur, or are only visible, at the first or last character on a line or in a control (where a single multibyte value is getting munged).
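Both failure modes can be produced on demand, which is useful for training your eye and for building known-answer test data. A hedged Java sketch (standard JDK charset names assumed; the exact garbage you see depends on the charsets involved):

    public class MojibakeDemo {
        public static void main(String[] args) throws Exception {
            String original = "文字化け€àáßèëœ";

            // Legitimate, lossy conversion: unmappable characters become '?'
            byte[] latin1 = original.getBytes("ISO-8859-1");
            System.out.println("To Latin-1 and back: " + new String(latin1, "ISO-8859-1"));

            // Mojibake: correct UTF-8 bytes decoded with the wrong charset
            byte[] utf8 = original.getBytes("UTF-8");
            System.out.println("UTF-8 read as Latin-1: " + new String(utf8, "ISO-8859-1"));

            // Mojibake the other way: Shift-JIS bytes decoded as Latin-1
            byte[] sjis = "日本語文字化け".getBytes("Shift_JIS");
            System.out.println("SJIS read as Latin-1:  " + new String(sjis, "ISO-8859-1"));
        }
    }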

Basic Support Tests

Western European

Test standard Western European characters (the ISO 8859-1 repertoire). This includes accented Latin-script letters in the Unicode range U+00A0 through U+00FF.

Test for Euro and Windows-1252 support. Latin-1 doesn't include the Euro symbol (U+20AC). Windows-1252 also assigns characters to the C1 control range (0x80 through 0x9F), including slightly less common French characters such as the single guillemet quotes and the oe ligature. The full set is listed below (a small decoding sketch follows the table):

=80   U+20AC   EURO SIGN
=82   U+201A   SINGLE LOW-9 QUOTATION MARK
=83   U+0192   LATIN SMALL LETTER F WITH HOOK
=84   U+201E   DOUBLE LOW-9 QUOTATION MARK
=85   U+2026   HORIZONTAL ELLIPSIS
=86   U+2020   DAGGER
=87   U+2021   DOUBLE DAGGER
=88   U+02C6   MODIFIER LETTER CIRCUMFLEX ACCENT
=89   U+2030   PER MILLE SIGN
=8A   U+0160   LATIN CAPITAL LETTER S WITH CARON
=8B   U+2039   SINGLE LEFT-POINTING ANGLE QUOTATION MARK
=8C   U+0152   LATIN CAPITAL LIGATURE OE
=8E   U+017D   LATIN CAPITAL LETTER Z WITH CARON
=91   U+2018   LEFT SINGLE QUOTATION MARK
=92   U+2019   RIGHT SINGLE QUOTATION MARK
=93   U+201C   LEFT DOUBLE QUOTATION MARK
=94   U+201D   RIGHT DOUBLE QUOTATION MARK
=95   U+2022   BULLET
=96   U+2013   EN DASH
=97   U+2014   EM DASH
=98   U+02DC   SMALL TILDE
=99   U+2122   TRADE MARK SIGN
=9A   U+0161   LATIN SMALL LETTER S WITH CARON
=9B   U+203A   SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
=9C   U+0153   LATIN SMALL LIGATURE OE
=9E   U+017E   LATIN SMALL LETTER Z WITH CARON
=9F   U+0178   LATIN CAPITAL LETTER Y WITH DIAERESIS
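A hedged Java sketch of a quick check for this difference: decode the 0x80-0x9F byte range as both ISO-8859-1 and windows-1252 and compare the resulting code points against the table above.

    public class Cp1252Probe {
        public static void main(String[] args) throws Exception {
            byte[] c1Range = new byte[0x20];
            for (int i = 0; i < c1Range.length; i++) {
                c1Range[i] = (byte) (0x80 + i);
            }
            // ISO-8859-1 maps these bytes to invisible C1 control characters;
            // windows-1252 maps most of them to the Euro, curly quotes, dashes, etc.
            String asLatin1 = new String(c1Range, "ISO-8859-1");
            String asCp1252 = new String(c1Range, "windows-1252");
            for (int i = 0; i < c1Range.length; i++) {
                System.out.printf("0x%02X  latin1=U+%04X  cp1252=U+%04X (%s)%n",
                        0x80 + i, (int) asLatin1.charAt(i), (int) asCp1252.charAt(i),
                        asCp1252.charAt(i));
            }
        }
    }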

Japanese

Test Far East Asian characters primarily using Japanese. Japanese writing consists of four scripts used together:

Latin-script (romaji)
The familiar ASCII characters, with a couple of twists.
Two phonetic scripts referred to collectively as "kana"
hiragana (ひらがな) and katakana (カタカナ)
kanji
Han ideographs--complex characters borrowed from Chinese (漢字)

There are various ranges of characters to test when working with Japanese. First there are "width" distinctions in the text. There are characters whose underlying representation in legacy (non-Unicode) encodings is two bytes long. These are called zenkaku or "wide" characters. The opposite of these are characters whose underlying representation takes one byte. These are called hankaku or "narrow" characters.

Zenkaku characters include kana of both types. There is also a set of compatibility characters that represent the ASCII range. For example: ＡＢＣＤＥＦＧ as well as かたかなひらがな漢字①② etc.

Hankaku characters usually refer to single-byte katakana characters (although ASCII can also be referred to as hankaku). For example: ｶﾀｶﾅ. Notice that the characters actually appear to be narrow!

One common use for hankaku characters is in menu items to save space.

The narrow katakana characters require a separate single-shift sequence in EUC-JP (an encoding commonly used on Unix) and are often not supported there. Be aware of configurations that filter out hankaku characters or convert them to their wide equivalents.

Japanese also features a few combining marks, the dakuten and handakuten (voicing marks). Some characters are "pre-composed" (that is, a single Unicode character incorporates the mark) and others are distinct. The marks must be handled correctly either way. Here are examples of both (a normalization check follows the examples):

Pre-composed
ぶびばぱぴ
Combining
は゛ひ゛ふ゛は゜ひ゜ーヾ
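One concrete processing check (a hedged Java sketch): Unicode normalization (NFC) should turn the combining form into the precomposed character. Note that the true combining marks are U+3099 and U+309A; the spacing marks U+309B/U+309C shown above do not compose.

    import java.text.Normalizer;

    public class KanaNormalize {
        public static void main(String[] args) {
            // HIRAGANA LETTER HA (U+306F) + COMBINING VOICED SOUND MARK (U+3099)
            String combining = "\u306F\u3099";
            String precomposed = "\u3070"; // HIRAGANA LETTER BA

            String nfc = Normalizer.normalize(combining, Normalizer.Form.NFC);
            System.out.println("NFC(ha + voiced mark) == ba ? " + nfc.equals(precomposed));
            // Both forms should be treated as the same text by search, comparison, and storage.
        }
    }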

If you process data using legacy encodings, then you should look for well-known trailing-byte issues, such as a trailing byte of 0x5C (backslash). There is a list of Unicode code points that in Shift-JIS have a trailing byte of 0x5C later in this document. A sequence that tests this can be easily typed: the Japanese word for "source code" is "so-su" written in katakana. Type "so-su" and you've typed a sequence with a 0x5C trailing byte (like this: ソース).
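A hedged Java sketch that makes the hazard visible: the Shift-JIS bytes for ソース contain 0x5C, so byte-oriented code that treats 0x5C as an escape character or path separator will corrupt the text.

    public class TrailingByteCheck {
        public static void main(String[] args) throws Exception {
            String sosu = "ソース"; // Japanese for "source"
            byte[] sjis = sosu.getBytes("Shift_JIS");
            StringBuilder hex = new StringBuilder();
            for (byte b : sjis) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            // Expect something like: 83 5C 81 5B 83 58 -- note the 0x5C trailing byte
            System.out.println("Shift-JIS bytes: " + hex);
        }
    }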

For more information on keyboarding, see Learn to Type Japanese (and other languages).

The Reverse Solidus Problem

On Far East Asian operating systems, the code position of the Unicode character U+005C (yes, it is backslash) was used in the legacy national variants of ASCII (or, more properly, ISO 646) to represent the local currency symbol. When Japanese, Chinese, or Korean users type a "\" they expect to see their local equivalent symbol (the yen sign U+00A5 "¥", the yuan sign, or the won sign U+20A9 "₩" respectively). Unicode has separate code points for these characters, and it is important to differentiate the various applications of each character. For example, in path names on Windows the currency symbol should be shown in these locales, but in data you want to show the real Unicode value.

Mapping Differences

Certain characters are mapped differently by different vendors. In Japanese, for example, there are several characters that Microsoft Windows (code page 932) maps differently to/from Unicode than other systems (such as Oracle JA16SJIS or Solaris PCK versions of Shift-JIS) do. Here is a table of the mappings:

JIS X 0208 code | Non-Windows mapping            | Windows mapping
0x2141 ('~')    | U+301C (Wave Dash)             | U+FF5E (Fullwidth Tilde)
0x2142 ('∥')    | U+2016 (Double Vertical Line)  | U+2225 (Parallel To)
0x215D ('-')    | U+2212 (Minus Sign)            | U+FF0D (Fullwidth Hyphen-Minus)
0x224C ('¬')    | U+00AC (Not Sign)              | U+FFE2 (Fullwidth Not Sign)

These four characters can result in a Unicode program losing data. For example, if you convert the Shift-JIS byte sequence for the "wave dash" character to Unicode using the Windows mapping, you'll get U+FF5E. If you then store U+FF5E in an Oracle database using the JA16SJIS character set, you'll get a question mark (a failed conversion) because Oracle doesn't map U+FF5E back to the wave dash.
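The difference is easy to observe with the JDK's two Shift-JIS charsets (a hedged sketch; "Shift_JIS" uses the non-Windows mapping, "windows-31j" uses Microsoft code page 932):

    public class WaveDashMapping {
        public static void main(String[] args) throws Exception {
            byte[] waveDashBytes = { (byte) 0x81, (byte) 0x60 }; // Shift-JIS "wave dash" bytes
            String plain = new String(waveDashBytes, "Shift_JIS");
            String windows = new String(waveDashBytes, "windows-31j");
            System.out.printf("Shift_JIS   -> U+%04X%n", (int) plain.charAt(0));   // U+301C expected
            System.out.printf("windows-31j -> U+%04X%n", (int) windows.charAt(0)); // U+FF5E expected
        }
    }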

Bidi and Complex Scripts

Bidirectional languages are those languages customarily written from right-to-left, such as Arabic and Hebrew.

Complex scripts are those whose characters change shape or are composed contextually. Complex scripts include Arabic, but also the Indic scripts (Hindi, Gujarati, Kannada, Bengali, Gurmukhi, etc.), related scripts such as Thai, and a few oddities such as Vietnamese. Note that this last language is written in the Latin script!

Although software doesn't always provide full support for bidi display (such as reversing the screen layout or putting the scroll bar on the left side), support for these types of scripts is still important for internationalized products. The product should support correct Unicode Bidirectional Algorithm (UBA) display of bidi text and the bidi control characters (such as the RLM and LRM characters). Complex scripts should be displayed correctly and not broken by poor coding choices.

A good source for text to use in tests is the W3C I18N GEO page located here: GEO Tests. In many of the pages there are graphics of what the text should look like, followed by content you can copy (see especially the bidi and whitespace tests).

Here are some examples of complex script texts that you can use to validate display and text handling:

Thai Text: runs left-to-right, complex shaping (vowels)
งานออกแบบรายการใช้เครื่องระบบ สากล
งานออกแบบรายการใช้เครื่องระบบ1234 สากล
งานออกแบบรายการใช้เครื่องระบบLatin textสากล
Hindi (actually Devanagari) Text: runs left-to-right, complex shaping
हिन्दी the word for Hindi
शुक्रवार, जनवरी ०७, २००५ : a date using local number shapes
Arabic Text: runs right-to-left, complex shaping
قال عالم إيطالي يعمل في مشروع مسبار المريخ الفضائي إن الغازات التي تم اكتشاف وجودها على سطح المريخ قد تعطي دلائل على إمكانية وجود حياة على هذا الكوكب الأحمر.
the next item as a graphic
صفقة مع شركة روفر قد تكلفها 2000 وظيفة
Vietnamese Text: left-to-right, multiple combining marks
Khoa học gia nổi danh của Đức từ chức vì giả mạo suốt 30 năm
Korean Text: not actually a complex script; however, "single-byte" Korean is composed.
자동으로 로그인합니다. this is the phrase "remember my account on this computer"

A good source for texts in multiple languages (some of the above are taken from there) is: BBC Worldservice

Advanced Unicode Testing

Unicode offers an array of additional complexities that must be tested. These include supplementary (non-BMP) characters and surrogate pairs, combining marks and grapheme joiners, bidi control characters, and private-use characters.

These types of characters need to be tested even though fonts may be difficult to come by. Displaying tofu is reasonable in this case, as long as the right amount of tofu appears (one block, not two, for a supplemental character).
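A hedged Java sketch of the "one block, not two" point: a supplemental character is one code point but two UTF-16 code units, so code that counts chars will over-count it.

    public class SupplementalCheck {
        public static void main(String[] args) {
            // U+10000, the first supplemental character, as a surrogate pair
            String supplemental = new String(Character.toChars(0x10000));
            System.out.println("UTF-16 code units: " + supplemental.length());            // 2
            System.out.println("Code points:       "
                    + supplemental.codePointCount(0, supplemental.length()));             // 1
            // It should display as a single glyph (or a single tofu block), never two.
        }
    }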

Unicode Test Data

To fully test whether a system "supports Unicode", you need to test the different kinds of rendering and processing that can occur. The things you need to test include selection of text; cursor movement in the text; line and word breaking; storage and retrieval; upper- and lowercasing of text; comparison of text; and so forth. The following data set is designed to exercise these capabilities. Some of the test cases described later refer to a specific test in this table; a short casing sketch also follows the table.

Each test below lists the test number, the sample characters, the UTF-16 character values, and comments.

Supplemental Characters
  1. U+D800 U+DC00 U+D800 U+DC01 | the first two supplemental characters as surrogate pairs. In UTF-16 this is four 16-bit code units. In UTF-8 this is two four-byte characters (corresponds to Oracle encoding "AL32UTF8"). In CESU-8 this is four three-byte characters (corresponds to Oracle encoding "UTF8").

Combining Marks and Accents
  2. àéîōũ | a U+0300, e U+0301, i U+0302, o U+0304, u U+0303 | combining marks for vowels (this example is not realistic)
  3. você nós mãe avô irmã criança | voc U+00EA, n U+00F3 s, m U+00E3 e, av U+00F4, irm U+00E3, crian U+00E7 a | Portuguese (DOS 860 test) (words for: you, we, mother, grandfather, sister, child)
  4. €ŒœŠš™©‰ƒ | U+20AC U+0152 U+0153 U+0160 U+0161 U+2122 U+00A9 U+2030 U+0192 | Windows-1252 test (only one of these, the copyright sign, is a Latin-1 character)
  5. ışık bir İyi Günler | U+0131 U+015F U+0131 U+006B, U+0062 U+0069 U+0072, U+0130 U+0079 U+0069 | Turkish (dotted and dotless letter "i") ("light", "one", "good day"). NB: dotted lowercase i uppercases to U+0130 (capital I with dot above), while uppercase I lowercases to dotless lowercase i (U+0131)
  6. がざばだぱか゛さ゛た゛は゜ | U+309B U+309C | dakuten and handakuten marks1: both precomposed and combining forms
  7. אִ͏ַ | U+05D0 U+05B4 U+034F U+05B7 | Combining Grapheme Joiner (the sequence is from the Unicode CGJ FAQ)
  8. ≠q̌ | U+D84C U+DFB4 U+2260 U+0071 U+030C | supplemental character plus combining marks (from CharMod Section 6.1)

Bidi with Latin
  9. abcאבגדabc | abc U+05D0 U+05D1 U+05D2 U+05D3 abc | left-right-left
  10. אבגדabcאבגד | U+05D0 U+05D1 U+05D2 U+05D3 abc U+05D0 U+05D1 U+05D2 U+05D3 | right-left-right
  11. אבגד 012 אבגד | U+05D0 U+05D1 U+05D2 U+05D3 012 U+05D0 U+05D1 U+05D2 U+05D3 | right-weak-right
  12. 012 אבגד 012 | 012 U+05D0 U+05D1 U+05D2 U+05D3 012 | weak-right-weak

Bidi with Asian
  13. אבגדソースאבגד | U+05D0 U+05D1 U+05D2 U+05D3 U+30BD U+30FC U+30B9 U+05D0 U+05D1 U+05D2 U+05D3 | right-left-right
  14. ソースאבגדそーす | U+30BD U+30FC U+30B9 U+05D0 U+05D1 U+05D2 U+05D3 U+305D U+30FC U+3059 | left-right-left
  15.

Complex Scripts
  16. สวัสดี | U+0E2A U+0E27 U+0E31 U+0E2A U+0E14 U+0E35 | Thai (greeting)
  17. டாஹ்கோ | U+0B9F U+0B9E U+0B99 U+0BCD U+0B95 U+0BCB | Tamil (from CharMod Appendix B)
  18. بِسْمِ اللّهِ الرَّحْمـَنِ الرَّحِيمِ | U+0628 U+0650 U+0633 U+0652 U+0645 U+0650 U+0627 U+0644 U+0644 U+0651 U+0647 U+0650 U+0627 U+0644 U+0631 U+0651 U+064E U+062D U+0652 U+0645 U+0640 U+064E U+0646 U+0650 U+0627 U+0644 U+0631 U+0651 U+064E U+062D U+0650 U+064A U+0645 U+0650 | Arabic (the first line of the Qur'an); this text has the vowels in it and demonstrates the full complexity of Arabic text

Numeric Shaping
  19. 01234
  20. عدد مارس ١٩٩٨ | U+0639 U+062F U+062F, U+0645 U+0627 U+0631 U+0633, U+0661 U+0669 U+0669 U+0668 | Arabic (from CharMod); "1998" is at the end (yes, the end: remember that Arabic is read right-to-left)

Bidi Controls and Mirroring
  21.
  22.

Private Use
  23. U+E000 U+E001 U+E002 U+E003 U+E004

Common Scripts and Encodings
  24. Слава Жанна Ювеналий Ярополк | Cyrillic (Russian, Ukrainian, Serbian, and others): Russian names
  25. Greek
  26. Latin-1
  27. Latin-2
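Several of the rows above exercise processing as well as display. For example, test 5 only passes when casing is performed with the Turkish locale; a hedged Java sketch:

    import java.util.Locale;

    public class TurkishCasing {
        public static void main(String[] args) {
            Locale turkish = new Locale("tr", "TR");
            System.out.println("i".toUpperCase(turkish));   // İ (U+0130), not plain I
            System.out.println("I".toLowerCase(turkish));   // ı (U+0131), not plain i
            System.out.println("i".toUpperCase(Locale.US)); // I -- the wrong answer for Turkish data
        }
    }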

Test Cases

Text Selection

Use data sets: 1-18

Display the text. Select the text using the mouse from both the left and right sides. The text should be selected one glyph at a time (it shouldn't jump around or allow partial character selection, you shouldn't see any tofu jump into view, etc.). Bidi text should select in logical order (that is, for texts such as #11, you should be able to have two separate regions selected at the same time). For an example of discontiguous visual selection see: [CharMod] Section 3.3.1.

Visual Rendering

Footnotes & References

Character Model for the World Wide Web 1.0: Fundamentals

W3C I18N Geo glossary

Roman Czyborra's Alphabet Soup shows various encodings and offers text downloads with Unicode code points enumerated.

Omniglot gives examples of different writing systems and explains them.

1dakuten, handakuten: Japanese voicing marks. These are usually precomposed and are only used with kana (either katakana or hiragana). Here we present the unprecomposed (combining) forms as a test. For a more thorough explanation, including how to type these naturally in the IME, see Biography.ms.