A Delphi "Internationalization Cookbook"

The Delphi programming language, Borland's answer to Visual Basic and the evolution of Turbo Pascal, has been around for many years now. As a rapid application development ("RAD") tool, it excels at allowing developers to write Windows applications in a hurry.

As an internationalization vehicle it sucks.

This isn't entirely Borland's fault, of course. The nature of Windows when Delphi was introduced was still closely tied to code pages. The vast majority of users were running Windows 9x or even Windows 3.x. Windows NT was something your server ran. Thus Delphi originally was not a Unicode environment. The string types and libraries were tied to code pages and thus represent a multibyte environment.

Delphi's original string type ("ShortString") could store up to 255 bytes (plus a leading length byte) and was defined as a single byte character set roughly analogous to the U.S. English DOS code page (Cp437). As Delphi matured, it received an infusion of multibyte support in the form of the AnsiString type. This type can store arbitrarily long strings that consist of an array of bytes ("characters", says Borland's documentation, although this is wrong) in a particular encoding (also known as a "code page" in Windows). In order to understand the problem that encodings represent, we need to define our terminology a bit more carefully.

A collection of characters, each of which is assigned an integer value, is called a character set. A character set is just a collection; each character set may have one or more encodings. An "encoding" is the particular byte pattern used to represent the characters in the memory of a computer.

Single byte encodings (or SBCS) take each available byte value from 0x00 through 0xFF and assign a character to it. Each byte has a 1:1 mapping to a character in that encoding. For most SBCS encodings, the mapping of bytes to Unicode characters is also 1:1.

Developers sometimes fall in love with this relationship between bytes and characters and seek every chance to "optimize" their code for single-byte encodings, but as we'll see, this causes problems later. In practice, developers need to understand that single byte encodings are a special case of multibyte encodings.

Multibyte encodings (or MBCS) are used to encode character sets that are too large to provide a unique byte to each character. The languages most people think of for this are the Far East Asian languages, such as Japanese, Chinese, and Korean, whose writing systems include thousands of ideographic characters. The Japanese call these ideographic characters kanji and use them to write a portion of their language in combination with other writing systems. You'll sometimes hear people talk (again erroneously) about "kanji-enabling" or "kanji support" when they mean "multibyte".

Multibyte encodings map a sequence of one, two, three, and sometimes four bytes to a single character. The number of bytes used to represent each character may vary and is independent of other characters encoded. Just how depends on the encoding scheme used.

A good example of this is the Japanese encoding Shift-JIS. By themselves, the bytes 0x00 through 0x7F and 0xA0 through 0xDF represent single-byte characters. The bytes in the ranges 0x81 through 0x9F and 0xE0 through 0xFC are called "lead bytes" and introduce a multibyte (in this case a two byte long) sequence. Following a lead byte are the "trail bytes". In the case of Shift-JIS, the trailing bytes fall into the ranges 0x40-0x7E and 0x80-0xFC.

Notice that the trailing byte ranges in Shift-JIS overlap the range of both the single byte and lead byte characters. If you place a pointer into the middle of a byte stream of Shift-JIS, you cannot tell if a particular byte is a lead or trailing byte: you have to read from the beginning of the encoded sequence to know for sure. Here is an example of Shift-JIS to give you an idea:

character: @      A      B      、        ＝
bytes:     0x40   0x41   0x42   0x81 0x41 0x81 0x81
Unicode:   U+0040 U+0041 U+0042 U+3001    U+FF1D
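
The need to scan from the start of the string can be sketched in code. The program below is only an illustration, not part of any Delphi library: the names TSjisByteKind, IsSjisLeadByte, and SjisByteKind are made up for this example, and the lead byte ranges are simply the ones quoted above. It builds the sample byte sequence from the table and classifies each byte.

```pascal
program SjisScan;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TSjisByteKind = (bkSingle, bkLead, bkTrail);  // hypothetical type

const
  KindNames: array[TSjisByteKind] of string = ('single', 'lead', 'trail');
  // The sample from the table: @ A B, then 0x81 0x41, then 0x81 0x81.
  Sample: array[1..7] of Byte = ($40, $41, $42, $81, $41, $81, $81);

// Lead byte ranges as given in the text: 0x81-0x9F and 0xE0-0xFC.
function IsSjisLeadByte(B: Byte): Boolean;
begin
  Result := ((B >= $81) and (B <= $9F)) or ((B >= $E0) and (B <= $FC));
end;

// Classifies the byte at Index (1-based, within S) by walking from the start.
function SjisByteKind(const S: AnsiString; Index: Integer): TSjisByteKind;
var
  I: Integer;
begin
  Result := bkSingle;
  I := 1;
  while I <= Index do
  begin
    if IsSjisLeadByte(Ord(S[I])) and (I < Length(S)) then
    begin
      // A two-byte character occupies positions I and I + 1.
      if Index = I then
        Result := bkLead
      else if Index = I + 1 then
        Result := bkTrail;
      Inc(I, 2);
    end
    else
      Inc(I);  // single-byte character: Result keeps its default
  end;
end;

var
  S: AnsiString;
  I: Integer;
begin
  SetLength(S, High(Sample));
  for I := Low(Sample) to High(Sample) do
    S[I] := AnsiChar(Sample[I]);
  for I := 1 to Length(S) do
    WriteLn(I, ': ', KindNames[SjisByteKind(S, I)]);
end.
```

Bytes 1 through 3 come out as single, bytes 4 and 6 as lead, and bytes 5 and 7 as trail. Note that byte 5 (0x41) has the same value as the letter A at byte 2, and byte 7 (0x81) has a value in the lead byte range: only the scan from the beginning tells them apart.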

Wide encodings use a fixed number of bytes, generally two, for each character. In some cases the size of a "wide" encoding is dependent on processor architecture and support libraries (so that a 32-bit processor will have 32-bit wide characters). A wide encoding is much like a single byte encoding, in that the relationship of bytes to characters is fixed to some integer ratio (16 bits per character, 32 bits per character...).

In Delphi, Unicode is represented using a wide encoding (UTF-16) and a special class called "WideString". Changing over to Unicode requires porting the code to the most recent versions of Delphi coupled with changing all the text data types throughout the application and the associated method or function calls in the code. In other words, it's a big project. For new projects or for code that can afford to be ported, converting to use Unicode is the best choice: it gives you a product that can process and display data regardless of its source on any configuration of Windows.

Making Delphi code behave in a multibyte world, though, isn't so hard and can be justified as an approach, as long as the developer is aware of the code page limitations imposed by that choice. As long as the program can live with a multibyte ANSI (or OEM) code page, it can do virtually anything it needs to do.

There are basically two problems a developer must address to multibyte-enable a Delphi program. First, the program must deal with fonts and encodings in forms (used to display characters on the screen). Second, the program needs to deal with character processing in code.

Forms and Fonts

Adjusting the visual display of a Delphi program is the first problem. Most forms (.dfm files) are set up with the default code page of the system they were authored on and use fonts that are not ready for display of characters from another code page. For example, the font "Times" works well for most alphabetic languages--versions of it exist for Latin, Greek, Cyrillic, and Arabic alphabets (technically Arabic is an abjad), but not so well for Far East Asian languages, since it doesn't have glyphs (pictures of the characters) available for these large character sets. Forms that use the wrong fonts can display "tofu" (empty boxes shown in place of missing glyphs), but usually the font and code page are combined and you see "mojibake" (garbled characters produced by interpreting bytes in the wrong encoding) instead.

In order to fix forms, the developer must make a few changes to the form's properties. First, the form character encoding should always be set to the right encoding. Any Windows system can have two or three encodings associated with different aspects of the system. The "ANSI code page" is the encoding used in a window (such as a title bar, menu, or dialog box). The "OEM code page" is the encoding used by the command shell ('cmd.exe' or 'command.com'). The system code page is used by the console not associated with a specific user (such as a Windows Service) and is generally the ANSI code page associated with the locale used by the service in question.

Internally, Windows NT (which includes 2000, XP, and all later operating systems) is really based on UTF-16 Unicode. The display is fully capable of displaying text in the full range of Unicode, even in the command shell. The code page limitation is a compatibility layer that allows programs (such as any Delphi program) to interact with the system as if it were tied to a specific character encoding (as Windows 9x was).

Delphi forms are thus interpreted according to a specific encoding assigned in the form. You could assign, for example, a Japanese code page to a form file and all the strings and fonts on that form (and its children) would display using that encoding and character set. See Appendix A for the list of charset values. The correct value to use is always DEFAULT_CHARSET (0x01).

The second thing the developer must set is the font name. A statically named font such as Times won't cover all potential languages. The alternative is to use a logical font. On Windows 2000 and later, this font is called "MS Shell Dlg 2". It isn't listed in any font drop downs: the developer must type the name in manually. Once set, the font uses font linking in order to produce the right display for a given language configuration of Windows. If the code must also run on Windows 9x, the font "MS Shell Dlg" (without the "2") can be used instead. This produces the same effect and works with modern Windows versions too, but the font used is different on English and other Western European configurations.
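
Both settings can also be applied from code instead of the Object Inspector. The unit below is a minimal sketch, assuming a VCL application; the unit name ShellFontUtil and the procedure ApplyShellFont are hypothetical names chosen for this example.

```pascal
unit ShellFontUtil;

interface

uses
  Windows, Graphics, Forms;

// Hypothetical helper: call it from a form's OnCreate handler.
procedure ApplyShellFont(AForm: TForm);

implementation

procedure ApplyShellFont(AForm: TForm);
begin
  // The charset value recommended in the text (see Appendix A).
  AForm.Font.Charset := DEFAULT_CHARSET;
  // Logical font name; use 'MS Shell Dlg' if Windows 9x must be supported.
  AForm.Font.Name := 'MS Shell Dlg 2';
end;

end.
```

Controls on the form that leave ParentFont set to True pick up the form's font, so setting it once on the form is usually enough.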

Character Processing and Code

As with C programming, the main obstacle to multibyte handling in a Delphi program is the fact that there are multibyte unaware functions and that developers are commonly taught to use these instead of their multibyte aware counterparts. Developers are basically taught to code in a style in which "1 char == 1 byte == 1 character", that is, where one character is always exactly one byte long. Multibyte characters from encodings such as Shift-JIS (Cp932), GB2312 (Cp936), or UTF-8 (Cp65001) don't work that way. In a multibyte encoding, characters can be one, two, three, or even four bytes long and each character's length is independent of the characters around it.

Delphi's own documentation says:

Object Pascal supports single-byte and multibyte characters and strings through the Char, PChar, AnsiChar, PAnsiChar, and AnsiString types. Indexing of multibyte strings is not reliable, since S[i] represents the ith byte (not necessarily the ith character) in S. However, Delphi's standard string-handling functions have multibyte-enabled counterparts that also implement locale-specific ordering for characters. (Names of multibyte functions usually start with Ansi-. For example, the multibyte version of StrPos is AnsiStrPos.) Multibyte character support is operating-system dependent and based on the current Windows locale.

This means that almost any code that increments a pointer to a character (byte) array using inc() or += 1 is doing something wrong. A partial list of functions that are not multibyte aware or which have a tendency to be abused or used improperly is:

Pos()            =>  AnsiPos()
PosEx()          -   Rewrite code to use AnsiPos()
Copy()           -   Check surrounding code
Length()         -   Check surrounding code
Insert()         -   Check surrounding code
Delete()         -   Check surrounding code
UpperCase()      =>  AnsiUpperCase()
LowerCase()      =>  AnsiLowerCase()
QuotedStr()      =>  AnsiQuotedStr() and AnsiDequotedStr()
CompareText()    =>  AnsiCompareText() or AnsiSameText()
CompareStr()     =>  AnsiCompareStr() or AnsiSameStr()
StringReplace()  -   Check surrounding code

Combining a code crawl with searches for these functions can produce code that detects multibyte characters correctly. Let's look at each of the above functions in turn to see why they matter to us.

Navigating Multibyte Strings

First we need to find out how to find character boundaries. That is, is the byte we are pointing at a lead byte or trailing byte in a multibyte character, or is it a single byte character? Recall that in Windows code pages, a trailing byte can take a wide range of values (in our Shift-JIS example, they fall into the ranges 0x40 through 0x7E and 0x80 through 0xFC; some Windows encodings have trailing bytes as low as 0x30). Other encodings, such as UTF-8, can have more than two bytes per character. There are two kinds of operation that a programmer must be able to perform in order to write internationalized code with these encodings:

First, the code has to recognize character boundaries (or if the pointer is in the middle of a character). Support for this is provided via the functions ByteType and StrByteType. These functions allow a program to recognize when a "char" (byte) is actually a single byte value (or when it is part of a multibyte value). For example:

Original Function

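function TestFunc(const aStr: string): string;
var
  i: Integer;
begin
  Result := aStr;  // work on a copy: a const parameter cannot be modified
  i := 1;          // Delphi strings are indexed from 1
  while i <= Length(Result) do
  begin
    // compares a raw byte without checking whether it is part of a
    // multibyte character
    if Result[i] = '.' then
    begin
      Insert('*', Result, i);
      Inc(i);      // skip the '.' that Insert moved one position right
    end;
    Inc(i);
  end;
end;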

Corrected Function

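function TestFunc(const aStr: string): string;
var
  i: Integer;
begin
  Result := aStr;  // work on a copy: a const parameter cannot be modified
  i := 1;          // Delphi strings are indexed from 1
  while i <= Length(Result) do
  begin
    // only treat '.' as a match when it is a whole single-byte character
    if (ByteType(Result, i) = mbSingleByte) and (Result[i] = '.') then
    begin
      Insert('*', Result, i);
      Inc(i);      // skip the '.' that Insert moved one position right
    end;
    Inc(i);
  end;
end;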

Second, the code has to be able to move a string pointer over a whole character at a time (or know the number of bytes in the current character). The function StrNextChar increments a pointer (instead of using inc(), which isn't multibyte aware).
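
As a rough sketch of the second operation, the program below counts characters rather than bytes by stepping through the string with StrNextChar. The function name CountChars is made up for this example, and the sketch assumes a string with no embedded null bytes. For multibyte text the result depends on the system code page, since that is what StrNextChar consults.

```pascal
program CountCharsDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: counts characters (not bytes) by moving the
// pointer one whole character at a time.
function CountChars(const S: string): Integer;
var
  P: PChar;
begin
  Result := 0;
  P := PChar(S);
  while P^ <> #0 do
  begin
    Inc(Result);
    P := StrNextChar(P);  // steps over a lead byte and its trail byte together
  end;
end;

begin
  WriteLn(CountChars('abc'));  // 3: plain ASCII, one byte per character
end.
```

Replacing StrNextChar(P) with Inc(P) would turn this back into a byte count, which is exactly the habit this article is trying to break.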

Pos, PosEx

Pos() and PosEx() work like the C programming language's strstr function: they find a substring (an array of bytes) within another string and return an integer giving the position of the first match; PosEx() additionally takes a starting offset. Alas, lead and trailing bytes are not something these functions know about. The AnsiPos() function does know about multibyte characters. Use it exclusively instead.
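
Here is a small sketch of the false match, reusing the ideographic comma from the Shift-JIS example (bytes 0x81 0x41). It assumes a pre-Unicode Delphi (such as Delphi 7) where string is an AnsiString. What AnsiPos() returns depends on the system ANSI code page, so the second result is described rather than promised.

```pascal
program PosDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: AnsiString;
begin
  // One Shift-JIS character: lead byte 0x81 followed by trail byte 0x41.
  SetLength(S, 2);
  S[1] := AnsiChar($81);
  S[2] := AnsiChar($41);

  // Pos compares bytes, so it "finds" the letter A (0x41) inside the
  // two-byte character and reports position 2.
  WriteLn('Pos:     ', Pos('A', S));

  // On a system whose ANSI code page is 932 (Japanese), AnsiPos rejects
  // a match that starts on a trail byte and should report 0. On a
  // single-byte code page it sees two one-byte characters and reports 2.
  WriteLn('AnsiPos: ', AnsiPos('A', S));
end.
```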

Copy, Insert, Delete, StringReplace

These functions modify strings, which are arrays of bytes. The key is to use each one in a multibyte safe manner. That means detecting that one is indeed on a character boundary and copying only up to a character boundary (or inserting or deleting only whole characters). This doesn't mean replacing the function call: it means using multibyte aware logic to find the integer values to pass to these functions.
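
For instance, truncating a string to a byte limit is a common place to split a character in half. The sketch below keeps the call to Copy() but uses ByteType to adjust the length first. The function name LeftBytes is hypothetical, and the logic assumes a double-byte code page (at most two bytes per character), which is what ByteType reports on.

```pascal
program LeftBytesDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns at most MaxBytes bytes from the start of S
// without cutting a two-byte character in half.
function LeftBytes(const S: string; MaxBytes: Integer): string;
var
  Len: Integer;
begin
  Len := Length(S);
  if MaxBytes < Len then
  begin
    Len := MaxBytes;
    // If the last byte kept is a lead byte, its trail byte would be cut
    // off, so drop the lead byte as well.
    if (Len > 0) and (ByteType(S, Len) = mbLeadByte) then
      Dec(Len);
  end;
  if Len < 0 then
    Len := 0;
  Result := Copy(S, 1, Len);
end;

begin
  WriteLn(LeftBytes('abcdef', 3));  // abc: no multibyte characters involved
end.
```

The same pattern applies to Insert() and Delete(): compute the index or count with multibyte aware logic, then pass it to the ordinary function.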

UpperCase, LowerCase, QuotedStr, CompareText, CompareStr

These functions are also not multibyte aware or are not locale aware. As a result, they produce problematic results with multibyte text. They can also produce culturally insensitive results (for example, in comparisons or when performing casing in specialized locales, such as Turkish and Azerbaijani, that use different casing rules than English does).
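
The difference shows up even in a single-byte Western code page. The sketch below assumes a system whose ANSI code page is Windows-1252, where byte 0xE9 is a lowercase e with an acute accent and 0xC9 is its uppercase form; on other code pages the Ansi- results may differ.

```pascal
program CaseDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;
begin
  // 'caf' followed by byte 0xE9 (e-acute in Windows-1252).
  S := 'caf'#$E9;

  // UpperCase only converts 'a'..'z', so the accented letter is untouched.
  WriteLn(UpperCase(S) = 'CAF'#$E9);      // TRUE

  // AnsiUpperCase asks Windows to do the casing; under the code page
  // assumption above, the accented letter is converted too.
  WriteLn(AnsiUpperCase(S) = 'CAF'#$C9);  // TRUE on a Windows-1252 system
end.
```

The same reasoning applies to comparisons: prefer AnsiSameText(A, B) over UpperCase(A) = UpperCase(B).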

For a (Java-centric) look at character encodings and the complexities involved, see "Are We Counting Bytes Yet?".

Appendix A. List of Charsets in Delphi

0x00 - ANSI_CHARSET
ANSI characters

0x01 - DEFAULT_CHARSET
Font is chosen based solely on name and size. If the described font is not available on the system, Windows will substitute another font.

0x02 - SYMBOL_CHARSET
Standard symbol set

0x4D - MAC_CHARSET
Macintosh characters

0x80 - SHIFTJIS_CHARSET
Japanese Shift-JIS characters

0x81 - HANGEUL_CHARSET
Korean characters (Wansung)

0x82 - JOHAB_CHARSET
Korean characters (Johab)

0x86 - GB2312_CHARSET
Simplified Chinese characters (Mainland China)

0x88 - CHINESEBIG5_CHARSET
Traditional Chinese characters (Taiwanese)

0xA1 - GREEK_CHARSET
Greek characters

0xA2 - TURKISH_CHARSET
Turkish characters

0xA3 - VIETNAMESE_CHARSET
Vietnamese characters

0xB1 - HEBREW_CHARSET
Hebrew characters

0xB2 - ARABIC_CHARSET
Arabic characters

0xBA - BALTIC_CHARSET
Baltic characters

0xCC - RUSSIAN_CHARSET
Cyrillic characters

0xDE - THAI_CHARSET
Thai characters

0xEE - EASTEUROPE_CHARSET
Sometimes called the "Central European" character set, this includes diacritical marks for Eastern European countries

0xFF - OEM_CHARSET
Depends on the codepage of the operating system