webMethods, Inc. | Get There Faster.

Are We Counting Bytes Yet?

Writing Encoding Converters Using Java NIO

by Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.

Introduction

Before the advent of Unicode internationalization projects mostly focused on enabling software to support a specific language, one language at a time. Multibyte enabling products for specific markets were the most common projects and developers involved in internationalization were well versed in the mind-numbing intricacies of the various encodings used on different platforms to support different languages. The vast number of encodings and wide range of different encoding structures made writing generalized software that supported multiple languages an arduous and complex process, if not downright impossible. Support for individual encodings often had to be hand engineered and tested extensively. Porting products from one language or platform to another was difficult and expensive.

Internationalization folks called this dismal situation "Code Page Hell".

Unicode was a direct result of this arcane and now sometimes forgotten chapter in software history. Unicode mostly tames the problems of engineering multilingual software, either directly (because software uses Unicode and its encodings internally) or indirectly (by providing a stable pivot format for encoding conversions and mapping).

Modern programming languages such as Java ingrain Unicode into their very fabric: character data in Java is expressed solely in terms of Unicode. Everything external to Java is an array of bytes and conversions between the internal world of characters and external world of bytes is provided as part of the java.io and java.nio packages. Java runtimes provide a large set of built-in converters to support reading and writing data in a vast array of encodings.

So why this paper about legacy encodings and "Code Page Hell" if Java has all this wonderful Unicode support?

webMethods provides a number of products that process flat text files, such as EDI or mainframe files. These files can use both character (actually, byte or byte sequence) delimited fields (such as comma separated values) or length-delimited fields (where each record and/or field has a predetermined length—counted in bytes), or a combination of these. Products that support these types of files need to be able to process many different legacy encodings.

Developers, especially Java developers, have grown unfamiliar with the details and intricacies of legacy encodings and need reference material, education, and code to deal with the different types of encodings and these aren't, it turns out, as readily available as they used to be, nor do they focus on Java.

Also, the default Java encoding classes aren't really suited to "counting bytes", that is, processing multibyte encodings on a byte-oriented basis. The character converters embedded in Java were optimized for converting entire input buffers in the most efficient manner. Processing length-delimited files is difficult and inefficient at best. Certain encodings are very difficult or impossible to support in this manner, even with very low performance expectations.

Finally, while Java runtimes typically provide support for a large number of encodings, there are many proprietary encodings still in use that are unsupported by Java. For example, some of webMethods's Japanese customers have data that uses the KEIS encoding. Many mainframe applications exist which rely on buffer lengths or combine delimiters with buffer lengths.

The result of these requirements was a development effort based on new functionality in J2SE 1.4. The Java "New I/O" (NIO) package of classes provide support for installing character converters for obscure encodings or with specialized capabilities.

This paper describes how various legacy encodings work; how to write your own character converters in Java; and one solution to the problem of counting bytes in text files. The solution described here supports (at this date) 873 encodings. No, that's not a typo: can you imagine supporting bidirectional conversions between 873 encodings without Unicode?

About Legacy Encodings

Before we can explore how to write a character converter that can count bytes, it helps to understand how the structure of different kinds of legacy encodings makes this a difficult problem.

There are a variety of different encoding schemes used by legacy encodings. This paper will discuss five basic types of encodings: single-byte, simple multibyte, wide, simple stateful, and complex stateful encodings.

Single byte encodings (or SBCS) take each available byte value and assign a character to it. Each byte has a 1:1 mapping to a character in that encoding. For most SBCS encodings, the mapping of bytes to Unicode characters is also 1:1 (and nearly all of these encodings map entirely to the Basic Multilingual Plane of Unicode, which means that each character consumes just one UTF-16 code unit). This means that a Java program can easily count the size of a field or record based on the number of characters (chars) read or written.

Developers sometimes fall in love with this relationship between bytes and characters and seek every chance to "optimize" their code for single-byte encodings.

Multibyte encodings (or MBCS) are used to encode character sets that are too large to provide a unique byte to each character. The languages most people think of for this are the Far East Asian languages that use Han ideographs, such as Japanese, Chinese, and Korean. The Japanese call these ideographic characters kanji and use them to write a portion of their language in combination with other writing systems, but you'll sometimes hear (again erroneously) some people talk about "kanji-enabling" or "kanji support" when they mean "multibyte".

Multibyte encodings map a sequence of one, two, three, and sometimes four or more bytes to a single character. The number of bytes used to represent each character may vary and is independent of other characters encoded. Just how depends on the encoding scheme used.

Multibyte encodings map to Unicode in a slightly more complex way than single-byte encodings do. Each characterin the multibyte encoding will generally map to a single Unicode character, but since the number of bytes that make up the legacy character varies, one Unicode character might map to a variable number of bytes in a specific encoding.

Various types of multibyte encodings exist and we'll look at these next.

Simple multibyte encodings usually use bit patterns in the encoding's byte sequences to determine how many bytes there are in a specific character.

A good example of this is the Japanese encoding Shift-JIS. By themselves, the bytes 0x00 through 0x7F and 0xA0 through 0xDF represent single-byte characters. The bytes in the ranges 0x81 through 0x9F and 0xE0 through 0xFC are called "lead bytes" and introduce a multibyte (in this case a two byte long) sequence. Following a lead byte are the "trail bytes". In the case of Shift-JIS, the trailing bytes fall into the ranges 0x40-0x7E and 0x80-0xFC.

Notice that the trailing byte ranges in Shift-JIS overlap the range of both the single byte and lead byte characters. If you place a pointer into the middle of a byte stream of Shift-JIS, you cannot tell if a particular byte is a lead or trailing byte: you have to read from the beginning of the encoded sequence to know for sure. Here is an example of Shift-JIS to give you an idea:

character: @      A      B      、        =
bytes:     0x40   0x41   0x42   0x81 0x41 0x81 0x81
Unicode:   U+0040 U+0041 U+0042 U+3001    U+FF1D

Wide encodings use a fixed number of bytes, generally two, for each character. In some cases the size of a "wide" encoding is dependent on processor architecture and support libraries (so that a 32-bit processor will have 32-bit wide characters). A wide encoding is much like a single byte encoding, in that the relationship of bytes to characters is fixed to some integer ratio (2 bytes per character, 4 bytes per character...). Mapping to Unicode is also generally fixed to the same ratio.

The most common wide encodings are the IBM DBCS code pages, such as code page 300. A "double-byte character" set is one in which each character takes two bytes. In this case the name "DBCS" is very strictly accurate, since every character takes exactly two bytes to encode! Sometimes the term is applied to encodings like Shift-JIS, in which characters can take either one or two (but never more) bytes. General usage tends to be sloppy, though: you'll hear talk of "double-byte enabling" or "double-byte support" when the real goal is multibyte enabling or support for a specific language (such as Japanese).

Simple stateful encodings are multibyte encodings that do not use bit patterns or lead/trail byte combinations to identify single and multibyte character sequences. Instead they use a byte or sequence of bytes to shift between different encodings or modes.

IBM multibyte code pages (which they call MBCS code pages) often use this approach, especially for EBCDIC encodings. The IBM code page will use two encodings together: a single-byte code set and a double-byte (wide) code set, with sequences of characters in each encoding separated by control character. The "Shift-In" control code "SI" (0x0E in ASCII) to shift to the wide code set and the "Shift-Out" control code "SO" (0x0F in ASCII) to shift back to the single byte code set. For example, IBM Code Page 930, a common Japanese mainframe encoding, uses code page 290 ("ASCII", er, EBCDIC and kana characters) and code page 300 (the Japanese kanji characters).

In a simple stateful encoding, the shift sequence is not part of either a singlebyte or a multibyte character, so you can have bytes that map to no Unicode character.

Stateful encodings also introduce another complexity into mapping between the encoding and Unicode. In the other encodings mentioned so far, each Unicode character required a fixed number of bytes in the legacy encoding. With stateful encodings, writing a character may require the addition of a shift sequence (to change modes). And the end of a field or record may require the insertion of a shift sequence to ensure that the file can be randomly accessed. This means that the number of bytes required to write a particular character is variable.

character: @      A      B     (SI) 、         =        (SO)
CP930:     0x7C   0xC1   0xC2  0x0E 0x43 0x44  0x42 0x7E 0x0F
Shift-JIS: 0x40   0x41   0x42       0x81 0x41  0x81 0x81
Unicode:   U+0040 U+0041 U+0042     U+3001     U+FF1D

Complex stateful encodings are a grab bag of more complicated encodings. Generally these encodings switch between more than two states and may include special kinds of escape sequences to do this.

One common example is the ISO-2022 family of encodings. ISO-2022 is commonly used in email applications. The design of ISO-2022 allows it to be used with many different encodings in the same document. Each encoding has its own three byte escape sequence that introduces it. Specific language or country-specific variations of ISO-2022 use combinations of character sets to provide complete coverage for a language. For example, ISO-2022-JP encodes the US-ASCII, JISX0201, and JISX0208 character sets, each with a three byte escape sequence.

Special Unicode Considerations

It is important to note some of the things that the descriptions above gloss over. Some encodings have indeterminate mappings to Unicode. That is, some byte sequences can be mapped to more than one Unicode character and/or some Unicode characters can be mapped to more than one byte sequence. Stateful encodings, for example, sometimes encode the same character in two different states.

Many multibyte encodings include support for "private use" characters. Unicode also has private use characters too and these ranges are usually mapped together, often with the first private use byte sequence in the legacy encoding mapped to the first Unicode PUA character (U+E000) and so on. Private use character mapping may become complicated, since Unicode, as a larger set of characters than most legacy encodings, may contain the character that is "private" in some legacy encoding. At the same time, some multibyte encodings also include characters not encoded in or not yet encoded in Unicode and these are also generally mapped to Unicode private use code points. This can make it more difficult to work with legacy encoded data converted to or from Unicode.

Finally, some characters in a legacy encoding may map to a sequence of characters (such as a base character followed by combining characters) in Unicode. Similarly a combining sequence in Unicode may map to a single character byte sequence in a legacy encoding (this is a more common case).

Unicode in Java

The descriptions of legacy encodings above are very careful to use the term "character" with respect to mapping. There is a fundamental difference between a Unicode character and a Java char or Character object.

Java's concept of a character is related to the encoding scheme used by Java, so we'll need to examine that briefly. To understand what that means, we need to differentiate between Unicode characters, Unicode code points, and their physical representation in a character encoding (code units in a coded character sequence). The Unicode character set has three common encodings: UTF-32, UTF-16, and UTF-8.

The relationship in Unicode is: Unicode characters are the abstract values in the character set. For example, there is a character called LATIN CAPITAL LETTER E WITH CIRCUMFLEX which looks like this: Ê. Each character is assigned a code point, which can range from 0x0 to 0x10FFFF. For example, LATIN CAPITAL LETTER E WITH CIRCUMFLEX (Ê) is assigned the code point U+00CA. A character encoding then assigns a specific code unit sequence (a 'byte' is one kind of code unit) to each code point.In UTF-8 the byte (code unit) sequence for our example is 0xC3 0x8A.

Internally, Java encodes characters using the UTF-16 encoding. The char data type is a 16-bit unsigned integer value representing one UTF-16 code unit. Thus the Character and String classes are strongly tied to this representation of Unicode. In UTF-16, characters outside the Basic Multilingual Plane (that is, those with code points larger than 0xFFFF) are represented by a pair of surrogate code units. Thus the character at code point U+10000 is represented in Java by a sequence of two Java chars (0xD800 0xDC00)

This means that some of the legacy encoding-to-Unicode relationships outlined above in which one legacy character maps to one Unicode character may actually result in a sequence of more than one char (UTF-16 code units) in Java.

About Java NIO

Java's New I/O package was introduced with J2SE 1.4 and includes support for many interesting I/O capabilities, such as non-blocking I/O. In this paper we are concerned mostly with the support for installable encodings using a new Service Provider Interface or SPI called java.nio.charset.spi.CharsetProvider.

SPI classes allow other classes (of type java.nio.charset.Charset, in this case) to be installed as if they were built into the Java runtime environment. Special meta information is stored in the JAR file or classpath containing the classes so that the JVM can identify and install these classes when starting up or via a specific thread's ClassLoader. The classes can then be accessed as if they were native parts of the JVM.

In the case of Java NIO, the idea is that you write a Charset class that identifies the encoding you wish to support. This encoding is represented by a name (a String) and one or more aliases. Encoding from Unicode (characters) to the encoding (bytes) is done by writing a class that extends java.nio.charset.CharsetEncoder. Decoding from the bytes into Unicode (characters) is supported by writing a class that extends java.nio.charset.CharsetDecoder.

Once installed, you can access your custom encoding just by using its name (or an alias) wherever you would normally use the name of a built-in character encoding. For example, if my Charset had a name of "myEncoding", I could write this code:

try {
   InputStream is = new ByteArrayInputStream(myBytes);
   InputStreamReader isr = new InputStreamReader(is, "myEncoding");
   // read characters with your StreamReader
} catch (UnsupportedEncodingException e) {
   // if you've done it correctly, you won't get here
}

Counting Bytes: Considering the Problem

Now let's look at the problem in more detail.

There are many types of text file for which the standard Java encoders will meet all of the processing requirements. Files that are structured using XML, for example, can be freely converted to Unicode prior to processing, as can many delimited files, such as the familiar "comma separated values" format.

One class of files where this won't work, though, is the length delimited file. In length delimited files, records and fields are delimited by the offset into the file or record. What offset would that be? If you read the documentation for these file formats, they typically refer to "characters". But in reality most of these files are really specified in terms of offset in bytes.

Various important formats are length delimited. Some of the common applications include EDI formats and mainframe generated data files of all flavors. Typically the file formats were originally designed with an assumption of a single-byte character set (so that one byte is one code unit is one character in the file), but were later adapted for multibyte encodings.

In their simplest form, each record consists of a single line of text. The start of the record indicates its type and sometimes its length. Processors that know the file format can parse the fields out of the record based on byte offset into the record.

In some cases, the records are fixed length and byte offsets separate both fields and records. In other cases there may be a mixture of character delimited fields (or records) and length delimited ones in the same file.

In mainframe files in particular, you may find that each field has its own associated encoding. For example, COBOL copybook data types such as ZonedDecimal or PackedDecimal might be used for one field, while another field uses an IBM MBCS code page and still another an IBM DBCS (wide character set) code page.

To understand the scope of the problem, we need to consider the specific edge cases for byte-length delimited files and the problems associated with processing them in Java. As mentioned above, legacy encoded files are treated as byte arrays or streams in Java. One way to work with byte counted fields would be to work strictly in terms of bytes.

The problem, of course, is that most of the data being inserted into these fields and records consists of text and the processing often needs to examine the value in terms of characters too.

There are also many formats that mix byte counts and character delimiters. The character delimiters are true characters, not just bytes that happen to also be characters (as with single-byte encodings) and that includes the possibility of multibyte sequences (as awful as that idea sounds to single-byte focused developers).

Because of the structure of multibyte encodings, a Java CharsetDecoder must read from the beginning of the text buffer (record or file) and maintain state throughout the decoding process (and conversely, a CharsetEncoder must do the same when writing a byte buffer). This is an inefficient process, so Java's encoding mechanisms are designed to consume an entire buffer all at once when encoding or decoding. This means that it isn't possible to tell how many bytes were consumed or generated for a given character. (It is possible to determine how many bytes were consumed for the whole buffer, but this is insufficient for processing byte-counted fields or records.)

Consider a field whose length is 10 bytes long. If we have a value we wish to write to this field, how many characters can be successfully written without going over the limit?

In a single-byte encoding this is usually easy to predict: an array of 10 Java chars is 10 bytes. Wide encodings offer similar predictability: in a 16-bit wide encoding, a 10 byte field can accept 5 characters.

Simple multibyte encodings make things more complicated. Let's consider Shift-JIS again. If the value contains only single byte ASCII or kana characters, then the field can accept 10 characters. If the value contains only two-byte characters, such as kanji, then the field can accept 5 characters. But the field can also contain a mixture of single- and multi-byte characters. Here are a number of the possible combinations:

SSSSSSSSSS (10)
LTLTLTLTLT (5)
SLTLTLTLT? (5/6)
SSLTLTLTLT (5)
SSSLTLTLT? (6/7)
SSSSLTLTLT (7)
SSSSSLTLT? (7/8)
SSSSSSLTLT (8)
SSSSSSSSS? (9)
LTSLTSLTSS (7) // mixed in any combination

S = single byte character
L = lead byte (in a two-byte character)
T = trail byte (in a two-byte character)
? = pad byte

Depending on the characters in the value and the order of the single- and multibyte characters, as few as five and an many as 10 characters can be written to the field. In the illustration above, a "pad byte" represents a single byte character written to fill out the field when only a portion (the lead byte, in this case) of a multibyte sequence can be written to the field. Writing a portion of a character has several bad effects on the resulting file.

First, the data in the file contains a badly formed character. When reading the entire file as text, the characters following that lead byte will appear to be random junk because the lead byte will combine with nearly any other byte (including another lead or single byte in an adjacent field) to form a valid character. This makes the file hard to debug or work with.

Secondly, having partial characters propagates through the system. Even if the file is correctly processed later, there will be a replacement character or other bit of "lint" that appears in the data. The original character cannot be recovered and the validity of the data may be questioned. In Japanese this kind of garbage data is called "mojibake", a term that seems to convey the meaning in many languages.

Finally, in many encodings the state of the coder is affected by whatever appears in the byte stream. Random junk may leave the encoder in an unrecoverable state. For example, if a valid lead-byte is written to the end of a field, the coder may read the next byte (starting a new field, and thus an independent character) as a trailing byte and emit an entirely wrong character.

These issues are exacerbated with stateful encodings. Consider IBM Code Page 930, the IBM MBCS code page that uses 0x0E and 0x0F to switch between single-byte and wide EBCDIC encodings. Here are some of the possible 10 byte combinations:

SSSSSSSSSS (10)
iLTLTLTLTo (4)
SiLTLTLTo? (4)
SSiLTLTLTo (5)
SSSiLTLTo? (5)
SiLToSiLTo (4)
SSSSiLTLTo (6)
SSSSSiLTo? (6)
SSSSSSiLTo (7)
iLToSiLToS (4) // all mixed up

S = CP 290 (a single-byte encoding)
LT = CP 300 (a double-byte encoding)
i = Shift-In
o = Shift-Out
? = pad byte

Note: the paired DBCS bytes above are not actually "lead" and "trail" bytes in the way that we meant when talking about stateless encodings. They are shown using the letters L and T for clarity.

Since the state shifts consume bytes, the smallest maximum is now just four characters, instead of five, and there are many ways to construct that minimum. Note that most stateful encoding files are written so that each field returns to the default state (that is, the SI and SO characters are always paired in a field) so that the fields can be accessed randomly in a record. This avoids the need to read the entire record or file just to retrieve a specific field. In addition, many file formats will mix encodings in a file (mainframe files sometimes do this).

More complex encodings may have even more complex characteristics. For example, GB18030 is a stateless multibyte encoding, which uses sequences of one, two, or four bytes per character:

SSSSSSSSSS (10)
LTLTLTLTLT (5) // two-byte characters
LTTTLTTTLT (3) // 2 four-byte characters followed by a two-byte character
LTTTLTTT?? (2) // 2 four-byte characters followed by pad bytes
SLTTTSLTTT (4) // 2 single-byte and 2 four-byte characters mixed

And ISO-2022 uses a three byte escape sequence to switch between different encodings. Here is a similar profile of ISO-2022-JP:

SSSSSSSSSS (10)
SS^$JLT^(B (3)
S^$JLT^(B? (2)
^$JLTLT^(B (2)
^(Issss^(B (4)

^$J = ESCAPE $J (shift to JISX0208 mode)
^(B = ESCAPE (B (shift to ASCII mode)
^(I = ESCAPE (I (shift to JISX0201_KANA mode)
S = single byte ASCII byte
s = single byte JIS X0201 (kana) byte
LT = two-byte JIS X0208 sequence

Since legacy encodings do not necessarily use a fixed number of bytes per character or provide blocks of characters in Unicode that are neatly arranged in terms of byte count in the legacy encoding, scanning the (Unicode) character sequence for a value to determine how much of it to write to a fixed length field and how many bytes will be taken up by shifting and escape sequences becomes inordinately expensive.

Therefore our requirements for a converter included a number of things that regular Java converters cannot currently do:

The converter must also do all of the things regular converters can do while doing the above:

It turns out this can be done with a generalized design.

The open source project ICU, which is a project sponsored by IBM, provides internationalization support classes for the Java programming language. In fact, many of the classes built into J2SE were originally part of the forerunner to the ICU for Java (ICU4J) library. ICU provides about 800 mapping tables and converter code for the C/C++ version of the library (ICU4C). These same converters can be used in Java via the ICU4JNI library.

As the Java library name suggests, the converters in ICU4JNI are written in C for efficiency. To support a specific platform, the converters must be compiled using the platform specific C language compiler. Then JNI calls can be made from Java to the converter. This design isn't a pure Java design and it isolates the logic capable of counting bytes and determining field boundaries and such from the Java logic that controls it. It works well with the existing Java design, in which buffers are coded in a loop until exhausted, but it doesn't solve our problems with reading and writing fields with an explicit length limitation.

The data and encoder approach, however, are still useful. Borrowing the data file format (.ucm) and the original design for the C-based coders, we were able to create a design that supported all of our requirements with relative efficiency.

Relative Performance

What does "relative" efficiency mean? The coders we'll meet below are slower, usually by an order of magnitude, compared to their JVM native implementations.

For example, the UTF-8 implementation built into JVM 1.4.2 required between 20 and 50 milliseconds to decode a 64K buffer, while the webMethods implementation described below required about 200 milliseconds to do the same task in the same configuration. This may seem like a serious efficiency drop, but compared to what? The traditional encoder can't easily construct complex length-and-character delimited files and the Java code required to accomplish this task is seriously slower than the coder's efficiency loss.

EncodingJVM 1.4.2 TimewebMethods Time
UTF-840 ms190 ms
GB1803030 ms321 ms
Shift-JIS30 ms270 ms
ISO 8859-110 ms250 ms

Design and Implementation

Hooking into the JVM: the CharsetProvider

The first step in creating your own legacy encoding support is to write a class that extends java.nio.charset.spi.CharsetProvider. This is the class that the service provider interface uses to retrieve Charset objects later. This class only needs to implement two methods. The method charsets() returns an Iterator over the list of encoding names supplied by the provider. The method charsetForName(#string) is responsible for instantiating a Charset object (if one exists) for the given encoding name. Here is one implementation:

    public Charset charsetForName(String charsetName) {
      String csn = charsetName.toUpperCase();
      if (!csn.startsWith(PREFIX)) csn = PREFIX + csn;
      // mCharsets is a Hashtable of installed mapping tables
      if (mCharsets.containsKey(csn)
         return wmCharset.forName(csn);
    }

In webMethods's provider's constructor we scan the classpath for mapping tables with a particular extension in a particular Java package (data.mappings.*). This allows encoding tables to be added to the classpath independently of the provider's code, allowing customers to create their own encodings without having access to the code in the charset provider or any of the related classes. Note the CharsetProvider needs a copy of the list of encoding names (in order to implement the iterator in the charsets method) or obtain it from its child Charset classes. In our implementation the provider scans the classpath for mapping tables, so it builds the list shown in its constructor.

To make the JVM actually load the provider on startup, you have to add a file called java.nio.charset.spi.CharsetProvider to the META-INF/services directory of the JAR or path containing the provider class. This file contains a single line of text identifying the classname of our provider. In this case:

com.wm.g11n.io.charset.wmCharsetProvider

Accessing Your Encoding: the Charset Class

The next step is to provide a class that extends java.nio.charset.Charset. This class is responsible for creating the actual CharsetEncoder and CharsetDecoder objects used in the coding process and is the class actually instantiated by the provider (in the example code above, this is done by calling the static method wmCharset.forName(#enc)).

Charset has a number of interesting utility methods. The main ones of interest to us are the ones that create encoders (newEncoder()) and decoders (newDecoder()). Most of the utility methods in Charset call methods on the encoder or decoder classes to check for the ability to encode certain sequences and so on. Generally there is no need to override any of the utility methods on Charset in your own implementation, unless the Charset can provide an efficiency boost by hardcoding attributes of the encoding directly. Since wmCharset is a general-purpose class that supports many encodings, we didn't implement any of the utility methods.

The Encoding Machinery: CharsetEncoder and CharsetDecoder

The CharsetEncoder and CharsetDecoder classes are where the real work of converting to or from Unicode takes place. In addition to implementing the mapping, each of these classes has methods that provide information to outside classes about the structure of the encoding. The methods are used to allocate buffers and perform other housekeeping. Read the JavaDoc carefully when writing your own implementations.

For example, internally java.io.OutputStreamWriter allocates java.nio.ByteBuffer objects to store output based on the maximum number of bytes required to write a single character. This value has to be correct for all possible values in all possible encoding states or you'll get BufferOverflowExceptions. The OutputStreamWriter calculates this by calling the method CharsetEncoder.maxBytesPerChar(). These values can be tricky: it's the maximum number of bytes required to write an arbitrary character in a sequence. In a stateful encoding a shift sequence may be required to write the next arbitrary character, so the length must include the length of the longest shift sequence in the encoding. Here are some examples of the values for different encodings:

Encodinglongest character in bytesmaxBytesPerChar
UTF-843*
GB1803044
Shift-JIS22
IBM-139923
ISO-2022-JP36

* In UTF-8 each of the supplemental characters (that is those greater than U+FFFF) requires four bytes to encode. However, recall that Java uses UTF-16 internally. The "char" in maxBytesPerChar is a UTF-16 code unit. A supplemental character (represented in Java by a surrogate pair) requires two UTF-16 code units and thus the four bytes for the character are split over two chars. Thus the longest value for a single char is 3.

If you were to write a separate provider/charset/encoder/decoder set for each encoding you wished to support, then implementing the CharsetEncoder and CharsetDecoder is a simple matter.

CharsetEncoders must implement a short list of methods. The main mapping logic is in the encodeLoop method. This is where you can put a mapping table or logic to implement the mapping of Unicode to bytes. The encoder reads from the CharBuffer until it is exhausted or the output ByteBuffer is full. If the CharBuffer runs out first the return value is a CoderResult.UNDERFLOW. If the ByteBuffer runs out first the return value is a CoderResult.OVERFLOW.

CharsetDecoders work the same way, but in the opposite direction (from bytes to Unicode characters) using a method called decodeLoop.

Handling Errors During Coding Operations

Both encoding directions support programmatic control over two types of error condition that can occur in conversions.

The first type of error only occurs in decoders: it is when the input is malformed for the encoding. For example, the five- and six-byte forms of UTF-8 are illegal. If the lead-byte is greater than 0xF4, the character is malformed. Similarly, lead-bytes less than 0xC2 represent non-shortest forms of UTF-8 and thus the byte sequence 0xC0 0x80, while it matches the UTF-8 algorithm (it maps to the Unicode character U+0000), is illegal.

CharsetDecoders provide a method onMalformedInput that can be used to set how to handle this error. The class CodingErrorAction defines the three options, which are REPLACE, IGNORE, and REPORT.

CodingErrorAction.REPORT causes the coder to return a CoderResult object of type malformedForLength(#int). Normally this result will cause receiving code (such as in InputStreamReader or the String constructor) to throw a MalformedInputException. This is the default behavior for both encoders and decoders.

CodingErrorAction.IGNORE just skips the malformed byte sequence and tries to recover with the next byte.

CodingErrorAction.REPLACE replaces the malformed byte sequence with the coder's replacement sequence. This sequence can be set programmatically by providing a replacement sequence to the replaceWith method (encoders take a byte[], decoders take a String). Most encodings have a specific sequence that is used built in. For example, ISO-8859-1 uses 0x3F (question mark) by default. Unicode (and thus most decoders) uses the code point U+FFFD as the replacement character. Encoders need to correctly implement the canEncode methods and should really also implement the isLegalReplacement method too to ensure that these features work correctly.

The second type of error is when there isn't a mapping between a particular character in a legacy encoding and Unicode (in either direction). For example, the Unicode character U+20AC (EURO SIGN) doesn't have a mapping in the ISO-8859-1 encoding. If you were to write a CharsetEncoder for this encoding, you would have the same options of reporting, ignoring, or replacing the character with the replacement sequence for the encoding. Unmappable sequences can happen in both directions, although most decoders map characters that are not in Unicode to the Private Use Area. (Yes, there are some characters that are not in Unicode yet.)

Coders in both directions must be written to handle each type of error condition correctly. In addition, coder writers should make the extra effort to validate input and output properly. For example, when creating a GB18030 coder, one might be tempted to return a CoderResultof type malformedForLength(#int) for the bytes 0x80 and 0xFF (currently undefined by GB18030), since they don't match the bit pattern for the encoding. These byte sequences should return a CoderResult of type unmapped instead.

Another consideration when processing malformed input is how to report the length of the malformed sequence. For example, the byte pattern for a GB18030 four-byte character is that the first and third bytes fall into the range 0x81 through 0xFE and the second and fourth bytes fall into the range 0x30 through 0x39. Thus the byte sequence 0x81308149 is malformed in the last byte. But 0x8149 also forms a valid GB18030 two-byte sequence. If the coder is set to CodingErrorAction.REPLACE, should the output be one replacement character? Or two plus a valid character? Or one replacement plus the single-byte character 0x49? These results depend on the encoding's rules and structure, as well as common sense. It is important to note that failing to "re-train" properly after a malformed sequence may result in all of the data after the malformed sequence being incorrect.

Data Structures for Static Mappings

ICU supports simple (single and multibyte) and EBCDIC stateful encodings via a state machine coupled with a multistage table (or "trie") data structure. Trie structures are useful for sparsely populated tables.

For example, ISO-8859-2 contains 255 characters. Unicode (potentially) contains 0x10FFFF characters. A mapping table from Unicode to ISO-8859-2 shouldn't require every possible mapping to be filled in, since most of the mappings are to the "not mapped" state. Making a table with 1,114,111 entries for an encoding that only has 255 mappings is a waste of memory (besides being slow to load).

The table for an ICU encoder uses a three-stage trie. The first stage contains the offset into the second stage table based on the highest eleven bits in the Unicode scalar value (expressed as a 21 bit integer). This divides Unicode up into 2047 segments. The second stage contains offsets into the third and final stage based on bits 4 through 9 in the scalar value (or 63 segments per stage 1 value). The third stage contains the actual byte sequence, which is found by adding the maximum number of bytes per character times the lowest three bits in the scalar value. Unmapped values on any level can be detected to increase efficiency.

The table for a ICU decoder is not multistage. It is a linear table that works by having an entry for every byte sequence in each encoding "state". If the first byte is directly mapped to a Unicode character sequence, then the character is returned, along with a status code. If the byte is not the last in the byte sequence, then a status code is returned and the next byte is read. The combination of the two bytes is read from the table (and this is repeated until a match is found).

Thus when decoding a multibyte sequence, if the first byte is 0x81 and state is 1 then a second byte of 0xe2 results in the lookup offset 0x81e2. A state of 2 on the first byte produces 0x102e2 ((0x81 * 2 * 256) + 0xe2). The state table may, as a result, have large gaps in it and be correspondingly larger in memory.

                try {
                    // this code moves the bytes up to form a scalar value.

                    code = mStateTable[mCurrentState * 256 + currentByte];
                } catch (ArrayIndexOutOfBoundsException aix) {
                    // logic for handling a malformed or unmapped sequence goes here
                }
                // a bit munching method to find the state machine state
                mCurrentState = nextState(code);
                if (isTerminal(code)) {
                   // state machine logic once a table entry has
                   // been located goes in here, essentially a giant switch statement
                }

Non-stateful encodings have only one table of byte sequences. The byte values (remember lead and trail bytes?) allow the character mapping to be determined. Simple stateful encodings also use a single table. Shift sequences have a special "state" in the state machine that increases the offset in the table.

Complex stateful encodings use separate data tables for each state. ICU historically didn't handle these encodings. Our solution turns out to be similar to ICU's: maintaining separate state tables and using logic to switch between state machines as necessary. This turns out to be generalizable.

To enable generic coders to work with data tables, we created a class called com.wm.g11n.io.charset.CharsetData, which reads the mapping tables when the Charset instantiates either of the coders and converts these data tables to in-memory values. Separate classes were written to encapsulate the trie structures in an encoder (EncodeMap) and state tables used by decoders (DecodeStateTable). Each of these classes encapsulates the specialized bit-munching used by ICU's existing state tables and mapping scheme for use with our Java port of their C code.

Data File Formats: The UCM File

Since our goal was to make a general purpose coder (that is, the same Java code for all encodings), the ICU data file format is important. ICU uses a format call the "UCM" file to contain the mapping information necessary for the trie table and decoder state machine.

We might have chosen a format based on Unicode Technical Standard (defined by UTS #22). However, because ICU is widely used and has a huge number of encodings available, we chose to adopt its format with only a small addition. Here are some fragments of a the mapping file for the JAVA implementation of the Shift-JIS encoding:

<code_set_name>               "java-SJIS-1.3_P"
<mb_cur_max>                  2
<mb_cur_min>                  1
<subchar>                     \x3F
<uconv_class>                 "MBCS"

<icu:state>                   0-7f, 81-9f:1, a0-df, e0-fc:1
<icu:state>                   40-7e, 80-fc

#
CHARMAP
#______________
<U0000> \x00 |0
<U0001> \x01 |0
<U0002> \x02 |0
<U0003> \x03 |0
<U0004> \x04 |0
<U0005> \x05 |0
<U0006> \x06 |0
<U0007> \x07 |0
<U0008> \x08 |0
<U0009> \x09 |0
<U000A> \x0A |0
<U000B> \x0B |0
<U000C> \x0C |0
<U000D> \x0D |0
# ... and further along in the file we find multibtye sequences:
<U8F62> \xE7\x80 |0
<U8F63> \xE7\x81 |0
<U8F64> \xE7\x82 |0
<U8F9B> \x90\x68 |0
<U8F9C> \xE7\x83 |0
<U8F9E> \x8E\xAB |0

Shift-JIS is a simple multibyte encoding. You can see the lead byte sequences in the first <icu:state> record and the trail bytes in the second record. Following this is the explicit mapping of each Unicode scalar value to a byte sequence.

We wrote a perl program to process these data files into a binary format (generating and packing the trie structure in advance for the encoder).

For two encodings, ICU uses a different design. This is because these encodings are very large and have long ranges of characters that map sequentially to Unicode. One of these encodings is actually a Unicode encoding: UTF-8. The other is GB18030 (a Chinese national standard).

For UTF-8, building a mapping table is not necessary. There is a strict algorithmic mapping between Java's UTF-16 and UTF-8. Writing code to support this encoding makes the most sense and, since bit-munching is very fast, it is the fastest coder of the bunch.

For GB18030, many of the characters require a mapping table, since they are mapped in the same manner that GB2312 mapped bytes to Unicode—in a seemingly random manner. Most of the byte sequences in GB18030, though, are mapped directly to Unicode using an algorithm. This means that a hybrid approach is called for: if a character isn't in the lookup table, then an algorithmic mapping specific to that encoding can be invoked. If there still isn't a match then an error can be emitted. This makes the GB18030 coders the slowest in our collection.

Most of the Java classes for dealing with encoding conversions try to convert the full input buffer all at once. The CharsetEncoder and CharsetDecoder classes provide a method, (encodeLoop and decodeLoop respectively) that is expected to process input until either the output buffer is exhausted or the input runs out. If writing a byte counted field, therefore, one might imagine that this code would work:

ByteBuffer aField = ByteBuffer.allocate(10);
CharBuffer valueToWrite = CharBuffer.wrap("Here is my string to write");
CharsetEncoder coder = Charset.forName("myCharset").newEncoder();
CoderResult result = coder.encode(value, aField, true); //force a buffer flush with true
// check the result

This code does work, from the point of view that no more than 10 bytes are written to the buffer. But it fails to meet the needs of a real application because the buffer ends with the last byte that can be crammed into it—it may even end mid-character generating "mojibake"! Padding bytes are not written to the field. Closing shift states may not be written.

Consider a typical multibyte encoding, in this case Shift-JIS:

            wmCharsetEncoder coder = (wmCharsetEncoder)Charset.forName("WM_JAVA-SJIS-1.3_P").newEncoder();
            CharBuffer source = CharBuffer.wrap("abc\u8d66\u8d66\u8d66\u8d66");
            ByteBuffer target = ByteBuffer.allocate(10);
            CoderResult result;

            result = coder.encode(source, target, true);
            System.out.println(result);

            target.flip();
            byte[] out = target.array();
            for (int x=0; x<out.length; x++) {
                System.out.print(Integer.toHexString(out[x] &0xFF)+" ");
            }
            System.out.println("");

            // we do this to ensure a clean coder
            coder.reset();
            source.rewind();

            // encodeToBytes is our proprietary extension
            result = coder.encodeToBytes(source, target, 0, 10, true);
            target.flip();
            byte[] out = target.array();
            for (int x=0; x<out.length; x++) {
                System.out.print(Integer.toHexString(out[x]&0xFF)+" ");
            }
            System.out.println("");          

Produces this output:

61 62 63 8e cd 8e cd 8e cd 8e
 S  S  S  L  T  L  T  L  T  L // whoops: half a character
61 62 63 8e cd 8e cd 8e cd 20
 S  S  S  L  T  L  T  L  T  P

In the example, the last "8e" represents a lead byte in the character 0x8ECD! The CoderResult object returned is OVERFLOW, but we now have a bad byte in the output and one left in the coder's output queue (which is cleared by the call to coder.reset()). In the implementation of wmCharsetEncoder, we provide the API extension shown, which can write the value to a byte buffer and force the resulting output to match the pattern expected in an EDI file.

There are additional complexities to consider with a simple stateful encoding such as IBM code page1399:

wmCharsetEncoder coder = (wmCharsetEncoder)Charset.forName("WM_IBM-1399_P100-1999").newEncoder();
coder.setPadByte(' ');
CharBuffer source = CharBuffer.wrap("abcd\u8d66\u8d66\u8d66");
ByteBuffer target = ByteBuffer.allocate(10);

// here is the default method call
CoderResult result = coder.encode(source, target, true);
target.flip();
byte[] out = target.array();
for (int x=0; x<out.length; x++) {
    System.out.print(Integer.toHexString(out[x] &0xFF)+" ");
}
System.out.println("");

// we do this to ensure a clean coder
coder.reset();
source.rewind();

result = coder.encodeToBytes(source, target, 0, 10, true);
target.flip();
byte[] out = target.array();
for (int x=0; x<out.length; x++) {
    System.out.print(Integer.toHexString(out[x]&0xFF)+" ");
}
System.out.println("");         

Produces this output:

81 82 83 84 e 51 51 51 51 51
 S  S  S  S i  L  T  L  T  L  // ran out of buffer mid-character
81 82 83 84 e 51 51 51 51 f
 S  S  S  S i  L  T  L  T o

Note that the state reset byte of 0x0F is written. If the buffer were one byte longer, a pad byte would be written. The regular coding loop can't do either because the shift back to the single byte code page takes place when a character with a different state is seen—and the next character in the buffer hasn't been read yet (and is multibyte in any case). The "state shift penalty" is dependent on the encoding used (there isn't one for a single byte encoding, for example).

A Few Flies in the Ointment

Complex stateful encodings make ICU's design harder to work with. Using the state table design works well as long as the state table is static. ISO-2022 uses multiple encodings with a three byte shift sequence to separate runs in a particular encoding. The ICU state machine actually has to reset and load a different table for each state change. Doing this with static tables, especially big ones, was not designed into ICU's code when we started this project. And coupled with support for algorithmic mappings it generated a lot of spaghetti code in our first implementation.

Implementing a Coder Loop

At this point we have the basics for making encoders and decoders. The various state tables or trie structures are implemented. All we have to do is write our coder classes.

ICU places all of their code directly into the coder loop. But Java object oriented code frees us from having to support every type of encoding from the same (very complex) block of code. In our implementation we wrote generic encode/decode loops, as well as our own utility methods in the classes wmCharsetEncoder and wmCharsetDecoder. The focus in these classes was implementing the methods and accessors for the byte count.

To support different encoding types and allow customers to write their own encodings with specialized needs, we created an interface called CharsetHelper. This interface defines the basic methods that the generic coder methods call to perform the actual conversion between bytes and characters.

By creating an interface, we allow different encodings to implement entirely different mechanisms for performing the conversion. For example, the SimpleCharsetHelper implements the state table lookup mechanism. This class is extended by EbcdicStatefulCharsetHelper since that collection of encodings uses exactly the same mechanism to perform encodings. The EBCDIC stateful coder merely has to implement handling of the shift-in and shift-out sequences. A different encoding like KEIS, which uses the same structure but different shift bytes, can be implemented in parallel. The Utf8CharsetHelper, by contrast, is entirely algorithmic.

The differences in mechanism are hidden inside the helper interface and the generic loop doesn't have to deal with them.

The CharsetHelper was implemented so that each method processes just one unit of input at a time. For example, the decoderDecode method reads just enough bytes to produce (at least) one Java char as output. It puts the characters it produces on the output pipeline and returns the number of bytes consumed. This allows the genericDecode loop in wmCharsetDecoder to determine how many bytes a particular character consumed. (Note that some decoding sequences produce more than one character, so a subsequent call to decoderDecode may return 0 as the number of bytes.)

The equivalent method (logically enough, it is called encoderEncode) processes one char, pushing the byte sequence to the output and returning the number of bytes written. This may include a shift sequence, of course.

The helper class also provides a number of utility methods. Each implementation of the helper has to be able to save its current state. This allows methods like encoderGetOutputSize to be called to determine how many bytes writing a particular char will require without affecting the current state of the encoder.

The helper class is associated with the specific encoding by extending the .ucm file format slightly. In most cases the helper class name can be determined from the encoding type, but specialized encodings and customer written helpers can put a special directive into the .ucm file. The helper class is instantiated by our class that extends Charset when creating the encoder or decoder:

            CharsetData data = CharsetData.load(m_MappingFile);
            String helperClass = data.getHelperClassName();
            if (helperClass == null || helperClass.length() ==0 || helperClass.equalsIgnoreCase("com.wm.g11n.io.charset.DefaultCharsetHelper")) {
                if(data.getConverterClass().equalsIgnoreCase("EBCDIC_STATEFUL")) {
                    helperClass = "com.wm.g11n.io.charset.CharsetHelperEbcdicStateful";
                } else {
                    helperClass = "com.wm.g11n.io.charset.CharsetHelperSimple";
                }
            }
            CharsetHelper helper = (CharsetHelper)Class.forName(helperClass).newInstance();
            helper.init(data);

One more small extension is required before we can implement byte-counting in both directions. When decoding characters from bytes there is a potential problem. The coder has to return any characters decodes, plus a CoderResult object (to indicate if we ran out of input or output), plus the byte count. CoderResult is declared as final, so we can't extend it. The problem here is: we don't always want to return a char in decoderDecode because we need to find the field boundary and it can fall between two shift sequences.

You may recall that length-delimited fields generally pair shift states so that the file may be randomly accessed. Consider two 10 byte fields next to each other in stateful multibyte record. The following example use the IBM stateful encoding IBM-1399_P100-1999:

iLTLTLTLToiLTLTLTLTo
0x0E 0xC7 0xCD 0xC7 0xCD 0xC7 0xCD 0xC7 0xCD 0x0F
0x0E 0xC7 0xCD 0xC7 0xCD 0xC7 0xCD 0xC7 0xCD 0x0F

The first character (0xC7CD) decodes to U+61B8 in three bytes. Then there are three more of U+61B8, for a total of nine bytes. The coder has no reason to read the closing shift (0x0F) waiting in the buffer until we ask for the next character. That read will product U+61B8 and consume four bytes of input (the closing shift, the opening shift for the next field, and the two-byte sequence for the character).

In other words, the "expense" of the shift state is always on the next successfully decoded character in the buffer (or a CoderResult.UNDERFLOW at the end of the input buffer, which we haven't reached yet). To circumvent shifting the expense of the shift sequence to the next character, we need a way to stop the decode process without returning a CoderResult and while still returning a byte count. We chose a slightly dubious design: we created an Exception (CharsetStateChangeNotification) that a CharsetHelper can optionally throw. This allows byte-counting code to catch the byte count before the next character is placed in the buffer.

Putting the Results Together

Finally we are ready to write specific methods in our CharsetEncoder and CharsetDecoder to perform byte counting.

For CharsetEncoder, we implemented a method for writing length delimited fields to an output buffer. Here's the method signature:

public java.nio.charset.CoderResult encodeToBytes(java.nio.CharBuffer in,
                                                  java.nio.ByteBuffer out,
                                                  int start,
                                                  int length,
                                                  boolean endOfInput)

This design is set up to write entire byte sequences with a fixed length to a ByteBuffer. If the input isn't long enough, a padding sequence is added to the end until length bytes have been written. The boolean endOfInput controls whether the encoder writes closing shift sequences (if required).

We chose this design over providing a character-by-character encoding method because the only case that can't be handled by the regular encoding loop is the one in which the output buffer length has a fixed size. Variable length fields with a maximum size can be handled by having the encoding loop check for available bytes in the output buffer.

For CharsetDecoder we implemented a method for reading the input buffer and decoding one char value at a time.

public char readChar(ByteBuffer in,
                     boolean stopOnStateChange)
              throws java.nio.charset.CharacterCodingException,
                     CharsetStateChangeNotification

We chose this design in part because we already had logic for breaking fields out of records based on a count. The readChar method coupled with a method for obtaining the number of bytes read could be inserted directly into existing code.

We also implemented a readField method that is the mirror of encodeToBytes in the Encoder:

    public CharBuffer readField(ByteBuffer in, int start,
       int length) throws CharacterCodingException

This method can be used to extract fields from a fixed length record that has been retrieved into a ByteBuffer. It's important to note that these fields must be standalone fields (that this method attempts to perform random access on the field).

Of additional interest when implementing the coders are the various bits of code for handling replacements and padding. In the CharsetEncoder class, if the CodingErrorAction for malformed or unmapped sequences is set to REPLACE, the coder will write the encoding's replacement byte sequence. Usually the default value is not changed, but it can be set to any byte sequence that is legal for the encoding by providing the bytes in a byte[].

Replacement and Padding Fun

The method for setting the replacement value checks to see if this is the case by calling isLegalReplacement. Now note that an encoder doesn't know if a byte sequence is valid for the encoding, since its tables map in the other direction. By default, isLegalReplacement instantiates a CharsetDecoder for the encoding and uses it to decode the bytes to test them.

This method has a basic design problem: in a stateful encoding, the replacement byte sequence is in one or another state. A shift sequence may be necessary in order to write the replacement sequence out and still have it be a valid sequence. For example, if we set the replacement sequence for the stateful encoding Cp930 to the space character (0x40 in EBCDIC) and the encoder is in the multibyte state when a character has to be replaced, a Shift-In has to be written before the 0x40 gets output. The code in the encodeLoop has to handle this state shift.

Interestingly, the test in isLegalReplacement has a built in assumption that the replacement bytes are in the default encoder state, if you think about it. In the following implementation, the bytes are at the start of a decode sequence, so no state bytes can appear before them. This limits what a valid replacement sequence can be to default state byte sequences:

        public boolean isLegalReplacement(byte[] repl) {
            try {
                Charset charset = this.charset();
                CharsetDecoder decoder = charset.newDecoder();
                decoder.onMalformedInput(CodingErrorAction.REPORT);
                decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
                decoder.decode(ByteBuffer.wrap(repl));
                // only gets here if no exception thrown
                return true;
            } catch (Exception e) {
                return false;
            }
        }         

In our implementation, our CharsetHelper actually writes the replacement sequence. This allows the helper's author to change the encoding state appropriately. It also allows for specialized behavior (such as writing same-state replacement sequences instead of the default sequence, if that is appropriate for the encoding).

In adding support for padding fields out to a specific length we adopted a different design. The method for setting a specific value is called setPadChar and takes a char argument. The padding sequence is converted to bytes internally. This allows the coder to check the ability to encode the byte sequence directly (by performing the encoding) rather than by instantiating a decoder. It does limit padding sequences to the byte sequence associated with a single Java char (and thus to a BMP character).

Using the Coders

Generally the coders described above work much like their built-in Java counterparts when used in the same manner. Using extended APIs we can read and write byte-counted fields with ease.

Conclusion

Writing support for byte counted fields in Java, much like supporting legacy encodings in the "old days", used to be a difficult task. Using Java NIO it is possible to build robust support for a wide range of encodings.

Glossary

Code Page
IBM and later Microsoft use this term to refer to a specific character encoding. Code pages have numbers, such as Microsoft's code page 1252 (the default Western European code page in Windows) or IBM's code page 037 (the EBCDIC equivalent of US-ASCII-7).
Legacy Encoding
Any non-Unicode encoding. Legacy encoding is the slightly derogatory way that Unicode supporters (self-styled Unicadetti) refer to "all encodings that are not Unicode related".
Lead Byte
The first byte in a multibyte sequence. Usually the bit pattern in a lead byte is used to distinguish it from single bytes.
Trail Byte
Non-lead bytes in a multibyte sequence. These are bytes that follow a lead byte (or another trail byte).
Basic Multilingual Plane (BMP)
The first "plane" of Unicode containing the code points 0->0xFFFF. This section of Unicode contains the majority of characters in common use in modern writing systems. Archaic and less-common characters are contained in the supplemental planes. BMP code points have the advantage of being represented by a single UTF-16 code unit.
Surrogate, Surrogate Pair
Surrogates are a range of Unicode code points reserved to allow the UTF-16 encoding to address supplemental characters (those outside the Basic Multilingual Plane). These code points are used in pairs: first a "low" surrogate from the range U+D800->U+DBFF and then a "high" surrogate from the range U+DC00->U+DFFF. The surrogate code points are officially not assigned Unicode character values. They are reserved code points and this can be a confusing distinction.

27th Internationalization and Unicode Conference

Berlin, Germany, April 2005