Page Encoding

Test #1: Check that the page is actually encoded as UTF-8

Steps:

1. Use the W3C Validator service and check the charset that it detects

2. Right click the page and check what the browser detects. See Figure 1.
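The two manual checks above can be complemented by decoding the raw bytes of a saved copy of the page strictly. What follows is a minimal sketch in Java, not a definitive tool: the names Utf8Check and isValidUtf8 are hypothetical, and the check only shows that the bytes are well-formed UTF-8. It does not show that the declared charset matches, and a pure ASCII page will also pass.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true only if every byte sequence is well-formed UTF-8.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // a-grave, a-acute, sharp s, written as escapes so the source file's
        // own encoding cannot interfere with the test
        String sample = "\u00E0\u00E1\u00DF";
        System.out.println(isValidUtf8(sample.getBytes(StandardCharsets.UTF_8)));      // true
        System.out.println(isValidUtf8(sample.getBytes(StandardCharsets.ISO_8859_1))); // false
    }
}
```

In practice the byte array would come from the saved page (for example via java.nio.file.Files.readAllBytes) rather than from a string literal.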

Symptoms:

Page contains question marks where it should have non-ASCII characters. This generally happens because the page has been converted to a non-Unicode encoding from a Unicode encoding. Look for JSP pages that have a pageEncoding directive but not a contentType directive. Also look for taglibs that call setLocale or use the fmt: library of tags.

Page contains junk (mojibake, 文字化け) where non-ASCII characters should appear in the page. This is generally caused by the page being correctly encoded, but missing the contentType directive and/or the META tag. Check if changing the page encoding in the browser manually will correct this. (See Figure 2: Setting the encoding manually).

If changing the encoding manually doesn't work, the problem can have several causes. Check the source of the damaged characters to figure out where the encoding mismatch happened.

Examples:

Input = "文字化け€àáßèëœ"

Conversion from Unicode to a legacy encoding. Proper conversions of this type result in individual characters being converted to individual question marks. The question mark is a substitution character. Basically it means: this character has no mapping in the target encoding.

Latin-1 (ISO8859-1) ?????àáßèë?

ISO8859-15 (Euro support) ????€àáßèëœ

Shift-JIS (a Japanese character set) 文字化け???????
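The substitution behavior shown in these examples can be reproduced with a short Java sketch. The class and method names (SubstitutionDemo, roundTrip) are hypothetical. The sketch relies on String.getBytes(Charset), which silently replaces unmappable characters with the charset's replacement bytes. Shift_JIS is not guaranteed to be present on every Java runtime, so that call is guarded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SubstitutionDemo {
    // Encode to the legacy charset, then decode with the same charset.
    // Characters with no mapping come back as the substitution character.
    static String roundTrip(String input, Charset legacy) {
        return new String(input.getBytes(legacy), legacy);
    }

    public static void main(String[] args) {
        // The Input string from the examples, written with Unicode escapes
        String input = "\u6587\u5B57\u5316\u3051\u20AC\u00E0\u00E1\u00DF\u00E8\u00EB\u0153";

        // Four kanji/kana, the euro sign and the oe ligature have no Latin-1 mapping
        System.out.println(roundTrip(input, StandardCharsets.ISO_8859_1));

        // Assumption: this charset is optional, so check before using it
        if (Charset.isSupported("Shift_JIS")) {
            System.out.println(roundTrip(input, Charset.forName("Shift_JIS")));
        }
    }
}
```

What the console actually displays also depends on the console's own encoding, so comparing the returned strings in a test is more reliable than reading the printed output.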

Mojibake: literally "character transformation," in other words garbled text. This is what happens when the wrong encoding is applied to a conversion. That is, you have bytes in one encoding, but use a different encoding to convert them to a Java (Unicode) string. Most of the time you won't get an error or conversion mapping problem when doing this.

Mojibake reading UTF-8 as Latin-1 æ–‡å­—åŒ–ã‘â‚¬Ã Ã¡ÃŸÃ¨Ã«Å“

Mojibake reading SJIS as Latin-1 •¶Žš‰»‚¯???????

It goes both ways: reading ISO8859-15 as SJIS ????､珮ﾟ齏ｽ

It goes both ways: reading UTF-8 as SJIS 譁�蟄怜喧縺鯛ぎテ�テ。テ淌ィテォナ�
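The mismatch behind these examples can be sketched in Java as encoding with one charset and decoding with another. The names MojibakeDemo and misread are hypothetical. Note one assumption that affects what you see: Java's ISO-8859-1 maps bytes 0x80 to 0x9F to invisible control characters, whereas browsers typically render them as Windows-1252 punctuation, so printed output may not look like the examples above even though the underlying damage is the same.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Produce bytes in the 'actual' encoding, then decode them as 'assumed'.
    // No exception is thrown: the wrong decoder simply yields wrong characters.
    static String misread(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        // The Input string from the examples, written with Unicode escapes
        String input = "\u6587\u5B57\u5316\u3051\u20AC\u00E0\u00E1\u00DF\u00E8\u00EB\u0153";

        String garbled = misread(input, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        // Each UTF-8 byte became its own character
        System.out.println(input.length() + " chars became " + garbled.length()); // 11 chars became 27

        // ISO-8859-1 maps every byte to a character, so this damage is reversible
        String repaired = misread(garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(repaired.equals(input)); // true
    }
}
```

The change in character count (11 to 27) is the kind of signal that tracking your test characters makes visible. Reversal only works here because no bytes were lost; once a conversion has substituted question marks, the original text cannot be recovered.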

The last example shows how it can be hard to tell. Most of the characters are at least Japanese-looking. If you don't speak Japanese, how can you tell? You should keep track of the characters being used for testing (their count and appearance) so that you can be sure that what you have isn't mojibake.

It is worth noting that mojibake sometimes leaves telltale signs in the page: the Unicode replacement character (U+FFFD), or runs of characters with a lot of punctuation. But don't rely on those signs. Use known sets of characters so that you can confirm that what your test received is exactly what you sent.
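One way to apply this advice in an automated test is to compare what came back against the known string, and only then look at secondary clues such as U+FFFD or a changed character count. This is a minimal sketch under that assumption; KnownTextCheck and describeProblem are hypothetical names, and the messages are illustrative only.

```java
class KnownTextCheck {
    // Returns null when the received text matches the known test string,
    // otherwise a short description of the first difference found.
    static String describeProblem(String expected, String received) {
        if (received.equals(expected)) {
            return null;
        }
        if (received.indexOf('\uFFFD') >= 0) {
            return "contains the replacement character";
        }
        // Count code points rather than chars so supplementary characters count once
        int sent = expected.codePointCount(0, expected.length());
        int got = received.codePointCount(0, received.length());
        if (sent != got) {
            return "character count changed: " + sent + " -> " + got;
        }
        return "same count but different characters";
    }

    public static void main(String[] args) {
        // The Input string from the examples, written with Unicode escapes
        String sent = "\u6587\u5B57\u5316\u3051\u20AC\u00E0\u00E1\u00DF\u00E8\u00EB\u0153";
        // Simulated result of a lossy Latin-1 conversion
        String received = "?????\u00E0\u00E1\u00DF\u00E8\u00EB?";
        System.out.println(describeProblem(sent, sent));     // null
        System.out.println(describeProblem(sent, received)); // same count but different characters
    }
}
```

The exact comparison comes first on purpose: the other checks only help explain a failure, they never prove that the text is intact.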
