Learn To Type Japanese and Other Languages

A Brief Guide to Configuring Systems to Display Non-ASCII Text and Let You Type In Other Languages

Table of Contents

Introduction

Developing and testing internationalized software sometimes seems like a daunting task. Many developers and quality engineers view it as "something extra" that must be added to the development or test cycles and thus something to be avoided.

There is no reason to be afraid of basic internationalization testing. In most cases it is quite easy to configure systems and produce data that exercises the basic international use cases without adding a bit to the overhead of performing testing. In fact, internationalized testing often produces a better test of the overall functionality of the system, since non-ASCII data and non-English configurations expose more of the "edge cases" and erroneous assumptions in software design.

By following the instructions in this document, every developer and quality engineer can introduce non-ASCII data and non-English configurations into the regular, day-to-day environment without causing disruption or confusion--or a change in how developers or QA engineers do their jobs.

Configuring Your System

Display Problems

The most common problem users encounter is with displaying non-ASCII data. If you cannot see the data, it is difficult to determine if the characters are correctly handled or to test different scenarios.

There are just four things that can happen to non-ASCII data on the display. These are:

  1. Correct display using the right characters: 日本語
  2. "Tofu". These are hollow white boxes (hence the nickname 'tofu': the characters look like little bricks of the stuff) or black squares, one per character. The software knows what the character is, but doesn’t have a picture to show you. In most cases you can correct this by installing a font into your operating environment and JRE. (It can also be a bug: if there are hardcoded font names in the software and the names are not "logical" fonts, then installing a font won't help. See Logical vs. Named (Physical) Fonts for more information about this.)
  3. Question marks. This means “I converted data from one character encoding to a different one and the target encoding didn’t have the characters in it”. For example, if you convert Japanese data to ISO8859-1 (a Western European encoding), the target encoding doesn’t contain any Japanese characters. So you’ll get one question mark per Japanese character in the output.
  4. “Mojibake”. That's Japanese for “screen garbage”. This is what happens when you apply the wrong character encoding to some data. It might look something like: 譁�蟄怜喧縺鯛ぎテ�テ。テ淌ィテォナ�. Or it might look more degraded, like: 文字化け€à áßèëÅ. You can also get a mix of question marks and "junk": ????、珮゚齏ス.
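Both failure modes are easy to reproduce with explicit encode/decode calls. Here is a quick sketch in Python (the text is arbitrary; any language that exposes character encodings will show the same behavior):

```python
text = "日本語"

# Question marks: the target encoding has no Japanese characters,
# so each one is replaced with '?'.
print(text.encode("iso8859-1", errors="replace"))   # b'???'

# Mojibake: perfectly good UTF-8 bytes decoded with the wrong encoding.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.decode("iso8859-1"))               # junk along the lines of 'æ¥æ¬èª'
```

Note that in the mojibake case no information is lost: decoding the same bytes as UTF-8 again recovers the original text. In the question-mark case the data is gone for good.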

To read more about how data can get messed up, see Frank Yung-Fong Tang's presentation Software Defect Patterns Which Break Text Integrity (which gives something like 17 patterns of mojibake).

Obtaining and Installing Fonts

As noted above, the solution to the 'tofu' problem is installing fonts into your operating environment.

Windows users can usually install a font called “Arial Unicode MS”, which comes with Microsoft Office (2000 and later). It contains something like 60K different characters from Unicode, or about two-thirds of the total Unicode 4.0 range (and most of the characters that you’ll encounter in the wild). It doesn't contain any non-BMP (supplemental) characters. For that you'll need some of the font sources listed below.

Code2000. You can also download the Code2000 font. It may not be as attractive as Arial Unicode MS, but it covers nearly every Unicode BMP character.

Gallery of Fonts. A good site for finding language or script specific fonts is hosted by Wazu Japan. This site allows you to actually see the fonts in question, along with what scripts they support and which blocks of Unicode.

Unicode and multilingual support site. This is Alan Wood's site, which has a lot of useful tutorial material, plus links to a vast array of fonts.

Logical vs. Named (Physical) Fonts

Operating and display environments often have two kinds of fonts that are available to programmers: logical and physical.

Logical fonts are, as their name implies, software constructs that represent a kind of generic font. These commonly have generic names like 'sanserif', 'serif', and 'monospaced' to distinguish them from specific physical fonts installed on the user's system.

Physical or "named" fonts are the discrete fonts installed on your operating system. These are the real fonts; ultimately every logical font resolves to one or more physical fonts. For example, the font 'Times' is often the physical font used for the logical font 'serif'.

The problem with named fonts is that each font has a specific range of characters or glyphs available in it. In some cases, as with Times or Helvetica, there may be many languages that a specific font can represent. However, most fonts do not contain glyphs for the full range of Unicode characters. This means that if you select a named font (in code, for example) you'll see tofu boxes for characters not in the font.

Logical fonts can have the same problem. However, most logical font systems are set up so that the system, either dynamically (as in most browsers) or via configuration (font.properties, see above), searches for fonts with the necessary glyphs in them. This allows a logical font to cover a larger range than any specific named font can.

Troubleshooting Tips

IBM Connections screen in Japanese

One of the main complaints you may have when you're unfamiliar with a particular foreign language is that it is hard to work through problems if you can't read the screens. There are a number of ways you can approach this problem.

Two Screens

One way to work through problems is to run the target language and an English machine side-by-side. This lets you read the normal messages in English while working with the foreign-language screens. Nearly all such screens are identical. This works well as long as you a) have two machines and b) have identical problems on both.

Switch to English

Another way to get past problems is to switch your configuration to English (using the same tips supplied just above for switching to a particular target language). This is useful for reading error messages in a third-party product (such as an Oracle tool or a troublesome driver with multiple language resources in it).

Try reproducing the error on an English configuration first. Some bugs are just bugs and have nothing to do with international configurations!

Use Babelfish

If a vexing message really is blocking progress, don't forget that you don't need 100% accurate translations in most cases. You can use an online service to translate the message and tell you what is going on or what a value is. One of the easiest to use is Babelfish. In many cases you can select the text in an error dialog and paste it into Babelfish to get an instant translation.

Configuring Mac OS X

There is really not much to do here. Mac OS X comes with all the Unicode support you could want. Support for many locales is available on your install disk: you should install every language and keyboard you can. They take only a little bit of space.

Configuring Windows

Configuring the Locale: Windows

Windows users can set the locale on a per-session (login) or per-system basis. Due to limitations in Java, Java applications pick up only the per-machine (system) locale.

Windows programmers can set the locale on a per-thread basis, which is a very powerful way to manage international preferences.

In this section we'll focus on setting up a machine to run in another locale.

Windows 2000

Windows 2000 Regional Options Control Panel

The Regional Options control panel allows you to add international support to a Windows 2000 system. In the above screen shot, I have installed the Windows 2000 MUI (Multi-language user interface) packages for several countries, so there is an additional list box (menus and dialogs), which you may not have.

If you have Windows 2000 you can install the MUI package, which comes on some install CDs or as part of MSDN. The MUI package allows you to run your system in an Asian locale but with English menus and dialog boxes.

Setting the default language

You can run any Win2K or WinXP box in any locale. To change your locale you need to change both your user locale and your system locale. In the Windows 2000 dialog above, click “Set default” and choose the locale you want. Then change the setting “Your Locale” in the main dialog (pictured above). If you installed the MUI, you can choose whether the user interface stays in English or appears in the language you select; if you haven’t installed the MUI, the interface will stay in English. You will have to copy some files from your I386 directory; once this is done you can just keep the installed files (Windows will prompt you). Finally you will have the opportunity to reboot (in most cases, although Windows XP no longer requires this in all instances).

Important: Using an MUI-based configuration is not the same thing as using a native language operating system. On Windows, for example, many of the registry and system settings on native language Windows are localized (translated). On English Windows--even with the appearance of Japanese or some other language--this won't be the case. For example, on real French Windows 2000 the "Administrator" account is actually named "Administrateur". Real systems certification should use real foreign language configurations. However: you can use MUI configurations during development to test basic internationalization formatting (numbers, dates, etc.) and locale awareness, as well as many encoding options.

Windows XP

With Windows XP things are slightly different. The control panel is called "Regional and Language Options" and it is organized differently. There are several things you must do to configure XP for another locale:

Setting up XP for text processing

Most copies of Windows XP have a wide range of locales and support files installed on them. However, if you are using the default installation of English (or Western European) Windows, you won't have the supporting files for East Asian languages (that is, multibyte locales like Japanese and Chinese) or "complex script" languages (such as Arabic, Thai, and languages from the Indian subcontinent) installed. These locales won't even show up in your list of available languages. So first you must install the supporting files. Click the second tab, Languages, and make sure both check boxes are checked. The system will prompt you for your Windows XP install CD if the files are not installed, so have it handy.

The Regional and Language Options Control Panel

Now that the supporting files are installed, you can set the locale. Click on the first tab Regional Options. Here Japanese is selected. Leave the location list box alone: it doesn't affect locale-sensitive operations.

The Advanced Tab

Then click the third tab, Advanced. Select the same locale here as you did on the first tab (in this screen shot French (France) is selected; to match the first tab, Japanese would need to be selected). This controls information such as which code page (character encoding) is used by non-Unicode programs. Finally, make sure the checkbox that applies this information to all users (it is highlighted above) is checked.

The Advanced Tab

This should cause you to get the warning dialog shown above. This is what you want to have happen. Click 'Apply' or 'OK'. You may be prompted to install files (most machines have the XP install files locally or you can call IT for the disk). If you've had to copy files, you'll be prompted to reboot. Otherwise you are now running in the new locale.

How can I tell if it is working?

How to tell if it is working? Point at the clock.

Point your mouse at the “time” in the lower right corner of the screen. The date will display in the format for the locale you are running in.

UNIX, Linux, and FreeBSD

Editing Unicode text on a Unix box (or kissing-cousins such as Linux or FreeBSD) requires three things:

It's best if you are using a modern distribution of the operating system, one that has UTF-8 locales installed on it. FreeBSD users may need to install the "utf8locales" port. Almost all other systems come with UTF-8 locales out of the box. You can configure your machine to use a UTF-8 locale by default, or, before you start your editor, change the LANG environment variable (or set it in your shell's initialization file).

Second, you'll need a shell tool. If you access your machine from Windows or a Mac, this might be SecureCRT, putty, terminal, or something like that. For configuration of these, see this section below. Third, you'll need an editor that can handle Unicode text; configuring several popular editors is covered below.

Configuring the Locale: UNIX and Linux

Unlike Windows systems, UNIX allows the user to set the locale on a per-process basis only. When the locale is set, everything in that process context has the same locale set. Processes and threads spawned by the process inherit the locale.

For programmers, this is more limiting, since setting the locale in one part of the code affects (instantly) all other threads in the same process. However, it makes it easier to test software, since the user doesn't have to configure the entire machine in a specific locale.

The locale used by a UNIX process is read from a set of environment variables: LANG and its relatives LC_*. You can see the current settings by using the command 'locale':

The locale command

The command 'locale -a' lists all of the installed locales on your system:

locale -a

On my SuSE 9 system I have 271 of these installed. You'll note that many of the locales end with an encoding after the dot: zh_CN.gb18030 for example. This controls the character encoding used by the shell and/or display. Locales that end in ".utf8" or ".utf-8" use the UTF-8 multibyte encoding of Unicode. These are generally the best choice for really broad testing. However, when testing East Asian languages, you should include "legacy" or non-Unicode encoded locales. In many cases, keyboard input is tied to legacy encodings.

Locales with no encoding extension generally use some legacy non-Unicode encoding. Exactly which encoding is used is vendor-and-locale specific.
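As an illustrative sketch, Python's standard locale module carries a table approximating these vendor defaults; locale.normalize fills in the implied codeset for a bare locale name (the exact codesets returned depend on the Python version):

```python
import locale

# A bare locale name implies a legacy default codeset...
print(locale.normalize("ja_JP"))   # e.g. ja_JP.eucJP
print(locale.normalize("en_US"))   # e.g. en_US.ISO8859-1
```

This is a handy way to see which legacy encoding a bare locale name is likely to imply before you test against it.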

On UNIX, the Japanese Shift-JIS encoding is often called PCK (for "PC Kanji": Shift-JIS is what Windows systems use). It is at least as common to use the EUC encodings for Asian languages (EUC stands for Extended Unix Code, after all). The shell, console, display, and browser all have to agree on the encoding in order to present data correctly. You also have to manage the difference between the encoding of various files and that of the display. For example, webMethods log files generally use UTF-8 as an encoding by default. Displaying the files in a non-UTF-8 shell (by viewing them with the 'vi' editor, for example) may result in mojibake on the screen, even though the file is fine.
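You can simulate the "UTF-8 file in a non-UTF-8 shell" situation in a few lines of Python (a sketch; the file name and log text are made up for illustration):

```python
import os
import tempfile

line = "ログ出力: 正常"          # a UTF-8 log line
path = os.path.join(tempfile.mkdtemp(), "server.log")
with open(path, "wb") as f:
    f.write(line.encode("utf-8"))

# Read it back assuming the wrong encoding, as a Shift-JIS shell would:
with open(path, encoding="shift_jis", errors="replace") as f:
    print(f.read())               # mojibake on the screen...

# ...but the bytes on disk are untouched:
with open(path, encoding="utf-8") as f:
    print(f.read())               # ログ出力: 正常
```

The file itself is never damaged; only the interpretation at display time is wrong, which is exactly why switching your shell to a UTF-8 locale fixes the symptom.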

To change your locale in UNIX for a single process, prefix the command with the assignment: LANG={new locale} {start command}:

Use LANG to alter the language.

Most UNIX systems also offer the option to login with a specific language and encoding setting. This is very useful for setting up a complete environment (where all processes run in a specific language) and may be required to get consistent behavior.

Note that you can use the reverse of the first trick to get a specific piece of software to run in English! (LANG=en...). On UNIX, incidentally, the default U.S. English-like locale is called either "C" or "POSIX" (some systems have both). The root user generally uses this locale (and some systems will not display data with the 8th bit set in the console for security purposes).

Configuring and Using Keyboards (Unix and Linux)

Installing and using non-English keyboard maps isn't as easy in Unix as it is in Windows. It is harder to switch keyboards, especially in a terminal session (where most of the interesting testing takes place). Luckily Unix systems provide the ability to type accents on letters using "dead" keys or the <META> key.

Detailed information on this is located in this FAQ: Console and Keyboard HOWTO. Some keyboard layouts are easy (Spanish); others, such as Japanese, are more difficult. Let's start with "compose" keys.

For Linux there are basically four modes:

Japanese (and other Asian languages) are more difficult. As with Windows, you need a separate little program (the IME) to intercept keystrokes and present you with choices of kanji characters. All Unix IMEs come in two parts, the IME (the program running in the shell that you interact with) and the dictionary part (which is a daemon process that looks up your choices). One such combination, for example, is kinput2 and Canna. kinput2 is the IME. Canna is the dictionary. You can run kinput2 with another dictionary (Wnn is a popular alternative, for example).

Here's a link to the SuSE documentation: kinput2-Canna

You may also be able to use your telnet program on Windows to your advantage. For starters, you can usually use your Windows keyboard to send the appropriate characters for you (you already know how to type those characters thanks to the other parts of this document)! Alas, you need a telnet program that knows about encodings. Otherwise what you'll get are bytes going over the wire to your Unix box and being interpreted as who-knows-what characters.

Personally I use SecureCRT, which also offers SSH connections. SecureCRT has a programmable keyboard of its own, so you can set it up to send whatever character or keycode combination you wish on the wire (it's encoding aware too).

SecureCRT soft keyboard

Configuring Popular Unix Editors

Emacs

Make sure you have ports/libgnugetopt installed before you compile. Emacs 22.1 is the earliest with complete support for displaying and editing Unicode.

If you run screen, do so with the -U option (this sets screen to use UTF-8 as its default encoding). You should also include the following lines in your ~/.screenrc file:

# Set default terminal and character set to utf-8
defutf8 on
defencoding utf-8

To get less to run in UTF-8 mode you'll have to set the LESSCHARSET environment variable to 'utf-8'. Add the following line to your ~/.bashrc file: export LESSCHARSET=utf-8.

Finally, you're ready for Emacs. The main instructions for configuring Emacs are located here:

Basically, you need to tell Emacs all about the encodings you intend to use:

  (require 'un-define)
  (set-language-environment 'utf-8)
  (set-default-coding-systems 'utf-8)
  (set-terminal-coding-system 'utf-8)

Be careful of file-name-coding-system. This is the encoding used in your local file system. If you change this and use non-ASCII characters, you may have a difficult time manipulating the file names later. (In general it is good practice to avoid non-ASCII filenames on Unix systems.)

Of particular interest in Emacs: the glyph used for "undisplayable" characters (tofu) is the question mark—that is, the same character used to represent a character replaced during a faulty character conversion. This makes it difficult to tell whether the buffer has been damaged by a character conversion process (or perhaps doesn't use UTF-8 as its encoding) or whether the font just won't display the character. If you see question marks, stop and verify the file before proceeding.

vim, gvim

vim (and its relatives, such as gvim) generally work well with UTF-8 text, provided you can run them in a UTF-8 locale. They provide a method for setting the file encoding, in cases where you wish to work with text that is UTF-8 encoded in a non-UTF-8 locale:

 :let &termencoding = &encoding
 :set encoding=utf-8

Note that the foregoing sets your terminal encoding (used for display) to the default value for your session. This allows you to interact with the keyboard, including IMEs, in a nicer manner. Ideally you should configure your shell to use UTF-8. You may also need to set your display font in gvim:

 :set guifont=-misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1

This is all described in detail in the Vimdoc.

Shells, Terminal Emulators, and Tools

SecureCRT and Putty

If you log into your Unix box from a Windows or Mac machine, you'll need to configure your telnet or other terminal tool to use the proper encoding. Ideally, you should use the Unicode UTF-8 encoding for your shell session.

SecureCRT encoding configuration

To configure SecureCRT, you need to go to the "Appearance" item in the "Session Options" dialog box. Here you set the "Character Encoding" used to interpret bytes displayed in the shell to the one used by your shell (in this case, UTF-8). You'll also need to set the font to display the characters correctly by clicking the "Font" menu.

SecureCRT font selection

Note that the fonts in SecureCRT require you to choose an encoding to match the font. If you need to edit Japanese text, you need to choose a Japanese font (such as Mincho) and the JAPANESE script. If you want to edit Traditional Chinese instead, you'll need to choose a font such as MingLiU with the corresponding CHINESE_BIG5 script. You'll only be able to see a single script at a time. This even applies to scripts such as Cyrillic and Latin.

Putty is very similar to SecureCRT in functionality. The encoding configuration is in the "Translation" screen (under "Window" in the navigation panel), as shown in the screen shot. The font setting is under "Appearance".

Putty Configuration

Mac OS X Terminal

The terminal program on Mac OS X does fonts by itself (you can configure the default font, but it will find an appropriate font for any glyph that the default font doesn't have). You still need to configure Terminal to use UTF-8 as the shell encoding:

Mac OS X terminal configuration

Kermit

Kermit also offers a scriptable shell and provides ways to convert encodings and use different display encodings, as well as for reading/writing/transcoding local files and during file transfers.

There are Windows and Unix versions of Kermit.

Eclipse

The Eclipse IDE is popular, especially with Java programmers. It can be configured to store entire projects or any part of the project in UTF-8 (or some other encoding). Usually it is best to use UTF-8 consistently. In version 3.1, to set the default encoding used by your copy of Eclipse, choose "Preferences..." from the "Window" menu. On the "Editors" item in the list, set the encoding to UTF-8:

Eclipse Editor Preferences Dialog

You can also override the encoding for a specific file from the "Properties" dialog in the "File" menu:

Eclipse File Properties Dialog

Other Platforms and Environments

Oracle Clients and Other Components

Sometimes other components in your system have an effect on your testing. There are real differences between, for example, real Japanese Windows as sold in Japan and English Windows configured to run with Japanese settings. One of these differences is the way that third party components are installed and act. A common example would be database client software. If you install the Oracle client on your English Windows configuration, then the settings for your Oracle "home" will reflect the configuration of Windows when you do your install.

One of these settings is the NLS_LANG parameter used when creating Oracle sessions. If you install on U.S. English Windows, the value of this in the registry is likely to be "AMERICAN_AMERICA.WE8ISO8859P1" or "AMERICAN_AMERICA.WE8MSWIN1252". If you connect to a Japanese Shift-JIS or EUC database, the client software will still be using this setting. The Oracle database will convert any data you select from the database to that encoding and you'll get odd-looking results.

Similarly, any SQL*Plus statements that modify text in the database are likely to cause problems. For example, if your NLS_LANG is set to MSWIN1252 and you run SQL*Plus from a (Japanese configured) command line, it may look like you are correctly inserting Japanese characters into the database: "SELECT" statements that you run return the right looking values on that machine. But the real values in the database have been mapped from your NLS_LANG setting, so the results in the database are junk.

To overcome this you must set your NLS_LANG to match the current code page you are using in Windows (note that this is not necessarily the same thing as the database's native encoding!). To do this, find your Oracle HOME entry in the registry and modify your NLS_LANG setting appropriately.

Oracle HOME in the Windows Registry

Each additional component may differ in how you configure its settings to get "proper" behavior when testing. Knowing which settings to change or use depends on the specific component and language combination you need to support.

Installing Windows Keyboards

Install Some Keyboards: Windows

Installing Keyboards in Windows 2000

Next, be sure to install some keyboards—you can install keyboards on any language or locale operating system. The keyboard support in Windows allows you to type non-ASCII characters with ease. I recommend that every developer and QA person install the following keyboards:

This last is a customized US English keyboard. If you hold down the right Alt+Ctrl (or hit AltGr if you have a European keyboard), all of the keys on the keyboard type special accented characters of one sort or another. The top alpha row (QWERTYUIOP) contains combining or “dead letter” accents: you type that key and then the letter you want to modify, and you get the accented letter.

The Chinese keyboard installed...

In the screen capture above I have installed the Japanese and “Chinese (Traditional) Unicode” keyboards. (I’ve installed a bunch more you can’t see, of course). Set the Japanese keyboard as your default. Then you’ll be able to type Japanese whenever you want to. Mostly the keyboard allows you to type English (this whole document was typed with the Japanese keyboard).

When you’re done a new item appears in your Taskbar:

The keyboards menu

In the following screen shot you can see that the French standard keyboard is selected. Hovering the mouse over the [FR] gives you a tool tip showing the keyboard locale.

The French Keyboard is active...

The French layout, as with other non-English keyboards, is significantly different from the US layout. In this case you’ll find that there are accented letters along the top (number keys) row with this keyboard and some "dead keys" that produce accents when typed in combination with other letters. Still, it is easy to get lost on this keyboard. If you need to see the layout visually, Microsoft has a nifty page that shows you where the keys are and what various keystroke combinations do on the GlobalDev site.

Typing Other Languages

Japanese

Japanese IME

You might have noticed on a Japanese machine that there is a little paintbrush over the Japanese “rising sun” symbol—that’s the Japanese keyboard (the “IME” or Input Method Editor, a program for typing Japanese). Changing your keyboard to Japanese brings up an additional little window. This window controls the Japanese IME, or keyboard input program.

On Windows 2000 and later, the Japanese IME can be installed on any language operating system. Java folks: older JDKs (pre 1.3.1) had an annoying bug that prevented you from typing Japanese into a java program unless you set your system locale to Japanese, but with modern Java versions this bug is fixed and you should be able to type Japanese (or anything else) at any time.

Many applications are enabled to take IME input directly (called "on the spot" editing). Others may cause a little extra window to pop up where you do your editing of the Japanese. When you're done the text is blasted at your program. Internationalized programs really should support on-the-spot editing.

Otherwise, on Japanese Windows NT 4.0 or on any language Windows 2000, the IME looks exactly the same and can be used as follows.

Japanese IME modes menu

The IME has six settings that can be used to type the same “word” in different ways. The settings are:

It’s important to note that each Asian language has its own IME (or even several, in the case of Chinese).

The Japanese IME has a nifty tool called the “soft keyboard” that allows you to compose all sorts of special sequences. Here’s the button to push:

The soft keyboard

Explore the little keyboard-like drop-down box (upper left corner) to see some interesting options.

Use the Japanese keyboard as your regular keyboard. In “direct input” mode, it works exactly like your U.S. English keyboard. So using it as your regular keyboard won’t interrupt your work, but you have instant access to Japanese input at all times.

Now that you are a little bit familiar with the Japanese IME, let's type some Japanese! When you need to type some Japanese, all you do is switch modes, either by clicking on the little floating toolbar or typing Alt+` (that’s the character above your TAB key on most USA keyboards).

On a U.S.A. or European layout keyboard, Japanese is typed by typing the “romaji” (Latin) letters for a word exactly as it is pronounced. While there are some tricky aspects to this, most people can easily master typing simple words such as “nihongo”, which means “Japanese”, or Japanese company names like 不実 (fujitsu), 豊田 (toyota), 三菱 (mitsubishi). How did I know how to type those Japanese characters? I typed “fujitsu”, “toyota”, and “mitsubishi”.

Japanese typing is modal. First you enter the phonetic characters (called 'kana') that make up a word, and then you transform each cluster of kana characters into the Chinese ideograph(s) for the particular word. The Chinese ideographs are called "kanji" in Japanese, and you may have heard Japanese writing referred to by that name from time to time. Kanji, as you'll see, is only part of the way that Japanese write text.

There are two kinds of 'kana'. The first kind, hiragana, is used for writing words of Japanese origin. The second kind, katakana, is used for writing words of mostly foreign origin. Hiragana and katakana are both phonetic writing systems. That is, each character represents a specific sound. Hiragana and katakana have exactly the same set of forty or so sounds that they can represent. (The table that follows shows the complete range of romaji syllables represented by kana. Note that Japanese does not distinguish certain sounds; for example, you can type "mojibake" as "mozibake" and get exactly the same kana.)

  a    i    u    e    o
  ka   ki   ku   ke   ko
  sa   shi  su   se   so
  ta   chi  tsu  te   to
  na   ni   nu   ne   no
  ha   hi   fu   he   ho
  ma   mi   mu   me   mo
  ya        yu        yo
  ra   ri   ru   re   ro
  wa             wo   n
  ga   gi   gu   ge   go
  za   ji   zu   ze   zo
  da   di   du   de   do
  ba   bi   bu   be   bo
  pa   pi   pu   pe   po
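The parallel structure of the two syllabaries is visible in Unicode itself: each standard hiragana letter sits exactly 0x60 code points below its katakana counterpart. A quick Python check (illustrative only):

```python
import unicodedata

hira = "にほんご"   # "nihongo" in hiragana
# Shift each code point up by 0x60 to reach the katakana block.
kata = "".join(chr(ord(c) + 0x60) for c in hira)
print(kata)                       # ニホンゴ (the same word in katakana)
print(unicodedata.name("に"))     # HIRAGANA LETTER NI
print(unicodedata.name("ニ"))     # KATAKANA LETTER NI
```

The same trick works in reverse (subtract 0x60), which is handy when generating kana test data.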

Kanji characters are an ideographic writing system borrowed from the Chinese (the more general term for these characters, from the Chinese, is Han). Japanese kanji are written in a distinct style from Chinese Han characters. However, both share basic principles, shape, and form. In ideographic writing systems, the character or characters in a word do not represent the sounds of the word (as kana do, or as alphabets and abugidas represent aspects of the structure of the spoken language). There are upwards of 50,000 Han characters in Unicode 4.0, and the number continues to grow as historical characters are added with each release. This has interesting implications for software processing of text, which I won't delve into here.

Here's how to type Japanese, then:

  1. Change to one of the Japanese input modes (either hiragana or katakana, see above) from “Direct Input”.
  2. Typing Japanese
    Type the word or phrase that you would like to enter. Both hiragana and katakana have about 40 sounds—the same forty in each—so if you limit yourself to typing these sounds you’ll make “real” Japanese characters.
  3. Hit space.
    After you’ve entered a good bit of kana, press the space bar. This switches to kanji selection mode. Hitting the space bar cycles through various kanji representations of the kana you’ve entered. If you would like the kana to be entered, just hit return (or use the right arrow key to move to the next kana "cluster"). If you want kanji characters, hit the spacebar until the ones you want are displayed or until you get a pick list to choose the kanji from.
  4. Uppercase gets Wide ASCII
    Typing all uppercase gets you wide (multibyte) English: MITSUBISHI (typing uppercase gets you English immediately). Multibyte English characters are real non-ASCII characters and provide a basic test that you can read. The characters are often called “wide” characters because they are “wide” (in that they take two bytes each in Japanese legacy encodings) and because they are physically wider on the screen (traditionally they take two screen positions in fixed width fonts). Here’s some wide ASCII: MITSUBISHI. The cool, hip, in-crowd word for "wide" characters, by the way, is zenkaku.
  5. Or just one uppercase...
    If you type just one uppercase letter (Ｍｉｔｓｕｂｉｓｈｉ) you get wide ASCII also.
  6. Hit return when done. The text now becomes part of your document and the cursor moves to the end of what you’ve just typed so that you can enter more text.

So far you’ve learned to type simple Japanese words (like famous Japanese company names), nonsense Japanese phrases, and wide ASCII. Practice this and you’ll be testing with Japanese characters quickly.

Of course, sometimes you need to test specific cases that are interesting. What characters are interesting to type?

Trailing byte of “backslash” in Shift-JIS: ソース (“so-su”—yes, you type a hyphen in the middle; the IME turns it into the long-vowel mark). The multibyte characters in the common Japanese encoding Shift-JIS have lead bytes (that is, the first byte in a double-byte sequence) in the ranges 0x81-0x9F and 0xE0-0xFC, and trailing bytes in the range 0x40-0xFC. The backslash character (0x5C)—a common programming escape, obviously—falls in that trailing-byte range. The katakana character ‘so’ (ソ) ends in the backslash byte.
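
You can see the offending byte directly by encoding the string in Java (a sketch; it assumes your JRE ships the Shift_JIS charset, as standard JDK distributions do):

```java
import java.nio.charset.Charset;

public class SjisBackslash {
    public static void main(String[] args) {
        byte[] sjis = "ソース".getBytes(Charset.forName("Shift_JIS"));
        for (byte b : sjis) System.out.printf("%02X ", b);
        System.out.println();
        // The second byte is 0x5C, the ASCII backslash. Naive code that
        // scans raw bytes for '\' will mangle this perfectly valid string.
    }
}
```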

Zenkaku or Wide English: １２３．４５ ＡＢＣｄｅｆ. The program should interpret numbers as numbers and English text as itself (some programs don’t).
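
One common way a program interprets wide English correctly is to normalize it to ASCII first. Here is a minimal sketch of that idea (the helper name is hypothetical); it relies on the fact that the fullwidth forms U+FF01 through U+FF5E mirror ASCII 0x21 through 0x7E at a fixed offset:

```java
public class Zenkaku {
    // Map fullwidth forms back to ASCII; U+3000 is the ideographic space.
    static String toAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c >= '\uFF01' && c <= '\uFF5E') sb.append((char) (c - 0xFEE0));
            else if (c == '\u3000') sb.append(' ');
            else sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("１２３．４５　ＡＢＣｄｅｆ"));  // 123.45 ABCdef
    }
}
```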

Hankaku or “Half-width” katakana: ｿｰｽﾁﾉﾏｻｼ. These characters are available only on the PC or on UNIX in the PCK locale (Shift-JIS encoding). They represent the same values as their “wide” counterparts and should be passed and displayed without data loss. The key here is that some systems use legacy (non-Unicode) encodings that don't support half-width katakana. For example, UNIX systems that use EUC and many legacy database encodings don't. So you could "lose" data in these configurations. Half-width characters are called 'hankaku' in Japanese.
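
Whether a particular legacy encoding can hold half-width katakana can be probed with Java's CharsetEncoder. This is a sketch: ISO-8859-1 stands in here for an arbitrary Latin-only legacy encoding, and which charsets are available depends on your JRE.

```java
import java.nio.charset.Charset;

public class HankakuCheck {
    public static void main(String[] args) {
        String hankaku = "ｿｰｽ";  // half-width katakana "so-su"
        // Shift_JIS (PCK) can represent half-width katakana...
        System.out.println(Charset.forName("Shift_JIS").newEncoder().canEncode(hankaku));
        // ...while a Latin-only encoding cannot: the data would be lost.
        System.out.println(Charset.forName("ISO-8859-1").newEncoder().canEncode(hankaku));
    }
}
```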

The Chinese Unicode Keyboard

Typing Chinese requires at least a rudimentary knowledge of the language and keying concepts, so generally it is harder to master testing using the Chinese keyboard than it is to type in Japanese.

The Chinese Unicode keyboard is an exception to this. It allows you to type the hexadecimal value of a Unicode character and have the IME enter the character for you. That way, if you know the character's number (formally called its Unicode Scalar Value), you can type the character. Use this keyboard to enter specific Unicode characters.

Typing the Euro symbol; the typed Euro symbol.

For example, I happen to know that the Euro symbol is Unicode character U+20AC. My English keyboard doesn't have a key for it, but I can just type "2", "0", "A", "C" into the Chinese Unicode keyboard and there it is: €. You can use this keyboard to enter otherwise difficult-to-type sequences such as combining marks or surrogate pairs (for supplementary characters).
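
The same trick is available programmatically: Java can build a string from the scalar value you would otherwise type on this keyboard (a minimal sketch):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // The Unicode scalar value you would type: 20AC.
        int scalar = 0x20AC;
        String euro = new String(Character.toChars(scalar));
        System.out.println(euro);  // €
    }
}
```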

The Korean Keyboard

Typing Korean characters (that is, nonsense Korean) is very simple, but the keyboard layout and language structure require a good bit of memorization for non-Koreans to master. To type nonsense Korean, switch to the Korean keyboard and type a left-hand letter, a right-hand letter, and another left-hand letter (for example, “akd”). Note how Korean characters are composed as you type.

Korean is an interesting language for a number of reasons. Although the script (called Hangul) looks similar in style to the Han ideographs used for Chinese and Japanese, it is completely different in how it works. Hangul was actually invented by a Korean king and is a phonemic system: the shapes of the characters are said to resemble the positions of the lips and tongue when making each sound. Each Korean character is composed of two or three parts (jamo) that form a syllable, and Unicode encodes both the separate shapes (as combining jamo) and the complete set of possible combinations (as precomposed syllables).
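
You can watch this dual encoding at work with Java's built-in Unicode normalizer (a sketch): NFD splits a precomposed syllable into its jamo, and NFC reassembles them into one code point.

```java
import java.text.Normalizer;

public class HangulDemo {
    public static void main(String[] args) {
        String syllable = "한";  // one precomposed Hangul syllable, U+D55C
        // NFD decomposes it into three conjoining jamo...
        String jamo = Normalizer.normalize(syllable, Normalizer.Form.NFD);
        System.out.println(jamo.length());  // 3
        // ...and NFC composes them back into the single syllable.
        System.out.println(Normalizer.normalize(jamo, Normalizer.Form.NFC).equals(syllable));
    }
}
```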

The Serbian Keyboard and Cyrillic Text

Languages such as Russian, Ukrainian, and Serbian are often written in the Cyrillic script. The Russian and Ukrainian keyboards feature layouts which are natural for speakers of those languages, and thus more difficult for Westerners to pick up. The Serbian language keyboard, by contrast, is laid out similarly to the QWERTY keyboard (that is, the equivalent Cyrillic letters, or their close transcriptions, are on the same keys as their Latin counterparts).

For example, to type "русски", you type the letters "russki" on your keyboard. (Some letters which have no Latin counterpart are mapped to various punctuation keys and a few necessary for Russian or Ukrainian are not present on the Serbian keyboard.)

Bidirectional and Complex Text

Complex selection in Arabic (bidi) text.

Some languages (notably Arabic and Hebrew) are read from right to left instead of left to right (as English is read). Since embedded text and numbers often follow English-like rules even in these languages, they are called bidirectional languages, or Bidi for short.

Bidirectional text has all sorts of interesting issues associated with it. It implies that the entire screen should be mirrored, left-for-right, for example. Progressions, arrows, and other navigational elements must be reversed in order to keep the same meaning.

Typing a bidi language can be confusing, since the display order of the text is governed by a complex algorithm (the Unicode Bidirectional Algorithm). You can actually select discrete blobs of text that aren't touching on the screen (see screenshot).
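
Java exposes that algorithm through java.text.Bidi, which is handy for checking whether test strings actually exercise mixed directionality (a minimal sketch):

```java
import java.text.Bidi;

public class BidiDemo {
    public static void main(String[] args) {
        // English with embedded Hebrew: one string in logical order,
        // but three directional runs on the screen (LTR, RTL, LTR).
        String s = "abc \u05D0\u05D1\u05D2 def";
        Bidi bidi = new Bidi(s, Bidi.DIRECTION_LEFT_TO_RIGHT);
        System.out.println(bidi.isMixed());      // true
        System.out.println(bidi.getRunCount());  // 3
    }
}
```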

Other languages, such as Arabic, Thai, and the languages of the Indian subcontinent (Hindi and others written in scripts such as Devanagari and Kannada), use "complex" scripts. In a complex script, the characters change shape as you type them (a character's shape depends on the characters that precede and follow it).

Other Approaches to Text Input and Testing with Non-ASCII Text

The Character Map Utility

Character Map

One nifty tool that you can use on any Windows machine is the “Character Map” utility.

Character Map lets you select characters from any installed font and assemble them into strings to paste into your application. In addition, it shows you the local encoding and Unicode value for each character (circled above).

In the “advanced view” (shown) you can select a Unicode font and Unicode character set to see and use all of the Basic Multilingual Plane of Unicode. Or, in a particular font, point to a specific character to see the Unicode Scalar Value (or just copy-and-paste the character).

Accessible Keyboard

Another tool built into Windows XP is the "On-Screen Keyboard" under "Accessibility" in your "Accessories" menu. The on-screen keyboard does what you might imagine; it shows you a picture of your current keyboard. It won't show you IME keys and the like, but it will show you the layout of non-QWERTY keyboards. For example, here are screen captures of the French, Thai, and Serbian (Cyrillic) keyboards:

French keyboard; Thai keyboard; Serbian (Cyrillic) keyboard.

Web-based Character Picking Utilities

Richard Ishida has a variety of Web-based character pickers. These make it easy to select specific characters (in the same manner as the Character Map tool). All you need is a Web browser.

Unicode character pickers (home page)

Pseudo Translation

Pseudo Translate Tool

In addition to typing characters into programs, you'll want to generate data that is useful for testing purposes. The Pseudo-Translate tools provide the ability to generate non-ASCII data from ASCII strings. One generates text in a dialog box for cut-and-pasting. The other generates properties files that are pseudo-translated (for use when finding hardcoded strings).
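
The core idea is easy to sketch. The following is a toy pseudo-translator, not the actual tool described above: it swaps ASCII vowels for accented look-alikes and brackets the string, so hardcoded (untranslated) strings stand out while the text stays readable.

```java
public class PseudoTranslate {
    // Hypothetical substitution table: each vowel maps to the accented
    // character at the same index.
    static final String PLAIN    = "AEIOUaeiou";
    static final String ACCENTED = "ÅÉÎÕÜàéîõü";

    static String pseudo(String s) {
        StringBuilder sb = new StringBuilder("[");
        for (char c : s.toCharArray()) {
            int i = PLAIN.indexOf(c);
            sb.append(i >= 0 ? ACCENTED.charAt(i) : c);
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        System.out.println(pseudo("Open File"));  // [Õpén Fîlé]
    }
}
```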

An HTML page that uses the pseudo-translator is on the web here: pseudo.jsp

There is also a paper about Pseudo-Translation on this site (in PDF format). It's The Theory and Practice of Pseudo-Translation.

native2ascii

Another way to generate specific Unicode characters without a keyboard is to use the widely available native2ascii tool included in the Java JDK. First, create a plain ASCII file with the text you want in it, representing each non-ASCII character as a \uXXXX escape. For example, "€" [the Euro symbol] (U+20AC) is represented by the string \u20ac.

Note that supplementary characters (that is, those above U+FFFF in Unicode, which require a surrogate pair in UTF-16 or four bytes in UTF-8) must be represented as a surrogate pair. Thus U+10000 is \ud800\udc00.
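
If you don't want to compute surrogate pairs by hand, Java will do it for you; this sketch prints the two escapes you would put in the file for native2ascii:

```java
public class SurrogateEscapes {
    public static void main(String[] args) {
        // UTF-16 surrogate pair for the supplementary character U+10000.
        char[] pair = Character.toChars(0x10000);
        System.out.printf("\\u%04x\\u%04x%n", (int) pair[0], (int) pair[1]);
        // prints: \ud800\udc00
    }
}
```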

Now you can create your UTF-8 file like this:

native2ascii -encoding UTF-8 -reverse < infile.txt > utf8file.txt

Appendices, Links, and Other References

Useful Links

Here are some other links with similar content or of interest to folks using this page.

Appendix A. Addison's font.properties file

For older Java runtimes (pre-1.5) you used to need to modify your JRE: most Java runtimes did not come configured to display all characters in all languages. You had to install the fonts onto your system and then modify the font.properties file to force the JRE to recognize the fonts. This could be quite tedious, but it is somewhat logical. The easy way to do it on Windows is to copy mine.