Working with Encodings in C Strings

The C programming language (and its close relatives) provides two means of processing text. The first uses the char data type for manipulating multibyte text; the second uses the wchar_t type for manipulating "wide" text.

When programming in C or C++ you have to make design judgment calls about how to handle text. Unlike programming languages such as Java or C#, C gives you choices for handling textual data on a byte level and it has an ambiguous relationship to Unicode (ISO/IEC 10646). Dealing with these issues requires some additional knowledge about the data types involved and the capabilities of the C libraries. If you want a solid, portable Unicode solution (and who does not?) you may have to use a library such as ICU.

The char data type really represents a byte. Wherever you see char in this document (or in the C Standard or any of its friends), you should mentally translate it to the word byte. In C, a string is merely an array of chars terminated by a NUL char (are you translating mentally?) and a char * is a pointer to an array of bytes. C programmers, dating from the seminal K&R book, are taught to manipulate the pointer value by adding to or subtracting from it. This works in a world in which "1 byte == 1 character", but it breaks down once a character can occupy two or more bytes (and fails altogether in stateful encodings).

A particular problem in most C documentation is the failure to define what 'multibyte' means: what multibyte encodings are, how they work, and what the range of issues might be. The article "Are We Counting Bytes Yet?" gives a (Java-centric) look at character encodings and the complexities involved.

The main problems C programmers encounter involve pointer arithmetic or functions that are not multibyte aware. For example, assume that you're writing a Windows program for the Japanese locale, where the native encoding is a flavor of Shift-JIS (a Japanese multibyte encoding). In Microsoft's version of Shift-JIS (also known as code page 932), characters take one or two bytes each. Each two-byte character begins with a "lead byte" in one of two ranges (0x81-0x9F and 0xE0-0xFC). The trailing byte can fall anywhere in the range 0x40-0xFC (except 0x7F). Notice that this range includes all of the ASCII letters, as well as certain characters of interest to programmers, such as 0x5C (the REVERSE SOLIDUS, aka backslash). Consider the following code fragment running with this encoding:

   const char *src = "\x82\x4F\x82\x50\x82\x51\x81\x5c";
   const char *tgt = "O"; // O == 0x4F
   const char *loc = strstr(src, tgt); // strstr returns a pointer, not an int

Hint: loc points at the trailing byte 0x4F of the first character (offset 1 in src). A multibyte-aware function would have returned NULL, because the string contains no letter "O".
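To make the difference concrete, here is a minimal sketch of a search that respects character boundaries. It hard-codes the code page 932 lead-byte ranges quoted above instead of consulting the locale, so it only applies to that one encoding. The names cp932_is_lead_byte and cp932_find_char are hypothetical helpers made up for this illustration, and the sketch does not validate trailing bytes.

```c
#include <stddef.h>

/* Lead-byte ranges for code page 932, as quoted in the text above. */
static int cp932_is_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hypothetical helper: finds the single-byte character c in a CP932 string,
 * stepping over both bytes of every two-byte character.
 * Returns NULL if c does not occur as a character of its own. */
const char *cp932_find_char(const char *s, char c) {
    while (*s != '\0') {
        if (cp932_is_lead_byte((unsigned char)*s)) {
            if (s[1] == '\0') break;   /* truncated character: stop */
            s += 2;                    /* skip lead byte and trail byte */
        } else {
            if (*s == c) return s;     /* a genuine one-byte character */
            s += 1;
        }
    }
    return NULL;
}
```

Called on the src string from the fragment above, cp932_find_char(src, 'O') returns NULL, where strstr reported a match inside the first character.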

Certain things work without multibyte awareness. For example, a string equality test will work as long as the encodings are the same: the bytes are the same or they aren't. Many things that a programmer will want to do with a string or pair of strings don't require knowledge of the encoding. For example, if you are searching for a SQL keyword in a long string (let's say it's "SELECT"), then the strstr function will find it far faster than a multibyte-aware function would. On the other hand, if you're looking for the semicolon at the end of the statement, then multibyte awareness is crucial: you want to ensure that the semicolon char "\x3B" you find is really a semicolon character and not part of some multibyte character (as it might be in certain Korean encodings).

Sometimes developers want to make assumptions about the structure of "all encodings everywhere", not expecting certain edge cases to crop up. As a general rule, the fewer assumptions you make about the structure of an encoding in general-purpose code, the better off you are. By "general purpose", I mean "code that should work in any runtime environment". If you are developing specific code for a specific encoding, then you can begin to make assumptions about the encoding structure; this kind of code is rare, though.

This document deals with the types of operations that present trouble and how to work around them. It does not deal very deeply with encodings, character sets, and the like.

Encoding Basics

C uses the char type to access textual data. This is an integer type that represents a single byte (8 bits on virtually every current platform). Text in encodings that need more than one byte per character requires special handling. You can't just do:

    for ( tp = string; *tp != '\0'; tp++ )
        *tp = tolower ( *tp );

This gets you into trouble because the pointer steps right into the middle of a character (in our example above, the trailing byte 0x4F becomes 0x6F). It can also be a problem because case mappings sometimes produce a longer output sequence than input sequence; that is, they may produce more characters than the input contained. A famous example is the German sharp-s character (called an Eszett in German). The uppercase equivalent of U+00DF (ß) is "SS", that is, two capital letters "S". An in-place loop like the fragment above, written for uppercasing, would have to overwrite the next character in the input buffer with the second capital S. If the next byte in the buffer were the null at the end of the string, we have a problem. We need to avoid this by using appropriate functions that are "multibyte aware" and generally available.
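The length problem can be illustrated with a small sketch that uppercases into a separate output buffer instead of in place. It assumes the input is ISO 8859-1 (where ß is the single byte 0xDF), which is exactly the kind of encoding-specific assumption that general-purpose code should avoid. The function name latin1_upper is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: uppercases ISO 8859-1 text from `in` into `out`.
 * Returns the number of bytes the full result needs (excluding the NUL),
 * so a return value >= out_size tells the caller the output was truncated. */
size_t latin1_upper(const char *in, char *out, size_t out_size) {
    size_t need = 0;
    for (; *in != '\0'; in++) {
        unsigned char b = (unsigned char)*in;
        unsigned char up[2];
        size_t n = 1;
        if (b == 0xDF) {                        /* sharp s expands to "SS" */
            up[0] = 'S'; up[1] = 'S'; n = 2;
        } else if ((b >= 'a' && b <= 'z') ||
                   (b >= 0xE0 && b <= 0xFE && b != 0xF7)) {
            up[0] = (unsigned char)(b - 0x20);  /* offset valid for ISO 8859-1 only */
        } else {
            up[0] = b;                          /* no uppercase form in this encoding */
        }
        for (size_t i = 0; i < n; i++, need++) {
            if (need + 1 < out_size) out[need] = (char)up[i];  /* keep room for NUL */
        }
    }
    if (out_size > 0) out[need < out_size ? need : out_size - 1] = '\0';
    return need;
}
```

Uppercasing the six-byte input "straße" this way needs seven bytes of output ("STRASSE"), which is why the result cannot simply be written back over the input.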

Here is a sample program that calls setlocale and then walks across a string to print out character values:

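#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>

/*
 * Assumed reconstruction of deathBySJIS, whose definition is not part of
 * this listing: a byte-oriented strstr search that returns the offset of
 * the match, or -1 if there is none.
 */
int deathBySJIS(const char *src, const char *tgt) {
   const char *hit = strstr(src, tgt);
   return hit ? (int)(hit - src) : -1;
}
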
/*
 * Prints the (wide character) value of each character in the string 'example'.
 * On Windows this is the UTF-16 code unit for the character. On other platforms
 * this may vary (it could be UTF-32 or it could just be multibyte cheese). In all cases
 * you will get one integer value for each logical character according to the current
 * LC_CTYPE and its associated encoding.
 */
void printmbvals(const char* example) {
   size_t len = strlen(example);
   size_t pos = 0;
   while (pos < len) {
      wchar_t t;
      int result = mbtowc(&t, example+pos, MB_CUR_MAX);
      if (result <= 0) break;   /* 0: NUL reached; -1: invalid sequence */
      printf("%x ", (unsigned)t);
      pos += (size_t)result;
   }
   printf("\n");
   printf("strlen is : %u\n", (unsigned)len);
   /* _mbstrlen is Microsoft-specific and is not available on other platforms */
   printf("_mbstrlen is : %u\n", (unsigned)_mbstrlen(example));
   return;
}

int main(int argc, char* argv[])
{
	const char* example = "hello, world\x82\x4F\x82\x50\x82\x51O";

	for (unsigned x=0; x < strlen(example); x++) {
		printf("%x ", example[x]&0xff);
	}
	printf("\n");

	printmbvals(example);
	printf("dbs = %d\n", deathBySJIS(example, "O"));
	char *locale = setlocale(LC_ALL, "");
	printf("called setlocale and got this response: %s\n", locale);

	printmbvals(example);
}

Here is the output on a Japanese version of Windows:

68 65 6c 6c 6f 2c 20 77 6f 72 6c 64 82 4f 82 50 82 51 4f
68 65 6c 6c 6f 2c 20 77 6f 72 6c 64 82 4f 82 50 82 51 4f
strlen is : 19
_mbstrlen is : 19
dbs = 13
called setlocale and got this response: Japanese_Japan.932
68 65 6c 6c 6f 2c 20 77 6f 72 6c 64 ff10 ff11 ff12 4f
strlen is : 19
_mbstrlen is : 16
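The _mbstrlen call in the listing is Microsoft-specific. A rough portable counterpart can be sketched with the Standard C function mblen, which reports how many bytes the next character occupies in the current LC_CTYPE encoding. The name mbs_count_chars is a hypothetical helper, and the sketch assumes setlocale has already been called if anything other than the "C" locale is wanted.

```c
#include <stdlib.h>   /* mblen, MB_CUR_MAX, size_t */

/* Hypothetical helper: counts characters (not bytes) in s according to the
 * current LC_CTYPE locale. Returns (size_t)-1 on an invalid sequence. */
size_t mbs_count_chars(const char *s) {
    size_t count = 0;
    (void)mblen(NULL, 0);                 /* reset any shift state */
    for (;;) {
        int n = mblen(s, MB_CUR_MAX);
        if (n == 0) return count;         /* reached the terminating NUL */
        if (n < 0) return (size_t)-1;     /* invalid or truncated character */
        s += n;                           /* advance by one whole character */
        count++;
    }
}
```

For a pure ASCII string the result equals strlen; the two diverge as soon as the string contains multibyte characters and the locale's encoding recognizes them.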

What's My Intention?

If you've already written some C or C++ code that works with strings or character arrays, then you'll need to go through and fix or "enable" the code. Enabling is not done by simply replacing all of the str* calls with mbs* calls (for one thing, mbs* calls are not portable: there is only limited support for multibyte in Standard C99). You need to look at what the code is doing and make a decision about whether the operation requires multibyte awareness or not. (If you are Unicode-enabling the application you have a different problem.)

The programmer's intention in many cases is more important than the actual function being called. There are counter examples for every case that follows. When determining the programmer's original intent in calling a specific function, you have to perform the mental translation from "char" to "byte" and think about what happens when you (for example) look at a multibyte buffer a byte at a time.

For example, a call to strchr is questionable: that's because the byte you are searching for can be part of a multibyte encoded character, as noted above. But if you know that both the source and target buffers are ASCII-only, you might keep the call to strchr instead of replacing it with or writing an implementation of mbschr. Similarly, a comparison call such as strncmp is questionable if you're looking at two arbitrary buffers, but if one of the buffers is a string literal with a command value in it, then you might not have a problem. The same reasoning applies to this strcmp call:

  if (strcmp(restricted, "YES") == 0)
    *mode = 1;
  else
    *mode = 0;

String comparison done this way is a lot faster than multibyte (and also locale) aware comparisons (such as used in strcoll). String equality of this nature always works: if all the bytes are the same, the strings are the same (provided they use the same encoding).
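The same byte-equality argument covers prefix checks against an ASCII literal. The sketch below uses a hypothetical helper, starts_with_keyword, and rests on two stated assumptions: buf begins on a character boundary, and in the encoding in use an ASCII-range byte at a character boundary is always a one-byte character (true of the Shift-JIS family and of UTF-8, but not guaranteed for every encoding).

```c
#include <string.h>   /* strncmp, strlen */

/* Hypothetical helper: returns 1 if buf begins with the ASCII keyword,
 * 0 otherwise. A plain byte comparison is enough under the assumptions
 * stated above, so no multibyte-aware walk is needed. */
int starts_with_keyword(const char *buf, const char *keyword) {
    return strncmp(buf, keyword, strlen(keyword)) == 0;
}
```

A call such as starts_with_keyword(statement, "SELECT") keeps the speed of strncmp without risking a match in the middle of a character, because the comparison is anchored at the start of the buffer.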

The C standard library

Here is a table of string functions in C and what the multibyte ramifications are for each:

Function | Multibyte Equivalent | When it is safe | When to replace
strcpy | n/a | always | never
strncpy | n/a | always, although you need to be sure you don't start mid-character when copying | random copies of buffers containing multibyte text.
strcat | n/a | always, although using this to build human readable messages should be replaced with printf | always*
strncat | n/a | be sure you don't copy part of a character | always*
strcmp | n/a | always: string equality is string equality |
strncmp | | usually: be sure the 'n' value isn't mid-character | when searching for a single-byte character in a multibyte buffer.
strdup | n/a | always | not multibyte affected
strchr | mbschr (gnulib) | searching for NUL, \n, \r, \t and other controls or searching an ASCII buffer | almost always
strrchr | mbsrchr (gnulib) | searching for NUL, \n, \r, \t and other controls or searching an ASCII buffer | almost always
strcspn | n/a | never | whenever you are searching a buffer that can contain multibyte. Searching a restricted ASCII buffer is okay
strspn | n/a | never | whenever you are searching a buffer that can contain multibyte. Searching a restricted ASCII buffer is okay
strpbrk | n/a | never | whenever you are searching a buffer that can contain multibyte. Searching a restricted ASCII buffer is okay
strstr | n/a | usually | Don't use strstr as a surrogate for strchr (e.g. the target string is a single character). Be aware that short strings may mimic valid multibyte sequences and thus cause false matches. A multibyte aware string match walks the source string using calls to mblen.
strtok | | rarely | string tokenization can't be done where the separator is a multibyte character or where it is a valid lead or trail byte.
strcoll | | locale aware: depends on calling setlocale |
strxfrm | | locale aware: depends on calling setlocale |
strlen | | always, when you want to know the number of bytes in a string | always when calculating how many characters are in the string
strnlen | | be sure maxlen isn't mid-character |
mblen | mblen | always: returns the number of bytes in the next character | used to implement functions that "walk" across a char* buffer instead of using the ++ operator.

* The GNU programming guide has this to say about strcat: "Programmers using the strcat or wcscat function (or the following strncat or wcsncat functions for that matter) can easily be recognized as lazy and reckless. In almost all situations the lengths of the participating strings are known (it better should be since how can one otherwise ensure the allocated size of the buffer is sufficient?) Or at least, one could know them if one keeps track of the results of the various function calls. But then it is very inefficient to use strcat/wcscat. A lot of time is wasted finding the end of the destination string so that the actual copying can start."
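The table lists no multibyte equivalent for strcspn, so here is one possible sketch of a replacement built on mblen. The name mbs_cspn_ascii is hypothetical, and the sketch assumes the reject set contains only single-byte (ASCII) characters and that setlocale has selected the right encoding.

```c
#include <stdlib.h>   /* mblen, MB_CUR_MAX, size_t */
#include <string.h>   /* strchr */

/* Hypothetical helper: like strcspn, but only one-byte characters are tested
 * against `reject` (assumed ASCII-only), so lead and trail bytes never match.
 * Returns the byte offset of the first match, of the terminating NUL, or of
 * the first invalid sequence, whichever comes first. */
size_t mbs_cspn_ascii(const char *s, const char *reject) {
    const char *p = s;
    (void)mblen(NULL, 0);                     /* reset any shift state */
    for (;;) {
        int n = mblen(p, MB_CUR_MAX);
        if (n <= 0) break;                    /* NUL or invalid sequence */
        if (n == 1 && strchr(reject, *p) != NULL) break;
        p += n;                               /* step over the whole character */
    }
    return (size_t)(p - s);
}
```

The return value is still a byte offset, like that of strcspn, so it can be used directly for pointer arithmetic on the original buffer.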

Useful macros:

Macro | Description | Use | Portability
MB_CUR_MAX | Number of bytes in the longest character in the current encoding. | Buffer allocation. | Standard C
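As a small illustration of the buffer-allocation use, the sketch below encodes one wide character into a freshly allocated buffer. Because MB_CUR_MAX depends on the current locale and is not a compile-time constant, the buffer is sized at run time. The function name wc_to_mbs is hypothetical.

```c
#include <stdlib.h>   /* malloc, free, wctomb, MB_CUR_MAX */
#include <stddef.h>   /* wchar_t */

/* Hypothetical helper: encodes one wide character in the current locale's
 * multibyte encoding. Returns a malloc'd, NUL-terminated buffer (the caller
 * frees it) and stores the byte count in *len, or returns NULL on failure. */
char *wc_to_mbs(wchar_t wc, int *len) {
    /* MB_CUR_MAX is a runtime value, so size the buffer dynamically. */
    char *buf = malloc((size_t)MB_CUR_MAX + 1);
    if (buf == NULL) return NULL;
    (void)wctomb(NULL, 0);               /* reset any shift state */
    int n = wctomb(buf, wc);
    if (n < 0) { free(buf); return NULL; }   /* wc has no representation */
    buf[n] = '\0';
    *len = n;
    return buf;
}
```

For fixed-size arrays, the compile-time constant MB_LEN_MAX from <limits.h> serves the same purpose across all supported locales.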

Code to Avoid

Look out for these patterns in your code. They represent improper assumptions about the structure of non-ASCII data.

  for (p = string; *p; p++)
        if ((*p >= 'a') && (*p <= 'z')) *p += 'A' - 'a';

ASCII Math: adding or subtracting the value of 'A' or 'a' is based on the idea that all characters share an offset between upper and lowercase that ASCII uses (it's 0x20). This sometimes works with single-byte encodings, but some characters will suffer. In this case, we have an implementation of toupper which only works with pure ASCII, not Western European, and destroys trailing bytes too (because it doesn't call mblen to find out how big the next character is).

Pointer Math: using ++ and -- to navigate strings will produce problems when working with multibyte characters.
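Both problems can be avoided together by letting mblen decide how far to advance and by handing only one-byte characters to toupper. The following is a minimal sketch under the assumption that setlocale has already selected the encoding; mbs_upper_single_byte is a hypothetical name. It still maps one byte to one byte, so it does not handle mappings that change length, and it leaves multibyte characters untouched.

```c
#include <ctype.h>    /* toupper */
#include <stdlib.h>   /* mblen, MB_CUR_MAX */

/* Hypothetical helper: uppercases in place, but only characters that occupy
 * a single byte; multibyte characters are stepped over unchanged. */
void mbs_upper_single_byte(char *s) {
    (void)mblen(NULL, 0);                       /* reset any shift state */
    for (;;) {
        int n = mblen(s, MB_CUR_MAX);
        if (n <= 0) break;                      /* NUL or invalid sequence */
        if (n == 1) {
            /* the cast avoids undefined behavior for negative char values */
            *s = (char)toupper((unsigned char)*s);
        }
        s += n;                                 /* advance by whole characters */
    }
}
```

Compared with the loop above, no arithmetic on 'A' or 'a' is involved, and the pointer never lands on a trailing byte.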

Here is how to write your own mbschr function:

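#include <stdlib.h>   /* mblen, MB_CUR_MAX */

/* Finds the first occurrence of the single-byte character c in the
 * multibyte string s. Assumes setlocale has selected the encoding. */
char *mbschr(const char *s, int c) {
 int t = mblen(s, MB_CUR_MAX);
 while (t > 0) {
    /* only a one-byte character can match, never a lead byte */
    if (t == 1 && (unsigned char)*s == (unsigned char)c) {
       return (char *)s;
    }
    s += t;   /* step over the whole character */
    t = mblen(s, MB_CUR_MAX);
 }
 return NULL;   /* reached the NUL (t == 0) or an invalid sequence (t < 0) */
}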

Searching backwards through strings is affected too! A call to a function like strrchr will find trailing bytes even faster than searching forwards through a string. Writing the 'r' search functions is harder than you might initially think. That's because encoding state is determined by reading from the front of the string. If you look at the mbschr example above you'll see that it can advance the pointer by seeing how many bytes form the next character. (An interesting limitation of mbschr is that it cannot find a multibyte character in a char*: that would require the multibyte equivalent of strstr, because a char* is really a byte*.)

	program = strrchr(av[0],'\\');

Searching for a character in a string will fail if the 'character' is also a valid part of a multibyte character (here, our friend 0x5C). strchr can be used to search for the null ('\0') at the end of a string or for controls such as \n, \r, or \t.

Exercise

Write mbsrchr and mbsstr. Answers follow below.


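Exercise Answers

#include <stddef.h>   /* ptrdiff_t */
#include <stdlib.h>   /* mblen, MB_CUR_MAX */
#include <string.h>   /* strlen, strncmp */

/* Finds the last occurrence of the single-byte character c in s. Because the
 * encoding can only be decoded from the front, this walks forward and
 * remembers the most recent match. */
char *mbsrchr(const char *s, int c) {
 char *found = NULL;
 int t = mblen(s, MB_CUR_MAX);
 while (t > 0) {
    if (t == 1 && (unsigned char)*s == (unsigned char)c) {
       found = (char *)s;
    }
    s += t;
    t = mblen(s, MB_CUR_MAX);
 }
 return found;
}

/* Finds the first occurrence of tgt in src that starts on a character boundary. */
char *mbsstr(const char *src, const char *tgt) {
 if (src == NULL) return NULL;
 if (tgt == NULL) return NULL;
 size_t tgt_len = strlen(tgt);
 const char *end = src + strlen(src);   /* measured once, not on every pass */
 int t = mblen(src, MB_CUR_MAX);
 // optimization: quit when fewer bytes remain in src than tgt needs
 while (t > 0 && (end - src) >= (ptrdiff_t)tgt_len) {
    if (strncmp(src, tgt, tgt_len) == 0) {
       return (char *)src;
    }
    src += t;
    t = mblen(src, MB_CUR_MAX);
 }
 return NULL;
}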