Tuesday, August 7, 2012

Multibyte and wide character strings in C

Over a century ago, messages were transmitted over a wire using a five-bit character encoding scheme (the Baudot code). Much later, ASCII became the standard for encoding characters, using seven bits per character. ASCII is fine for representing English text, but it cannot represent most other languages, so nowadays we have Unicode. Unicode defines code points (numbers) for the characters and symbols of virtually every language there is, or was. Documents aren’t stored as raw code points; they are stored in an encoding, and the most common one is UTF-8.

UTF-8 is a variable-length encoding that uses one to four bytes per character. This means some characters (like the letters of the English alphabet) are represented by a single byte, while others take two, three, or four bytes.
For C programmers using char pointers this means:

strlen() does not return the number of characters in a string;
strlen() does return the number of bytes in the string (not counting the terminating NUL character);
buf[pos] does not address a character;
buf[pos] addresses a byte in the UTF-8 stream
buf[32] does not reserve space for 31 characters;
buf[32] might reserve space for only 7 characters …
strchr() really searches for a byte rather than a character
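
To make the difference between bytes and characters concrete, here is a minimal sketch that counts code points in a UTF-8 string without any locale support. The helper name utf8_count_chars() is made up for this example, and it assumes the input is valid UTF-8. It relies on the fact that continuation bytes in UTF-8 always have the bit pattern 10xxxxxx, so every other byte starts a new code point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a standard function): counts the code points
   in a valid UTF-8 string by skipping continuation bytes (10xxxxxx). */
size_t utf8_count_chars(const char *s)
{
    size_t count = 0;

    assert(s != NULL);

    for (; *s != '\0'; s++) {
        /* Every byte that is NOT a continuation byte starts a code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "h\xC3\xA9" "llo" (the word héllo, where é is encoded as two bytes), strlen() reports 6 while utf8_count_chars() reports 5.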

If you want to address and manipulate individual characters in a multibyte character string, the best thing you can do is convert the string to wide character format and work with that. A wide character (wchar_t) is 32 bits wide on most Unix-like systems; on Windows it is only 16 bits, so a single wchar_t cannot hold every Unicode code point there.

The operating system is configured to work with a native character encoding (which is often UTF-8, but could be something else). All I/O should be done using that encoding. So if you do a printf(), print the multibyte character string, not the wide one.

During initialization of your program (like in main()), set the locale. Every C program starts in the "C" locale, so if you forget to do this, the conversion functions may reject or mangle non-ASCII input.

setlocale(LC_ALL, "");

Converting a multibyte character string to a wide character string:

mbstowcs(wstr, str, n);

Converting a wide character string back to a multibyte character string:

wcstombs(str, wstr, n);
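
Put together, a round trip might look like the sketch below: convert to wide characters, manipulate them by index, and convert back for output. The helper reverse_chars() and the MAX_CHARS limit are invented for this example. Both conversion functions return (size_t)-1 on an invalid sequence, and a return value equal to the buffer size means the result was not NUL-terminated, so both cases are treated as errors here.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

#define MAX_CHARS 128   /* arbitrary capacity chosen for this sketch */

/* Hypothetical helper: reverses the characters (not the bytes) of a
   multibyte string. Returns 0 on success, -1 if a conversion fails or
   the result does not fit in the given buffers. */
int reverse_chars(const char *in, char *out, size_t out_size)
{
    wchar_t wstr[MAX_CHARS];
    size_t len, i, n;

    assert(in != NULL && out != NULL);

    /* Multibyte -> wide. len is the number of wide characters written,
       not counting the terminator. */
    len = mbstowcs(wstr, in, MAX_CHARS);
    if (len == (size_t)-1 || len >= MAX_CHARS)
        return -1;

    /* Now wstr[i] really is the i-th character, so swapping is safe. */
    for (i = 0; i < len / 2; i++) {
        wchar_t tmp = wstr[i];
        wstr[i] = wstr[len - 1 - i];
        wstr[len - 1 - i] = tmp;
    }

    /* Wide -> multibyte. n is the number of bytes written, not counting
       the terminator. */
    n = wcstombs(out, wstr, out_size);
    if (n == (size_t)-1 || n >= out_size)
        return -1;

    return 0;
}
```

Call setlocale(LC_ALL, "") once before using it. Reversing the bytes of a UTF-8 string directly would scramble every multibyte character; reversing the wide characters does not.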

One problem with these functions is estimating the buffer size. Either play it safe and assume each character takes up to four bytes (the MB_CUR_MAX macro gives the actual maximum for the current locale), or write a dedicated routine that correctly calculates the needed size.
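
One way to write such a routine is to let the conversion functions do the counting. The sketch below assumes that passing NULL as the destination makes mbstowcs() and wcstombs() return the required length without writing anything. POSIX documents this behavior and common C libraries implement it, but ISO C itself does not promise it, so check your platform. The helper names mb_to_wide_alloc() and wide_to_mb_alloc() are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts a multibyte string into a freshly
   allocated wide string. The caller must free() the result.
   Returns NULL on an invalid sequence or allocation failure. */
wchar_t *mb_to_wide_alloc(const char *str)
{
    size_t len;
    wchar_t *wstr;

    assert(str != NULL);

    /* Assumption: a NULL destination only counts (POSIX behavior). */
    len = mbstowcs(NULL, str, 0);
    if (len == (size_t)-1)
        return NULL;

    wstr = malloc((len + 1) * sizeof *wstr);   /* +1 for the terminator */
    if (wstr == NULL)
        return NULL;

    mbstowcs(wstr, str, len + 1);
    return wstr;
}

/* Hypothetical helper: the reverse direction, same conventions. */
char *wide_to_mb_alloc(const wchar_t *wstr)
{
    size_t len;
    char *str;

    assert(wstr != NULL);

    /* Assumption: a NULL destination only counts (POSIX behavior). */
    len = wcstombs(NULL, wstr, 0);
    if (len == (size_t)-1)
        return NULL;

    str = malloc(len + 1);                     /* +1 for the terminator */
    if (str == NULL)
        return NULL;

    wcstombs(str, wstr, len + 1);
    return str;
}
```

This costs two passes over the string, but the buffers are always exactly large enough and nothing depends on guessing the bytes-per-character ratio.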

It’s fun to see your program handle Chinese, Japanese, et cetera. For more on the subject, these two pages are highly recommended:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), Joel on Software
UTF-8 (UCS Transformation Format 8-bit), Wikipedia