Saturday, August 11, 2012

Multibyte and wide character strings in C (addendum)

Last time I wrote about multibyte character strings in C, and said that the easiest way to deal with them is to convert them to wide character strings. Unfortunately, there is an issue with the wide character type wchar_t: it is typically 32 bits wide on UNIX and UNIX-like platforms, while it is only 16 bits wide on Windows. On UNIX mbstowcs() converts to a UTF-32 string, and on Windows mbstowcs() converts to a UTF-16 string. What this means is that everything I talked about in last week’s post is quite alright on UNIX and not so cool on Windows. I’m a UNIX programmer and I don’t work on Windows, but I do care about portability, and wchar_t is hopelessly broken across platforms.
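To see which camp a given platform falls in, a minimal sketch like the one below can help. The helper names wchar_bits() and wchar_holds_any_code_point() are made up for this post; the only things taken from the standard library are wchar_t, WCHAR_MAX and CHAR_BIT.

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* size_t, wchar_t */
#include <stdint.h>  /* uintmax_t, WCHAR_MAX */
#include <wchar.h>

/* Hypothetical helper: width of wchar_t in bits on this platform. */
static size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Hypothetical helper: returns 1 if a single wchar_t can represent
 * every Unicode code point (up to U+10FFFF), 0 otherwise. */
static int wchar_holds_any_code_point(void)
{
    return (uintmax_t)WCHAR_MAX >= 0x10FFFFu;
}
```

Calling wchar_holds_any_code_point() from your own startup code is one way to refuse to run, or to switch to a fallback, on a platform where a wchar_t cannot hold a full code point.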

So, what is going on? The C standard leaves the size of wchar_t up to the implementation, and the Unicode standard goes as far as to say that portable code should not use wchar_t for storing Unicode text. Oddly enough, C does provide a complete set of functions for handling wide character strings (!). Since wchar_t is not defined as a portable type, then a) how are we supposed to work with strings and Unicode, and b) what is it doing in the standard in the first place?
The problem stems from the fact that Unicode started out with 16-bit code points, but its designers later realized they were going to need a few more bits (code points now run up to U+10FFFF, which takes 21 bits). Hence the jump to 32 bits. By that time, Microsoft had long been happily using a 16-bit wchar_t. When others started supporting Unicode, they implemented wchar_t as a 32-bit value so it could hold a single UTF-32 character. Portability ended right there.
Consequently, wchar_t was a good idea that turned out to be a failure.

Today, if you want to work around this problem, you are going to have to work with uint32_t for characters and roll your own string functions, including your own UTF-8 encoding and decoding functions. It’s pretty sad. There is a bit of good news on the horizon: the new ISO C11 standard includes two new data types, char16_t and char32_t, and associated conversion functions (mbrtoc16(), c16rtomb(), mbrtoc32() and c32rtomb() in <uchar.h>). Missing, however, are string handling functions for these types. In effect, you are discouraged from keeping strings in their UTF-32 form. As of this writing, there is no compiler that fully implements C11. These new character types are also present in C++11, and recent versions of the g++ and clang++ compilers do support them.
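To give an idea of what "rolling your own" looks like, here is a minimal sketch of a UTF-8 encoder and decoder that work on uint32_t code points. The names utf8_encode() and utf8_decode() are my own invention, not part of any standard library, and the code handles one code point at a time; a real string library would build its loops on top of this.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate or lies beyond U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* UTF-16 surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: decode one code point from the first bytes of s,
 * where len bytes are available. Returns the number of bytes consumed,
 * or 0 if the input is truncated, malformed, overlong, a surrogate,
 * or beyond U+10FFFF. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* smallest code point that legitimately needs n bytes */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t c;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)      { n = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; }
    else
        return 0;                         /* stray continuation or invalid lead byte */

    if (len < n)
        return 0;                         /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_cp[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;                         /* overlong, out of range, or surrogate */
    *cp = c;
    return n;
}
```

Because the code points live in uint32_t rather than wchar_t, this behaves the same on UNIX and on Windows. Returning 0 for every kind of bad input is a simplification; you may prefer distinct error codes so the caller can tell truncated input from garbage.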