OSTwoUsrManCPPStrings

I'm not a Unicode expert, so if you want to correct any of the technical information below, feel free. This is just my understanding.

The Problem

To understand the problem, you need to understand the difference between a character set and a character encoding. A character set is the collection of characters that you want to work with (strictly speaking characters, not glyphs, since a glyph is just one way of drawing a character). Unicode's advantage is that its character set is very large and covers most of the world's written languages. A character encoding, though, is a way of mapping a character set to actual numbers that you can manipulate in a program. If you change the encoding, the same number can map to a different character. That's why switching character encodings on a web page makes the page display so differently even though the page data is the same.
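To make the distinction concrete, here is a minimal sketch that decodes the same bytes under two different encodings. The function names are made up for illustration, and the UTF-8 decoder assumes well-formed input and does no error recovery.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Latin-1 (ISO-8859-1): every byte is its own code point.
std::vector<std::uint32_t> decodeLatin1(const unsigned char* bytes, std::size_t len) {
    return std::vector<std::uint32_t>(bytes, bytes + len);
}

// UTF-8: a lead byte announces how many continuation bytes follow.
// Assumes well-formed input; a real decoder must validate every byte.
std::vector<std::uint32_t> decodeUtf8(const unsigned char* bytes, std::size_t len) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < len) {
        unsigned char b = bytes[i];
        // Number of continuation bytes implied by the lead byte.
        std::size_t extra = (b < 0x80) ? 0 : (b < 0xE0) ? 1 : (b < 0xF0) ? 2 : 3;
        assert(i + extra < len);  // the sequence must not be truncated
        // Keep only the payload bits of the lead byte.
        std::uint32_t cp = (extra == 0) ? b : (b & (0x3F >> extra));
        for (std::size_t k = 1; k <= extra; ++k) {
            cp = (cp << 6) | (bytes[i + k] & 0x3F);  // 6 payload bits per continuation byte
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Fed the two bytes 0xC3 0xA9, the Latin-1 decoder yields two characters (U+00C3 and U+00A9) while the UTF-8 decoder yields the single character U+00E9. The data is the same, only the interpretation differs.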

With Unicode, addressing every possible character requires code points up to U+10FFFF, which needs 21 bits and in practice means a 32-bit unit per character. But in most applications, most data lives in the ASCII character set, which fits in 7 bits. Storing every character in 32 bits would result in a huge waste of space. With today's memory sizes, the most efficient Unicode encoding that still results in one code unit for each on-screen character in many languages is a 16-bit encoding.

This is what Windows uses internally, the UTF-16 encoding. Each character takes up at least 2 bytes of space. Xerces, logically, also uses UTF-16 as its native encoding. This means that all DOM information is stored in UTF-16 form in memory, and accessing data in the DOM returns 16-bit UTF-16 strings of type XMLCh*, where XMLCh is an alias for a 16-bit type such as wchar_t or unsigned short, depending on the platform.
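The "at least" matters: characters beyond U+FFFF do not fit in one 16-bit unit and are stored as a pair of units (a surrogate pair). The following is a rough sketch of the standard UTF-16 encoding rule; the function name is my own and is not part of Xerces or XMLTooling.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Code points up to U+FFFF take one unit; higher ones take a surrogate pair.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);              // outside the Unicode range
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogate values are not characters
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        std::uint32_t v = cp - 0x10000;  // leaves a 20-bit value
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return units;
}
```

So the letter A (U+0041) is one unit, while U+1F600 becomes the two units 0xD83D 0xDE00. Code that assumes one XMLCh per character works for most text but not for everything.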

The problem is that very few C library implementations understand UTF-16 natively. So-called "wide strings" in C are not required to use a particular Unicode encoding, and don't even have the same character size from one platform to the next. On many Unix systems, wchar_t is 32 bits rather than 16. So it is impossible to use C library functions portably to manipulate the XML data. Even worse, you can't even use string literals. On Windows, you can get away with it, but for portable applications, literals have to be laid out as arrays of Unicode constants.
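As a sketch of what "arrays of Unicode constants" looks like in practice, the snippet below spells out the string "name" by hand. Utf16Unit is a stand-in I've invented for XMLCh so that the example compiles without Xerces; in real code you would use XMLCh, and Xerces supplies named constants for individual characters in its XMLUniDefs.hpp header that read better than raw hex.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the Xerces XMLCh type (an assumption; the real typedef varies by platform).
typedef unsigned short Utf16Unit;

// "name" spelled out as UTF-16 code units, terminated by 0.
// A wide literal L"name" would only match this layout where wchar_t is 16 bits.
static const Utf16Unit kName[] = { 0x006E, 0x0061, 0x006D, 0x0065, 0x0000 };

// Length in code units of a 0-terminated UTF-16 string (strlen for 16-bit units).
std::size_t utf16Length(const Utf16Unit* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}
```

Note that even something as basic as a length function has to be written (or borrowed from XMLString) rather than taken from the C library.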

Finally, due to some requirements in the ISO C++ Standard, it turns out that creating std::basic_string specializations for built-in types like unsigned short is technically not legal without a lot of difficult localization work. Most compilers let you get away with it because it's an important use case, but some compilers don't. The XMLTooling library includes a class called xstring that is declared when the configure script detects that it can do so. This greatly simplifies the creation of Unicode strings using C++ idioms, but only works on most platforms, not all.

The Implications

The fallout from all this is that manipulating XML information with this code can be tedious at times. Just setting a predetermined value is a hassle because you have to obtain a 16-bit version of the value in order to set it. You'll need to become familiar with the Xerces XMLString class, which is not a great API, but does contain a variety of useful string functions that can manipulate both 8-bit and 16-bit string data.

Among the more useful features is a method called XMLString::transcode(), which converts between UTF-16 and the local code page. Note that this does NOT mean that every possible Unicode character will be handled properly. The whole problem with local code pages is that they can't represent every (or even most) Unicode characters, so you do lose information. The best use for transcoding is to deal with information that is known to be mostly ASCII, or when some loss is acceptable.
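To illustrate the kind of loss involved, here is a deliberately simplistic stand-in that narrows UTF-16 to plain ASCII. This is NOT how XMLString::transcode() is implemented, and what a real transcoder does with an unrepresentable character (substitute, skip, or fail) depends on the platform. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Narrow 0-terminated UTF-16 data to ASCII.
// Anything outside the 7-bit range is replaced with '?', so the original is unrecoverable.
std::string narrowToAscii(const unsigned short* s) {
    assert(s != nullptr);
    std::string out;
    for (std::size_t i = 0; s[i] != 0; ++i) {
        out += (s[i] < 0x80) ? static_cast<char>(s[i]) : '?';
    }
    return out;
}
```

Given the UTF-16 units for "café", this returns "caf?". Once the accented character is gone there is no way to get it back from the narrow string.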

Another option for simplifying work is to move between UTF-16 and UTF-8, which is an ASCII-compatible encoding that uses a single byte for ASCII characters and sequences of two to four bytes for everything else. The main advantage is that UTF-8 never uses the NUL character (0) for anything other than termination, which means that UTF-8 strings can be treated as NUL-terminated ASCII strings for basic use cases. Some functions will not return accurate results (e.g. strlen cannot detect multi-byte sequences and will return the number of bytes, not the number of actual Unicode characters) but for many purposes, this doesn't matter.
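If you do need the character count rather than the byte count, it is not hard to compute. A minimal sketch, assuming the input is well-formed UTF-8 (the function name is my own):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Count code points in a NUL-terminated UTF-8 string, assuming it is well formed.
// Continuation bytes look like 10xxxxxx; every other byte starts a new character.
std::size_t utf8CodePoints(const char* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the UTF-8 bytes of "café" (the é is the two bytes 0xC3 0xA9), strlen reports 5 while this function reports 4.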

Helpers

There are a handful of useful classes and functions provided by the XMLTooling library to assist you. The Xerces XMLString class and the xstring class have already been mentioned. The unicode.h header also includes a pair of auto_ptr-like classes that can handle both memory management and transcoding of data between forms.

For example, to turn an ASCII literal into a Unicode form:

auto_ptr_XMLCh wide("literal");
functionTakingUnicode(wide.get());

Another pair of functions (toUTF8 and fromUTF8) is available to do full-fidelity conversion between UTF-16 and UTF-8. Memory management here is up to the caller:

char* narrow = toUTF8(unicodeData);
delete[] narrow;
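Since the buffer has to be freed with delete[], one way to avoid leaks when an exception or early return gets in the way is to hand it to a smart pointer right away. The sketch below assumes a C++11 compiler. duplicateWithNew is a hypothetical stand-in for any function that, like toUTF8 as described above, returns memory allocated with new[]; it is here only so the example compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical stand-in for a function returning a new[]-allocated C string.
char* duplicateWithNew(const char* s) {
    assert(s != nullptr);
    std::size_t n = std::strlen(s) + 1;  // include the terminator
    char* copy = new char[n];
    std::memcpy(copy, s, n);
    return copy;
}

// Wrap the raw buffer so that delete[] runs automatically when the owner goes away.
// The array form unique_ptr<char[]> matters: it calls delete[], not delete.
std::unique_ptr<char[]> ownedCopy(const char* s) {
    return std::unique_ptr<char[]>(duplicateWithNew(s));
}
```

The caller then uses get() to pass the string along, much like the auto_ptr_XMLCh example above, and never writes delete[] by hand.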

It's best to avoid true UTF-8 conversion unless you really need it, because it's somewhat expensive.