Python Internationalization
The Problems
- We want to make sure all strings containing UI text are Unicode encoded.
- Even with Unicode strings, basic Python operators don't work correctly.
- We'd like to use Python's existing support for I18N, but much of it is built on the platform's C libraries.
All Strings In Unicode
Unfortunately, it's easy to create a Python string that uses the default ("plain text") encoding instead of Unicode. All you have to do is write "some text" instead of u"some text".
Ideally we won't have any raw text in Python code at all, and we could use scripts to scan the source for non-Unicode string literals to help enforce this.
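One way such a scanning script might look is sketched below. It tokenizes source text and reports every string literal that lacks a u prefix. The function name find_non_unicode_literals is made up for illustration, the sketch assumes the scanner itself runs under Python 3, and it makes no attempt to exempt docstrings or byte strings that are legitimately not UI text.

```python
import io
import tokenize


def find_non_unicode_literals(source):
    """Return (line_number, literal) for each string literal lacking a u prefix."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        # The prefix is whatever precedes the first quote character (u, r, b, ...).
        quote_positions = [tok.string.find('"'), tok.string.find("'")]
        first_quote = min(i for i in quote_positions if i >= 0)
        prefix = tok.string[:first_quote].lower()
        if "u" not in prefix:
            hits.append((tok.start[0], tok.string))
    return hits


sample = 'title = "some text"\nlabel = u"some text"\n'
flagged = find_non_unicode_literals(sample)
```

Here only the literal on line 1 of the sample is flagged; a real check would probably run over every file in the source tree and fail the build when the list is non-empty.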
I don't know whether you could set the default encoding (via setdefaultencoding()) to be utf-16-be or utf-16-le, so that even strings created without explicitly specifying Unicode would be created as Unicode strings.
Python string operators
Even with a Unicode string, typical operations on strings won't work properly. I call this the curse of strcmp - things seem to work OK, as long as you're using ASCII (read "English"). But then you start running into problems with European languages that make common use of accented characters, and it totally breaks down with Asian languages like Japanese.
For example, using if (uniStrA < uniStrB): doesn't work as expected, since it does a binary comparison of code values rather than true, locale-aware collation. Slicing a Unicode string can also split a code point in half if it is made up of two 16-bit code units (a surrogate pair). And a grapheme cluster (e.g. "u" followed by U+0308 COMBINING DIAERESIS) can also get split.
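The following sketch illustrates these problems using only the standard library. The sample words are arbitrary. The surrogate-pair case is shown by counting UTF-16 code units, since whether a slice can actually split the pair depends on how the interpreter stores Unicode strings.

```python
import unicodedata

# Binary (code point) ordering puts the accented word after "zebra".
words = [u"zebra", u"\u00e4pfel", u"apple"]
by_code_point = sorted(words)

# "u" followed by U+0308 COMBINING DIAERESIS displays as a single character,
# but slicing after the first code point separates the letter from its mark.
decomposed = u"u\u0308ber"
head, tail = decomposed[:1], decomposed[1:]
tail_starts_with_mark = unicodedata.combining(tail[0]) != 0

# A character outside the Basic Multilingual Plane needs two 16-bit code units.
clef = u"\U0001D11E"
utf16_code_units = len(clef.encode("utf-16-le")) // 2
```

The sorted list comes out as apple, zebra, äpfel, which is not what a German reader would expect, and the slice leaves a bare "u" in head with an orphaned combining mark at the start of tail.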
A more obscure example involves case mapping. Python tries to leverage Unicode data to handle upper and lower-casing of text. But this is actually locale-dependent; for example, upper-casing a lower-case 'i' in a Turkish locale should give you U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE), not "I".
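A small sketch of the Turkish case: the built-in upper() always maps "i" to "I", so a locale-aware version has to special-case the dotted and dotless letters first. The helper name upper_tr is hypothetical, and it covers only this one pair of letters, not full locale-sensitive case mapping.

```python
def upper_tr(text):
    """Hypothetical helper: upper-case text using the Turkish i/I rules."""
    # Dotted i maps to U+0130, dotless U+0131 maps to plain I; then upper-case the rest.
    return text.replace(u"i", u"\u0130").replace(u"\u0131", u"I").upper()


builtin_result = u"istanbul".upper()
turkish_result = upper_tr(u"istanbul")
```

The built-in call yields "ISTANBUL", while the helper yields the string beginning with U+0130 that a Turkish reader would expect.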
We could create an ICU string class that you use everywhere instead of the built-in Python operators. But then you have to make sure none of your code accidentally uses the built-in support.
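A rough sketch of what such a wrapper class might look like follows. The names CollatedText and naive_key are invented for illustration, and naive_key is only a stand-in that folds case and drops combining marks; in a real version the key would presumably come from an ICU collator's sort key instead.

```python
import functools
import unicodedata


def naive_key(text):
    """Stand-in for a real collation key: fold case and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    base = u"".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.lower()


@functools.total_ordering
class CollatedText(object):
    """Hypothetical wrapper whose comparisons use a collation key."""

    def __init__(self, text, key=naive_key):
        self.text = text
        # Ties on the key fall back to the raw text, so equality stays exact.
        self._order = (key(text), text)

    def __eq__(self, other):
        return self._order == other._order

    def __lt__(self, other):
        return self._order < other._order

    def __hash__(self):
        return hash(self._order)


words = [CollatedText(w) for w in (u"zebra", u"\u00e4pfel", u"apple")]
in_order = [w.text for w in sorted(words)]
```

With the wrapper, the accented word sorts next to "apple" rather than after "zebra". The drawback described above still applies: any code that compares the raw .text values directly bypasses the collation.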
We could modify Python to use ICU for Unicode string operations, perhaps under the control of a switch (or triggered by the locale selected). But I've heard that getting Python to accept direct use of ICU isn't likely, and it also might cause unexpected results (e.g. a slice winds up being empty, because the requested index is in the middle of a grapheme cluster).
Python support for I18N
Python tries to leverage the standard "C" libraries as much as possible. This leads to problems in areas of I18N support, for things like locales, date/time, etc. The two common problems are that the library support is lacking, or there are platform dependencies that make it hard to work in a nice cross-platform manner.
One solution is to use external (non-core) Python modules such as mxDateTime.
Another approach is to wrap ICU so that it can be used from Python to provide the required I18N support.