Ken Krugler Internationalization Meeting
Ken Krugler came to talk to us about internationalization; Katie wrote up things we were interested in discussing at this meeting at
KatieParlante20041129. See also
InternationalizationIssues, where most of what's found on this page will (eventually) be discussed and described in more detail.
- Ken said that if the US takes 0 additional effort units and the EU takes 1 effort unit, Japan/China typically takes about 4 effort units. He pointed out that the Indic languages are contextual, so the glyphs change as you add letters. He doesn't know whether wxWidgets handles contextual scripts, or how much work it would take if it doesn't. Katie said that we aren't going to put much effort into making wxWidgets work right, because that's something we can put off until later. She wants to focus on making the architecture work.
Email
- For email auto-charset detection, the biggest problem is incorrect tagging: messages that claim one encoding but actually use characters from another. Usually this other encoding is a superset. For example, the email says 8859-1 but the message contains bytes in the range 0x80...0x9F, which are printable characters only in Windows code page 1252. So you want to auto-expand the reported charset to the biggest superset that makes sense. If the charset is us-ascii then the expansion is locale-dependent; for example, in the US it would be CP1252, but in Japan it would be CP932.
- For untagged data (more common on web sites) you have to go through a sequence of guesses as to which encoding it really is. Mozilla has a good charset/language detector that is pretty stand-alone.
- The number of people whose email clients don't recognize Unicode (UTF-8) is falling. Ken thinks it's probably around 10%.
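The charset expansion Ken described could be sketched in Python roughly as follows. This is only an illustration: the names expand_charset and decode_body, the two lookup tables, and the locale codes are made up for this example, and the tables cover only the cases mentioned above.

```python
# Hypothetical table: declared charset -> widest superset that makes sense.
SUPERSETS = {"iso-8859-1": "cp1252"}

# us-ascii expands differently depending on the user's locale (assumed codes).
ASCII_EXPANSION = {"en_US": "cp1252", "ja_JP": "cp932"}


def expand_charset(declared, locale_code):
    """Return the charset to decode with, given what the message claims."""
    name = declared.strip().lower()
    if name == "us-ascii":
        return ASCII_EXPANSION.get(locale_code, name)
    return SUPERSETS.get(name, name)


def decode_body(raw, declared, locale_code):
    """Decode raw bytes to a Unicode string using the expanded charset."""
    charset = expand_charset(declared, locale_code)
    # A few CP1252 bytes are unassigned, so replace rather than raise.
    return raw.decode(charset, errors="replace")


# A message tagged 8859-1 that really contains CP1252 curly quotes.
body = decode_body(b"\x93hi\x94", "ISO-8859-1", "en_US")
```

The result is an ordinary Unicode string, so it can be searched uniformly and re-encoded with body.encode("utf-8") when sending.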
Datetime
- For datetime, there are four different aspects that we care about:
- what we're storing in the repository (e.g. milliseconds since some epoch, like 1970)
- formatting the time for display in the UI (sensitive to locale and user prefs)
- parsing date/time text entered free-form by the user
- having a nice Python object that will implement methods like "add a month".
- Making a nice Python object rather hardcodes the Gregorian calendar -- "add a month" means something different in the Gregorian calendar and the lunar calendar. But this seems reasonable, given that true support for non-Gregorian calendars is way way down the priority list.
- Sometimes we want to store just the date and not the time, e.g. for birthdays. We don't want to store the event as "starting at midnight lasting 24 hours" because then things get messed up on timezone switches. Katie hasn't yet found a good Python Date object.
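As a rough sketch of the last two points, here is what a Gregorian-only "add a month" and a date-only birthday might look like with Python's standard datetime module. The helper name add_months and the clamping rule (Jan 31 plus one month gives the last day of February) are assumptions for this example, not something decided at the meeting.

```python
import calendar
import datetime


def add_months(d, months):
    """Add calendar months to a date, clamping to the end of the month.

    Hardcodes the Gregorian calendar, as discussed above.
    """
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return d.replace(year=year, month=month, day=min(d.day, last_day))


# A birthday stored as a plain date has no time and no timezone,
# so it cannot drift when the user switches timezones.
birthday = datetime.date(2004, 1, 31)
one_month_later = add_months(birthday, 1)
```

Other clamping rules are possible (for example, rolling over into March); which one users expect is a UI question.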
Timezone
- Most timezone packages don't have historical timezone data. For example, they don't have rules for how the timezone changed in England two hundred years ago. We probably don't care, but we should be aware that times in the past can be messed up. This is more an issue for smaller locales where DST rules have changed recently or might change in the future.
- The current proposal is to always store datetime values based on UTC. But this creates problems when you change timezones if the software auto-updates entries to local time, because you often enter an event in the timezone where it's occurring, not your current timezone. Such entries effectively have the wrong datetime value. One option is to allow events that have "unspecified" timezone info, so that they float (are always at the same time, no matter what timezone you're in). For example, if there is something you do every day at a certain time in the local time zone (e.g. take your medicine every day at 7:30 AM), then that would be "unspecified." Unfortunately, that breaks completely with sharing, and adds additional UI complexity.
- On some operating systems/calendars, when you change the timezone of your calendar, you also change your computer's timezone.
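The difference between a UTC-anchored event and a floating ("unspecified") one can be illustrated with Python's datetime module, where a datetime without tzinfo can stand in for a floating time. The function name time_shown and the fixed offsets below are assumptions for this sketch; real code would need proper timezone rules, including DST.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets keep the example self-contained; they ignore DST.
PACIFIC = timezone(timedelta(hours=-8))
TOKYO = timezone(timedelta(hours=9))


def time_shown(start, viewer_tz):
    """Return the wall-clock time a viewer in viewer_tz should see."""
    if start.tzinfo is None:
        # Floating event: same wall-clock time everywhere.
        return start
    # Anchored event: convert to the viewer's zone, then drop the zone.
    return start.astimezone(viewer_tz).replace(tzinfo=None)


medicine = datetime(2004, 11, 29, 7, 30)                # floating
meeting = datetime(2004, 11, 29, 10, 0, tzinfo=PACIFIC)  # anchored

medicine_in_tokyo = time_shown(medicine, TOKYO)  # still 7:30 AM
meeting_in_tokyo = time_shown(meeting, TOKYO)    # 3:00 AM the next day
```

The sharing problem is visible here: a floating value carries no information about which instant the author meant, so two users in different zones cannot agree on it.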
Storage
There are four places where the appropriate translation for UI strings can get loaded:
- in parcel loader (from .po, .mo (compiled .po))
- repository fetch
- inside CPIA when building widget
- inside widget, using wxWidgets standard code (may be required for wxWidgets dialogs)
At parcel loading time, we can make a symbolic link to the translation table and have lazy evaluation. Ken points out that if you keep everything in a .po catalog, this works at all four levels (including dialogs).
Bitmaps are currently stored as links to places on disk. Some bitmaps will probably need to be localizable (e.g. a stop sign).
Ken pointed out that we could access resources through a symbolic link in the repository to localization resources on disk. We could fall back through languages in a user-defined preferred sequence (e.g. French first, German second, English third).
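The fallback sequence could work along these lines. This is a sketch only: the in-memory CATALOGS dictionary, the *_ID keys, and the lookup function are hypothetical stand-ins for whatever the parcel loader or repository would actually provide.

```python
# Hypothetical catalogs keyed by language, then by symbolic name.
CATALOGS = {
    "fr": {"STOP_ID": "Arrêt"},
    "de": {"STOP_ID": "Halt", "OPEN_ID": "Öffnen"},
    "en": {"STOP_ID": "Stop", "OPEN_ID": "Open", "CLOSE_ID": "Close"},
}


def lookup(key, preferred, catalogs=CATALOGS):
    """Return the first translation found, trying languages in order."""
    for lang in preferred:
        text = catalogs.get(lang, {}).get(key)
        if text is not None:
            return text
    # Nothing found anywhere: show the symbolic name so the gap is visible.
    return key


preferences = ["fr", "de", "en"]
stop_label = lookup("STOP_ID", preferences)    # found in French
open_label = lookup("OPEN_ID", preferences)    # falls back to German
close_label = lookup("CLOSE_ID", preferences)  # falls back to English
```

Python's standard gettext.translation() accepts a similar languages list, which may be worth checking before writing anything custom.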
There are three ways that the data can be stored:
- as the actual localized string, e.g. "foi" for "foo" (the downside is space -- it will be duplicated in the .po file and the repository -- and that the user would need to resync when changing locale)
- as a symbolic name, e.g. FOO_ID (which has the advantage of reducing space if two locales have the same value for that)
- as a location in a .po file (e.g. ps.po/#13), assuming the symbolic name gets resolved to the physical location during step 1 or 2 above.
- It might be worthwhile to create a debug version of Python that catches people doing the wrong thing with strings (truncation, sorting, searching of Unicode strings, for example)
- Spend the time to make sure our sample code is right. In the Mac operating system, a few of the examples did the wrong thing (or rather, did it the non-int'l way), and since people cut and pasted them, the mistakes propagated into the wild and had to be dealt with for a long time.
- With email, the simplest thing to do is to always convert everything to UTF-16. Eudora tries to keep everything as it came in, but that messes up searching. We probably want to convert everything to UTF-16 when it comes in, and always send UTF-8.
- Use very, very simple localization tools. Most localizers are not tech-savvy and don't want to learn whizzy tools -- they'd most prefer to pass around plain text files.
- You need good tools to sync translation files. If programmers tweak the English-language version, the change needs to propagate to the translation files, and if you don't have a custom tool for that it sucks. If it gets too hard to store a string, programmers start breaking the rules.
- With a single big .po file, you can get lots of key collisions. Some kind of namespace support would be nice.
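One way to picture namespace support for catalog keys is to key each entry on a (namespace, msgid) pair, so that the same English text can be translated differently in different places. The catalog contents, the namespace names, and the translate function below are invented for illustration.

```python
# Hypothetical catalog: the same msgid appears under two namespaces.
CATALOG = {
    ("menu", "Open"): "Ouvrir",        # verb: open a file
    ("task-status", "Open"): "Ouvert",  # adjective: task is still open
}


def translate(namespace, msgid, catalog=CATALOG):
    """Look up msgid within a namespace; fall back to the msgid itself."""
    return catalog.get((namespace, msgid), msgid)


menu_label = translate("menu", "Open")
status_label = translate("task-status", "Open")
```

GNU gettext has a comparable mechanism (msgctxt entries in .po files, exposed in Python 3.8+ as pgettext), which might cover this need without custom code.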
Areas of general agreement:
- We can probably find volunteers to do the actual localization, but we will probably need to have some sort of localized version (even if it's Pig Latin) in order to make sure that the internationalization was done right.
- We might want to line up a few volunteers before 1.0.
- Strawman: We will store the text catalog in an XLIFF file that gets massaged into something else (probably a .po file) at build time.
- We probably want to go with Python-wrapped ICU libraries for our main i18n support package.
- The parcel loader is responsible for integrating external localization files.
- We will be heavily iterative, which means there are lots of things (particularly UI) that we will worry about later. We are most concerned about architecture and APIs.
- The Schema browser won't be localizable, but there needs to be a way to localize the Schema names so that end-users can use CPIA Build Mode effectively.
- We are nervous about performance.
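A Pig Latin pseudo-localization like the one mentioned above can be generated mechanically, which makes untranslated (hard-coded) strings easy to spot in the UI. The sketch below is one simple way to do it; the function names are made up, and it does not protect format placeholders such as %s or preserve capitalization.

```python
import re

VOWELS = "aeiouAEIOU"


def pig_latin_word(word):
    """Move leading consonants to the end and add 'ay' ('way' for vowels)."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            if i == 0:
                return word + "way"
            return word[i:] + word[:i] + "ay"
    # No vowel at all: just append the suffix.
    return word + "ay"


def pseudo_localize(text):
    """Convert every run of ASCII letters in text to Pig Latin."""
    return re.sub(r"[A-Za-z]+", lambda m: pig_latin_word(m.group(0)), text)


label = pseudo_localize("open file")
```

Running every catalog entry through pseudo_localize at build time would produce a fake locale without needing a translator.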
Open Issues:
- We need Chao to make a decision on when Chandler should get localized for which markets. Chao will need some context about what the costs are.
- Will we have a "Find" command for looking for things inside an Item, or only "Search" for finding which messages have that string?
- Which of the following will Chandler need to implement or modify (vs. using directly from some third-party project)?
- message handling
- resource handling
- formatting/parsing
- timezone
- locale
- sorting
- searching ("Find" and "Search")
- text manipulation such as truncation (e.g. adding run-time checks)
- charset conversions
- When you change the system timezone, what do you want to happen to your calendar?
- Should we use gettext or a repository call?
- How will dialog localization be handled? Depends a bit on whether wxWidgets dialogs are used, vs. CPIA.
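To make the truncation item in the list above concrete, here is a small sketch of why naive slicing of Unicode strings goes wrong and what a safer helper might look like. The name safe_truncate is hypothetical, and the check only covers combining marks; a full solution would use grapheme cluster rules (for example from ICU).

```python
import unicodedata


def safe_truncate(text, limit):
    """Cut text to at most limit code points without orphaning accents."""
    if len(text) <= limit:
        return text
    end = limit
    # If the cut would fall just before a combining mark, back up so the
    # base character and its marks are dropped together.
    while end > 0 and unicodedata.combining(text[end]):
        end -= 1
    return text[:end]


word = "ae\u0301"            # 'a', then 'e' followed by a combining acute
naive = word[:2]             # 'ae': the accent is silently lost
safe = safe_truncate(word, 2)  # 'a': the accented letter is dropped whole
```

A run-time check of the kind discussed could compare the two results and warn when they differ.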