On localization and its parameters

In our 21st century, when a personal computer performs billions of operations per second, there is no reason not to expect a good presentation of documents and data that observes the user's preferences in colors, look-and-feel and the so-called locale. The locale is a setting or parameter of a user's session, typically a windowing desktop but often just a terminal emulator, that controls important aspects of the behavior of application software, such as the language used to display messages, the presentation of time and date, the selection of fonts, and assumptions about text encoding. Other aspects, generally less fundamental, are the customary formats of numbers and currencies, lexical ordering and such.

Alas, the trouble with the locale is that it is inconsistently defined and even more inconsistently interpreted. I will focus here on the three fundamental settings found in the locale: language, timezone, and text encoding. For each I will briefly outline its current typical use and define its desired scope.

Language is the setting a user would define as the language in which he would like to read messages addressed to him by the application software. For example, the setting en_US asks for American English, while fr_FR asks for French as spoken and written in France. Applications that use this setting to determine the presentation of date, time and currency, to derive the timezone, or to assume a text encoding are wrong. The language setting in the locale should determine only the language of messages to the user. A multilingual application may, for example, be designed to handle text in many languages, while one language is selected by the user for interactions with the application itself - its menus, dialogs, messages.
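
The narrow scope argued for here can be illustrated with a minimal Python sketch. The message catalog, its keys and the message() helper are hypothetical names invented for this illustration; a real application would more likely use a catalog mechanism such as gettext. The point is only that the language tag selects message text and nothing else.

```python
# Hypothetical message catalog, keyed by the language setting only.
MESSAGES = {
    "en_US": {"file_saved": "File saved."},
    "fr_FR": {"file_saved": "Fichier enregistré."},
    "pl_PL": {"file_saved": "Plik zapisany."},
}


def message(language: str, key: str, fallback: str = "en_US") -> str:
    """Return the message text for `key` in the requested language.

    The language setting is consulted for message text only; it is never
    used to guess a timezone, a text encoding or a date format.
    """
    catalog = MESSAGES.get(language, MESSAGES[fallback])
    # Fall back to the default catalog if this one lacks the key.
    return catalog.get(key, MESSAGES[fallback][key])


print(message("fr_FR", "file_saved"))  # Fichier enregistré.
print(message("de_DE", "file_saved"))  # no German catalog: File saved.
```

An unsupported language falls back to a default catalog rather than changing any other behavior of the application.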

The timezone setting is necessary to obtain the local time and date from the universal time (e.g. UTC) available from many sources on the internet. This is often not a separate setting for the user's session but something set for the whole operating system according to the location of the hardware. However, a user accessing a computer system from a remote location may not be in the same timezone as the host. He and his session are still in some timezone, and this information should be transmitted to the remote host so that local time and date information can be generated correctly. A completely different issue is whether the user wishes time and date to be presented according to the location of his session or the location of the host - this is up to the application to decide. It is important that the timezone setting not depend on, or be mixed up with, the language setting described above. No conclusions about the timezone should be drawn from the language setting.
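
As a rough illustration of keeping the timezone separate from the language, the following Python sketch converts one UTC instant to both the session's zone and the host's zone. The local_time() helper and the two zones are assumptions made for the example, and fixed UTC offsets are used only to keep it self-contained; a real system would use named zones with daylight saving rules.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def local_time(utc_instant: datetime, zone: tzinfo) -> datetime:
    """Convert a timezone-aware instant to the given zone."""
    if utc_instant.tzinfo is None:
        raise ValueError("utc_instant must be timezone-aware")
    return utc_instant.astimezone(zone)


# One instant in universal time.
utc_instant = datetime(2002, 9, 15, 12, 0, tzinfo=timezone.utc)

# Hypothetical setup: the user's session is in Central European Summer
# Time while the remote host runs on US Eastern Daylight Time. Neither
# zone is derived from the language setting.
session_zone = timezone(timedelta(hours=2), "CEST")
host_zone = timezone(timedelta(hours=-4), "EDT")

print(local_time(utc_instant, session_zone).isoformat())  # 2002-09-15T14:00:00+02:00
print(local_time(utc_instant, host_zone).isoformat())     # 2002-09-15T08:00:00-04:00
```

Which of the two results to display is the application's decision; the sketch only shows that both are available once the session's zone is known to the host.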

The text encoding setting usually does the most damage. One problem is that it is not explicit, but rather surreptitiously derived by many applications from the language setting and applied to any text whose encoding is not marked up (or even when it is marked up). For example, the Java VM on Linux sets the system property file.encoding to ISO-8859-1, apparently derived from my language setting en_US, even though I read and write a lot in Polish, for which this encoding is not suitable. For people who use a number of languages and have documents in various encodings this is often wrong and results in damage to the text.

The second problem occurs when the encoding setting is given by some environment variable or desktop setting (as in KDE), and a number of applications assume they should apply the given encoding to any text data source, as well as select only fonts that reflect the character indexing implied by the encoding. Then we get text damage (when editing and saving) and a poor selection of fonts. For example, if I choose UTF-8 as the character encoding in KDE I will probably be offered only fonts supporting the entire Unicode character set. Applications should be smarter at this point. Assuming the application can figure out the correct encoding, it should take a hint from any language markup present to narrow down the character set to a smaller subset of Unicode that may be represented by a larger selection of fonts.

I think we should do away with the global character encoding setting. Management of character encoding should be left to applications and their data sources. In the absence of character encoding markup in the text data source, UTF-8 should be assumed. The XML specification is leading the way in this direction.
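
The proposed rule (use the encoding the data source declares, otherwise assume UTF-8, and never consult the language setting) might be sketched in Python as follows. The decode_text() helper and its declared parameter are hypothetical; how the declaration is found, for instance in an XML declaration or a protocol header, is left out.

```python
from typing import Optional


def decode_text(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode bytes using the declared encoding, else UTF-8.

    Decoding is strict: bytes that do not fit the encoding raise
    UnicodeDecodeError instead of silently damaging the text.
    """
    return raw.decode(declared or "utf-8")


polish = "zażółć gęślą jaźń"

# No markup: UTF-8 is assumed.
assert decode_text(polish.encode("utf-8")) == polish

# Markup present: the declared encoding is honored.
assert decode_text(polish.encode("iso-8859-2"), "iso-8859-2") == polish

# What happens when an encoding is guessed from a language setting such
# as en_US: every byte is valid ISO-8859-1, so nothing fails and the
# text is silently damaged.
damaged = polish.encode("utf-8").decode("iso-8859-1")
assert damaged != polish
```

The last case is the failure described above: the guess produces no error, so the damage is noticed only after the text has been edited and saved.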

Let us make UTF-8 the ASCII of the 21st century! Let the user tell the machine in which language he should be spoken to! Let the user tell the machine his location on the globe and how time and date should be presented to him!

A final note. The operating system itself is a kind of robot that facilitates access to the CPU, memory and peripherals for the programmer. The OS should not be represented as having a "native" language. It should support file systems that are navigable via user-readable strings, but these should be UTF-8 without any attached language qualification. The messages that the OS generates are mostly for the administrator, who sets the preferred language for them. The OS as supplied by a vendor will typically be limited to generating messages in a group of supported languages, but none of those should be seen as the "native" language of the OS. Applications should be supported by the OS well enough that they can themselves support a variety of languages requested by the user.

September 2002