
Linux Internationalization Problems

Linux continues its march to the desktop, strengthened by the arrival of Open Office and other non-hacker applications, but what good are these apps to you if they don't speak your language? In today's editorial, Juraj Bednar asks that the community not forget localization if it wants Linux to be an alternative for the non-English-speaking world.

Many people in English-speaking countries are pushing Linux to the office. They now have wonderful Open Source office suites like AbiSuite, KOffice, and Open Office, and some commercial office suites as well. Everything works just fine for English users (about 508 million people speak English, which is about 12% of the world's population; the percentage is much higher when counting people connected to the Internet, but it's still not everyone). There are some efforts to support very problematic languages (Chinese, Japanese, Korean) which use different characters to encode their writings, but others are not receiving all the attention they need.

In this article, I would like to explain the basic issues with Central European languages and how to avoid making mistakes. The first step in making Linux "your-language-friendly" is to create a locale for your language. A locale is a set of definitions of how to represent and process various data types like time, date, monetary symbols, special characters, and so on. One of the important parts of a locale is the so-called message translation definition, a set of files which define how certain messages are translated to that particular language. There's usually one such file for an application, a hash table which contains all the application's messages, so it's generally the translation of the program's user interface.
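
The "hash table of messages" idea can be sketched in a few lines of Python. This is only an illustration of the lookup, not how gettext stores its catalogs on disk; the catalog contents and the `translate` helper are made up for the example.

```python
# Hypothetical in-memory catalog for one application and one language (Slovak).
# Real catalogs are compiled files that the gettext library looks up for you.
SK_CATALOG = {
    "Open file": "Otvoriť súbor",
    "Quit": "Koniec",
}


def translate(catalog, message):
    """Return the translation of message, or message itself if none exists."""
    return catalog.get(message, message)


print(translate(SK_CATALOG, "Open file"))   # translated text
print(translate(SK_CATALOG, "Save as..."))  # falls back to the English text
```

The fallback to the original string is why a partially translated application still works: untranslated messages simply appear in English.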

The problem is not related to including these locales in certain distributions (they're part of glibc, and it's quite easy to add to glibc if you want to). The problem is with setting the locale parameters for each user. This is what almost no distribution considers when setting up users. There should not be a system-wide default, because Linux is a multiuser environment. Each user should be able to set his own language variables. If he wants to do it now, he has to edit his .bashrc or similar file to have the proper values set. This is not very user-friendly.
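
The variables a user would export from .bashrc are read with a fixed precedence: LC_ALL first, then the per-category variable, then LANG. The sketch below models that lookup in Python; `effective_locale` is a hypothetical helper written for this example, and the final "C" fallback reflects glibc's behaviour rather than a POSIX guarantee.

```python
import os


def effective_locale(category, env=None):
    """Return the locale name a POSIX program would pick for one category.

    Precedence: LC_ALL, then the category variable (e.g. LC_TIME), then LANG.
    Empty values count as unset. The final "C" fallback is what glibc does;
    POSIX leaves that default to the implementation.
    """
    env = os.environ if env is None else env
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


# A user who wants Slovak messages but English date formats.
user_env = {"LANG": "sk_SK", "LC_TIME": "en_US"}
print(effective_locale("LC_MESSAGES", user_env))  # sk_SK
print(effective_locale("LC_TIME", user_env))      # en_US
```

Because the lookup is per process, each user can carry different values, which is the reason a single system-wide default is not enough.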

There is almost no problem with locales and translations, so, in most distributions, you can see the messages in your language when you set the correct locale and have the messages installed. Now we want to type our characters and see them, so we need fonts for displaying. There are not enough free fonts, but most distributions include those which are available for each character encoding.

Keyboards are more difficult; there's no general way for users to configure them. Many distributions with graphical installers used to read a directory called rulesets, which is outdated and contains misleading information (for example, a "Czechoslovakian keyboard" -- complete nonsense, since Czechs and Slovaks use different keyboards and different characters). There is also the problem of not setting the correct locale, which causes the keymap to work incorrectly. These are major internationalization problems, but they are not difficult to solve. All it takes is a bit of goodwill from the distribution creators (they can contact me if they want to discuss something; I really want Linux to be usable in my country and with my language).

There are also more difficult problems to solve. One is the problem of locale and switching keyboards. When I went to Norway last year to visit a friend, I found a problem: I wanted to switch between Slovak, English, and Norwegian keyboards. Since it's quite easy under some systems, I thought it would be no problem with Linux. I launched xkbsel, which switches the keyboards "on the run". The problem was that the keyboard doesn't work without the correct locale. If I started xterm with locale set to Slovak, I could not type Norwegian characters. If I set it to Norwegian and started another xterm, the Slovak characters were not working. The cause was that the Slovak keyboard mapped keys to ISO8859-2 characters, while the Norwegian keyboard used the ISO8859-1 charset. There are characters in one charset which are not present in the other. It was not possible to use the particular keyboard without setting the corresponding locale. Currently, this means restarting the application with the correct locale set.
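
The charset mismatch is easy to reproduce. The following sketch uses Python's built-in codecs as stand-ins for the two character sets; the sample letters and the `encodable` helper are chosen for illustration only.

```python
def encodable(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


slovak = "ľščťž"      # a few Slovak letters
norwegian = "æøå"     # the three extra Norwegian letters

for charset in ("iso8859-1", "iso8859-2"):
    print(charset, encodable(slovak, charset), encodable(norwegian, charset))
```

Each charset accepts one sample and rejects the other, so an application locked to a single 8-bit charset cannot represent both keyboards' output at the same time.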

The next problem arises when translating applications. Currently, we mostly use GNU gettext to do the translation. It is quite nice, but, in some cases, not sufficient. In many languages, the translation of a sentence can differ according to the context. Since there is no context information, it is difficult to make correct translations. The KDE team solved this issue by putting the context information into the message identifiers, so it works correctly with a few workarounds, but that's not a real solution. In English, for example, a noun differs only in its singular and plural forms (you have "one file" and "two files"). In Slavic languages, the plural form is often not regular (in Slovak: "1 súbor", "2, 3, 4 súbory", "5, ... súborov"). This is another issue to be considered when creating an application (currently, the programmer has to think about this, but the easy solution would be to create a framework).
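
What the programmer currently has to write by hand for Slovak looks roughly like the sketch below. It encodes only the three forms quoted above; the function names are hypothetical, and a real framework would take the rule from the translation rather than from application code.

```python
SUBOR_FORMS = ("súbor", "súbory", "súborov")  # the forms quoted in the text


def slovak_plural_index(n):
    """Pick the form index for a count n: 1 -> 0, 2..4 -> 1, anything else -> 2."""
    if n == 1:
        return 0
    if 2 <= n <= 4:
        return 1
    return 2


def count_files(n):
    """Format a count of files with the matching Slovak noun form."""
    return f"{n} {SUBOR_FORMS[slovak_plural_index(n)]}"


for n in (1, 3, 5, 22):
    print(count_files(n))
```

An English-style `n == 1` test can only ever choose between two strings, which is why it cannot be reused for languages with three or more forms.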

The KDE team is developing workarounds for most of the problems I describe here, but I also want other developers and distribution manufacturers to be aware of these problems and to try to solve them. Otherwise, Linux will stay English-centric, and that would be bad for Linux itself.


Juraj Bednar (http://www.darkie.sk/index.en.php) is a security consultant and a columnist for a Slovak computer magazine. He has been a member of the KDE i18n team since the 1.0 release. He can be reached at bednar@rak.isternet.sk.



Recent comments

24 Sep 2002 00:56 proskin

Using gettext to display plurals
GNU gettext has supported plurals since version 0.10.36 (released in March 2001). Look for the ngettext() function in the manual.
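
Python's standard gettext module exposes the same call, so the interface can be tried without any C code. In this sketch no catalog is loaded, so the English rule applies; `files_message` is a hypothetical helper.

```python
import gettext

# No catalog is loaded here, so NullTranslations applies the English rule
# (singular for n == 1, plural otherwise). With a real Slovak catalog, the
# catalog's Plural-Forms header would select among three forms instead.
translations = gettext.NullTranslations()


def files_message(n):
    """Return a counted message using ngettext for plural selection."""
    template = translations.ngettext("%d file", "%d files", n)
    return template % n


print(files_message(1))
print(files_message(5))
```

The point of the interface is that the program passes the count and both English forms, and the translation decides how many forms exist and which one fits.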

15 Mar 2001 03:42 nikolasomlev

cyrillic
I am very disappointed that I haven't come across a Linux distribution supporting Bulgarian. Recently I've tried RedHat 7.0, Slackware with kernel 2.2.16, Mandrake Helium ... None of them succeeded in providing writing in Bulgarian, either on the console or in office applications, not to mention printing. I'm kind of angry.

28 Oct 2000 17:42 ramv

Linux Internationalization problems
Has anyone checked out ICU? The article mentions problems with singular and plural nouns in languages... this has already been taken care of by the ChoiceFormat API. Rendering of complex scripts is available, and so are bidirectional rendering, word breaking, and line breaking. It currently ships with Debian distros ;-)

Nagari scripts?
I do not agree with the comment that Unicode is inadequate for representing Nagari scripts. Unicode's character-glyph model treats a character as a "character", not as the associated glyph. Every Nagari variant alphabet has a finite number of characters, and sounds are represented by the use of conjuncts/ligatures. Unicode defines a standard algorithm for rendering Indic scripts, and I have not seen a problem rendering them provided you have a smart layout engine. Taking the example of Bangalore, it will look funny if you look at the hex dump, because most people are used to the English/ASCII form of representation. The important thing to understand here is that when a ligature has to be formed from consonant+consonant+vowel, here GA+LA+OO, the base consonant sounds as if a virama is attached to it, i.e., it is GG; the secondary consonant is stressed with the vowel. IMHO complex ligatures can be adequately represented.


I18N.
The POSIX locale format is too dumb for localization: it has no concept of inheritance. The latest technical report on the POSIX locale format, TR 14652, is built on the same flawed model, so I cannot expect it to be any better. I prefer the ICU/Java locale model.
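
Inheritance here means that a specific locale stores only what differs from a more general one. A minimal sketch of that lookup follows; the bundle data and both helper names are invented for the example, and the real ICU and Java resource-bundle rules have more cases than this.

```python
def fallback_chain(locale_name):
    """List locale_name followed by its ever more general parents, ending in "root"."""
    parts = locale_name.split("_")
    chain = ["_".join(parts[:i]) for i in range(len(parts), 0, -1)]
    chain.append("root")
    return chain


def lookup(key, locale_name, bundles):
    """Return the first value for key found along the fallback chain."""
    for name in fallback_chain(locale_name):
        if key in bundles.get(name, {}):
            return bundles[name][key]
    raise KeyError(key)


# Hypothetical data: each level stores only what differs from its parent.
BUNDLES = {
    "root": {"decimal_point": ".", "currency": "¤"},
    "hu": {"decimal_point": ","},
    "hu_HU": {"currency": "Ft"},
}

print(fallback_chain("hu_HU"))
print(lookup("decimal_point", "hu_HU", BUNDLES))
print(lookup("currency", "hu_HU", BUNDLES))
```

A flat POSIX locale definition, by contrast, has to spell out every field for every locale, so a fix in one place does not propagate to related locales.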

24 Oct 2000 11:05 danohnesorg

Don't cry, just do it
I must again say something in response to the comment from Egmont Koblinger. We have similar problems, but we have solved many of them.

You say that something is untranslated or badly translated. But that is not a problem of the developers. It is a problem of the Hungarian localization teams. You should help them make perfect translations. I know it is difficult, and I see that many more programs are translated into Czech than, for example, into German. There are 10 million Czechs in the world and 100 million Germans, yet there aren't people who would translate things into their native language. I think the reason is that every German can buy Windows for a week of work, while only a few Czechs can buy Windows even for two months' salary, so there is great interest in things which are cheap. But there are also people who don't find it boring to translate something in the evenings. You should find such people in Hungary and organize them. Why should the Hungarians have fewer translations than the Czechs?

We also have two versions of the locale: one has three-letter names of months (this is very unusual, because our month names do not all differ in their first three characters), and a second version has longer names. Even mc can work with them.

TeX and LaTeX: we had many problems with these packages, because even Donald Knuth did not know all the characters the Czech language uses. But we also have very good TeX gurus, who have made csplain and cslatex, which can use our characters, our hyphenation, our special ways of using hard spaces, etc. Thanks to SuSE, we even have PostScript fonts in our encodings.

Everything can be done, but we are the workers who must do it. Even if you cannot make patches, you can at least send letters to the developers. Sometimes it helps very quickly (yes, I have sent about 50 letters to Netscape and nothing got better, but that is another story). For example, modlogan now supports Czech and other languages, and I sent only ONE e-mail with an exact description of the problem and a suggestion for the developer. There had been only one obstacle: no one had reported that it had problems with other languages.

Everyone should send letters to developers, because right now they are saying that the Czechs always want something special...

23 Oct 2000 07:46 egmont

Still a lot of work to do...
We must distinguish several kinds of users. There are hackers who know
how to set their own LANG and other stuff, but English is usually right
for them. There are users who can spend several hours in front of their
computers to find out how to set LANG, and sooner or later they will do
this. There are system administrators, who want to set a default LANG
for users but make it easily changeable for them. And there are users
who don't know anything about that, they only want the computer to talk
to them using their native tongue. As Linux becomes more than just a
hackers' OS, the number of people belonging to this last group increases
very fast, and programmers must take this into account. We all want Linux
to be a friendly OS to all those people who are not willing to learn what
.xsession is and how a simple text editor works, but only click the
mouse and use several big applications such as netscape, staroffice, etc.


As a system administrator who upgraded our system to SuSE 7.0 yesterday,
I played a bit with the LANG=hu_HU setting and wondered about
setting it as a default for users. I was very, very disappointed.

First, there are only really few applications that can speak Hungarian.
"mc" is one of these. The main starting screen of "mc" already contains
a typo. The names of the months and weekdays are written in all uppercase
characters, which is, needless to say, very disgusting. (Okay, I'm not
talking about mc anymore, I'm talking about glibc.) In real Hungarian,
the names of months and weekdays do not even begin with an uppercase
character (except, of course, at the beginning of a sentence). The
abbreviation of several months is 4 or 5 characters, though only the
first 3 of them are present in libc. In case mc stripped those
characters because of its string formatting rules, I'd say it's okay,
but this is not the case, here glibc is incorrect.

My next disappointment was alphabetical sorting, this also has a very
big bug. By the way, if you set LANG=hu_HU in your .bash_profile or
something like this, this will change the sorting to (buggy) Hungarian
in all the subshells and commands started from this shell, but not for
this shell, so an "echo *" will still use the default sorting, but
starting a "bash" manually and typing "echo *" there will use the
Hungarian sorting. Is there a correct solution for this? There could be.

On many systems there's a file /etc/environment. This is a very good
thing since the system administrator can set default variables for the
users without having to edit several different shell's initialization
files. If login or sshd or xdm or anything uses this file, then a LANG
setting there can solve the problem, but only for those who are
satisfied with the default LANG set by the superuser. A very clear
solution that I think we should implement is the following. For all the
programs that authenticate a user, such as "login", "sshd", "su",
"[xkgw]dm", parsing /etc/environment and then ~/.environment should be
mandatory. A glibc call for doing this should be written, which can
filter LD_* if the developers think this is a security hole, but I don't
think so; users may find it useful if their login shell is started with a
special LD_PRELOAD variable. When this is done, a nice graphical
interface could be written, which allows the user to enter any
name=value pair, and also to choose the value for LANG, TZ and
other special variables from a list. Technically this graphical
application should write ~/.environment, or optionally /etc/environment
if run by root. (Graphical should mean both X11 and ncurses frontends.)
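
The proposed lookup order can be sketched as follows. This is an illustration only: ~/.environment is the proposal made here rather than an existing convention, the simple name=value syntax is an assumption, and both function names are hypothetical.

```python
def parse_environment(text):
    """Parse name=value lines; blank lines and lines starting with # are skipped."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        variables[name.strip()] = value.strip()
    return variables


def login_environment(system_text, user_text, filter_ld=True):
    """Merge system defaults with user overrides, optionally dropping LD_* names."""
    merged = parse_environment(system_text)
    user = parse_environment(user_text)
    if filter_ld:
        # Only the user's file is filtered; the system file is trusted.
        user = {k: v for k, v in user.items() if not k.startswith("LD_")}
    merged.update(user)
    return merged


system_file = "LANG=en_US\nTZ=Europe/Budapest\n"
user_file = "# my settings\nLANG=hu_HU\nLD_PRELOAD=/tmp/x.so\n"
print(login_environment(system_file, user_file))
```

Doing the merge once, in whatever authenticates the user, means that login, sshd, su and the display managers would all hand the session the same variables.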

This method is still not perfect, since KDE wants to change the language
of the application on the fly, but much better than any distribution
has.


Returning to the Hungarian libc messages for a short time, libc
error messages (such as "No such file or directory") are not yet
translated to Hungarian. I wonder why.


We can talk about internationalizing applications, but as long as these
basic problems mentioned above are not solved, you can't really do much.
You can't expect that the system administrator will explain to all the
users that if they'd like to change the language, they need to set LANG
in at least two files, .bash_profile and .xsession, and explain to them how
to use a simple text editor. You can't expect that a user who tries
Linux at home will find it out before giving up. We need to help all of
these users by providing exactly one, very clearly designed and
implemented way of choosing a language, and this should be both an X11
and an ncurses utility, and we must not forget, most of the users do not
care about how this utility works, what files it modifies, they only
need a tool which is 100% perfect.


Let us go to the different fonts, such as Latin-2 used in Hungarian.
Look at all those applications that are developed at US-ASCII or Latin-1
parts of the world. Look at netscape, when you download a Latin-2 page,
it is displayed using Latin-1, and after clicking on 'reload' it will be
displayed correctly for the second time. Look at all those oversized
office packages that cannot handle Latin-2 either on the screen or when
printing. Look at sgml-tools, which pretends to be one of the best
documentation systems, though it is impossible to generate Latin-2 TeX
files with it. For the developers of sgml-tools, it would take approximately 5
minutes to add a command line option which generates Latin-2 TeX, DVI
and PS files. Now, if I want to generate a Latin-2 PS file, I first have to
create a TeX file using sgml-tools, change the character set in it
either manually or with a sed script, and give it to latex to compile.
By the way, it took me about an hour to solve this (after several years of
experience with Linux and TeX); I needed to use strace to find out what
programs sgml-tools launches when creating a PS file and what extra
environment variables it passes to LaTeX. Thank you, all the developers
of sgml-tools.

And look at all those programs developed in Central Europe. Look at
"links" for example, which is by far the best web browser I've seen and handles
all the characters correctly. (Okay, "lynx" is very good, too.)


And what about Latin-2 on the Linux console? If you set a Latin-2
character set, the line drawing characters of "mc" will not work. Is
there a solution? Yes, there is: some years ago I spent several days
creating a character set based on cp437, changing several characters'
layout to the Hungarian accented characters. Yes, you see, this is not a
Latin-2 character set I've made, it only contains Hungarian letters
correctly. And this is the only character set I know about which
contains all the Hungarian characters and "mc" still draws nice boxes.
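
The trade-off between the two code charts can be checked with Python's codecs, used here only as stand-ins for what a console font based on each chart would contain. The sample characters and the `missing_from` helper are chosen for illustration.

```python
def missing_from(charset, characters):
    """Return the characters that charset cannot represent."""
    missing = []
    for ch in characters:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            missing.append(ch)
    return "".join(missing)


hungarian = "áéíóöőúüű"    # lowercase Hungarian accented vowels
box_drawing = "─│┌┐└┘"     # single-line box characters of the kind mc draws

for charset in ("iso8859-2", "cp437"):
    print(charset, repr(missing_from(charset, hungarian + box_drawing)))
```

Latin-2 has every letter but no box characters, while cp437 has the boxes but lacks two letters, which is the gap a hand-made hybrid character set fills.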

Juraj mentioned the problem of plural form in Slavic languages. In
Hungarian there's a similar situation, "the 4th" and "the 5th" are
translated to "a 4." and "az 5.", respectively. Whether to use "a" or
"az" for "the" works the same way as choosing "a" or "an" in English. It is
about 2 or 3 lines of C code that can determine whether the
pronunciation of a number begins with a consonant or a vowel. But even
Hungarian programmers are too lazy to code this, they simply write
"a(z) %d.". This problem really should be solved once and for all in a library
(either libc or a different new one) for all the languages around the
world. Obviously translating the English message to all the other
languages is a brain damaged idea. The correct solution is to translate
the message which is in a special abstract language to all the human
languages including English. We need to help the translators using
macros. For example a newly created printf2() call should have a %Naz
macro, where N is a number, for example %3az is replaced by "a" or "az"
depending on the pronunciation of the number given as the 3rd argument
to printf2(). Different languages should have different macros
implemented.
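
A sketch of the "a"/"az" decision, in Python rather than C, might look like this. The rule coded here is an assumption based on how Hungarian numbers are read aloud (a leading 5 gives "öt...", and a leading 1 gives "egy", "ezer" or "egymillió" only in the ones, thousands and millions positions); it is not taken from any existing library, and both function names are hypothetical.

```python
def hungarian_article(n):
    """Return "a" or "az" for a positive integer n read aloud in Hungarian.

    Assumed rule: the spoken number starts with a vowel when it begins with
    the digit 5 (öt...), or with the digit 1 in the ones, thousands,
    millions, ... position (egy, ezer, egymillió).
    """
    digits = str(n)
    if digits[0] == "5":
        return "az"
    if digits[0] == "1" and len(digits) % 3 == 1:
        return "az"
    return "a"


def ordinal_with_article(n):
    """Format "the n-th" the way the examples above do, e.g. "az 5."."""
    return f"{hungarian_article(n)} {n}."


for n in (4, 5, 1, 10, 1000):
    print(ordinal_with_article(n))
```

Packaged behind something like the proposed %Naz macro, a translator would never need to write "a(z)" again.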


There's one more problem with translations. Often translation of a menu
contains the same shortcut key for two different actions. Try "mc"
(version 4.5.50) with LANG=hu_HU, try to delete a non-empty directory.
When it asks for confirmation the second time, the actions "All" and
"Cancel" both have "M" as the Hungarian shortcut key. Fortunately
pressing "M" activates "Cancel". If it activated "All", users could lose
their files only because of the wrong shortcut keys in the translation.
The same problem appears in many menus in KDE.
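
Collisions like this can be caught mechanically before a translation ships. The sketch below assumes hotkeys are marked with an "&" before the letter; the Hungarian labels are illustrative guesses, not the actual strings from mc, and both function names are hypothetical.

```python
from collections import defaultdict


def hotkey(label, marker="&"):
    """Return the lowercase hotkey letter following marker, or None."""
    pos = label.find(marker)
    if pos == -1 or pos + 1 >= len(label):
        return None
    return label[pos + 1].lower()


def hotkey_conflicts(labels):
    """Map each hotkey used more than once to the labels that claim it."""
    by_key = defaultdict(list)
    for label in labels:
        key = hotkey(label)
        if key is not None:
            by_key[key].append(label)
    return {key: names for key, names in by_key.items() if len(names) > 1}


# Hypothetical Hungarian button labels for Yes / No / All / Cancel.
dialog = ["&Igen", "&Nem", "&Mind", "&Mégsem"]
print(hotkey_conflicts(dialog))
```

Run over every dialog of a translated catalog, a check of this kind would flag the dangerous cases without anyone having to click through the program.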

So as I see the two most important things to do are to create a standard
on how to set the LANG variable and other stuff (~/.environment, etc),
and to design and implement a framework where it is very easy to write
a word in plural in Slavic languages, very easy to make the name of the
month appear either at the beginning of a sentence or in the middle,
very easy to write correct "a" or "az" in Hungarian, and so on.

Nowadays translations are usually written by programmers. They very
often make smaller or bigger mistakes, or disgusting translations.
Rather, translations should be made by people who are good at
literature and grammar, and can use computers at a basic level. The job of
the programmers is to design a system that can very easily be used by
those who want to translate applications.
