
A Proposal for a True Internationalization

7bit character streams are the most secure against misinterpretation. When I send email, I leave everything else out (though we Germans have other symbols, too), just in case there is a machine that cannot handle it. It has become a whole lifestyle; writing everything in lower-case letters and putting a smiley at the end of the line seems to express global thinking. But if we are honest, we know it expresses nothing but a deficiency in modern character processing. This heritage of the 70s (the start of Unix system distribution) is a hard hurdle to overcome. Fortunately, the need for usability is getting stronger and the pride of programmers and administrators is getting weaker. As a student of the Japanese language, I went through many sleepless nights setting up user variables, input parsers, and terminal stuff, so I think I know the difficulties. With this article, I will try to express a proposal coming from both sides in me, the programmer and the user.

Today, it is quite common that people from different countries are working together. Also, though every employee may have his or her own terminal, it's likely that common applications are provided by a single file server. This creates a need for multi-language operating systems. It sounds critical, but it isn't (yet). The commands are exactly the same (Chinese sysadmins also type "mount", though I can imagine language-dependent symlinks) and answers may vary only a little (e.g., "y/n" in German is "j/n"). Until now, all other tasks of internationalization have been avoided by the system, and applications have to take up the slack. If they don't (if, for example, the administrator forgot to install the appropriate .po files), you can be lost on a terminal with pictograms that, to you, mean nothing.
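For illustration, here is a minimal gettext sketch in C (the "myapp" domain name and catalog path are mine, just placeholders; the _() macro is the usual shorthand): if the matching .mo catalog is installed, the user sees the translated prompt; if the admin forgot it, the English msgid comes back unchanged.

    /* A minimal gettext sketch, assuming a hypothetical "myapp" domain
     * whose compiled .mo catalogs live under /usr/share/locale. */
    #include <stdio.h>
    #include <locale.h>
    #include <libintl.h>

    #define _(msgid) gettext(msgid)

    int main(void)
    {
        setlocale(LC_ALL, "");                      /* pick up the user's locale, e.g. de_DE */
        bindtextdomain("myapp", "/usr/share/locale");
        textdomain("myapp");

        /* If de/LC_MESSAGES/myapp.mo is installed, the German string is printed;
         * if the .po/.mo files are missing, the English msgid is returned untouched. */
        printf("%s\n", _("Overwrite file? (y/n)"));
        return 0;
    }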

The i18n movement, which started some years ago, solves a lot, but not everything. With it, only output is guaranteed to match the best gettext will find. What about the input? Multibyte strings, produced by input parsers like kinput2 or ami in an 8bit or 7bit environment, are hard to handle and break easily (if you press the delete button, it removes only half a sign). kinput2 and ami cannot run together in one terminal, because code pages intersect. Start and end sequences are one solution, but a bad one, and one especially not meant for the long run. Imagine a document full of different languages; if I want a function that gives a line length for this doc, it will be hell, and I haven't even mentioned what will happen when new languages with new start and end sequences are implemented.
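To see why, here is a minimal C sketch of the line-length problem (my own illustration, using plain UTF-8; the intersecting code pages the parsers produce make it even messier):

    /* A minimal sketch of why byte-oriented tools get "line length" wrong
     * for multibyte text (UTF-8 shown here; EUC or Shift-JIS have the same issue). */
    #include <stdio.h>
    #include <string.h>

    /* Count UTF-8 code points by skipping continuation bytes (10xxxxxx). */
    static size_t utf8_length(const char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }

    int main(void)
    {
        const char *line = "Gr\xC3\xBC\xC3\x9F Gott";            /* "Gruess Gott" with umlaut and sharp s */
        printf("strlen: %zu bytes\n", strlen(line));             /* 11 bytes      */
        printf("chars : %zu code points\n", utf8_length(line));  /* 9 characters  */
        /* A byte-wise "delete" that removes only the last byte would leave
         * half a character behind -- exactly the broken editing described above. */
        return 0;
    }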

Also, we have so many applications which handle text and formatting. Integrating multiple language parsers into them may take five times more effort than implementing the problem-specific algorithms. I think something like Microsoft's IME, a central (system-wide) solution, is needed here. Unfortunately, IME is not Open Source, and is therefore un(sup)portable.

Next problem: Character encoding. Oops, this discussion is as old as computers are. Every nation had its own coding scheme, using the same domains! What a crappy idea! How could somebody let this happen?! OK, you say we have Unicode. Unicode was a good idea, until they found that 16bit is too little. Also, look at Yudit's encoding list; there's not one single Unicode, but many: UTF-7, UTF-8, UTF-16, etc. Furthermore, Unicode text files have a starting sequence, and Windows saves Unicode with low-hi byte order, but POSIX systems don't. Java uses wide characters (16bit) internally. Wow! Now that means nothing. 16bit is just too little; it was only for the short run.
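As a small, self-made illustration of how many shapes one "Unicode" character can take on disk, consider the single letter "é" (U+00E9):

    /* A small illustration (my own, hypothetical example) of how one character,
     * U+00E9 ("e" with acute accent), looks on disk under different "Unicodes". */
    #include <stdio.h>

    static void dump(const char *label, const unsigned char *b, size_t n)
    {
        printf("%-20s", label);
        for (size_t i = 0; i < n; i++)
            printf("%02X ", b[i]);
        printf("\n");
    }

    int main(void)
    {
        const unsigned char utf8[]    = { 0xC3, 0xA9 };
        const unsigned char utf16le[] = { 0xFF, 0xFE, 0xE9, 0x00 };  /* BOM + char, Windows style  */
        const unsigned char utf16be[] = { 0xFE, 0xFF, 0x00, 0xE9 };  /* BOM + char, network order  */

        dump("UTF-8:",             utf8,    sizeof utf8);
        dump("UTF-16LE with BOM:", utf16le, sizeof utf16le);
        dump("UTF-16BE with BOM:", utf16be, sizeof utf16be);
        return 0;
    }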

Next problem: The console. Its fonts and behavior differ from those of X (which, ten years after the invention of TrueType fonts, still lacks correct handling; take a look at Abiword, and you will understand). If I were Chinese, I would want to also see Chinese on my console, but this is even harder than under X, not to mention input routines. But what's the difference for an input parser between X and the console?

Next problem: Somebody better stop me from complaining. We have to move on. We still use the old stuff, but are now saving in XML. This is not very revolutionary. I will try to take a step forward. I'd like to present a solution. It's time to think about an all-inclusive, simple, and working system design.

But first, again, a collection of the problems mentioned above:

  1. Machines are 7bit- or 8bit-oriented, and the input is, too. This is historical, but we have to overcome the compatibility paranoia, or there will be no progress.
  2. Code pages intersect, and we have to improve the Unicode scheme.
  3. We want input parsers on the operating system level, at a central place, which serve both X and the console. Also, a basic font that holds all characters for both X and the console is wanted (I don't like those question marks).
  4. We want neither start nor end sequences.
  5. We want user realtime language switching for input (and maybe for output, too).

Fortunately, there are now these advantages which we can use:

  1. We process addresses and integers as 32bit, we have 32bit buses, and next generation computers will have full 64bit architectures.
  2. Applications do not touch raw keyboard codes; they already get the values through the system (nearly unprocessed, but at least a keymap is used).
  3. Computers are quicker than ever, giving us enough time to parse the input correctly.
  4. So many routines for combining characters have already been written.
  5. We have enough memory for the BFF (Big Fucking Font).
  6. We have enough hard drive space for the text files.

Especially when I think about points 5 and 6 of the advantages, I say: Why bother? Let's give it a try. What I propose is, first, a new char type that is 32 bits wide. This will give us the security that in the future, no characters of any language will be left out. The most low-level routines (those that write to the buses) will have to be changed. Upper-level APIs may stay the same (as user programs do), as long as they do not play with overflow (255 + 2 = 1) calculation. And, for heaven's sake, I propose to use only 7 bits of each of the 4 bytes. Still, we would have around 270 million signs available. You might say "That's way too much; 3 bytes, like for my display, is enough!" Well, there are sound cards that process 24bit, but the processor has to pack it into 32bit packages to enhance speed, so in the end, there's no real advantage to 24bit. Also, in another 100 years, there might be the need for more. Please throw away the idea that you will see 4 bytes when you open a terminal! A char (you could call it sign or foo or bar if you like) will be an atomic piece of data. This view also fits in the modern multimedia processing arena, where sound data consists of 2-byte 2-channel or, for studio work, 24/32-bit multi-channel data structures. Binaries consist of 32bit, as do video streams, the most complex data we know today. The text file has just hindered us from taking this revolutionary step.
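A rough C sketch of what such an atomic 32-bit char could look like (the type name sign_t and everything around it are mine, purely hypothetical):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t sign_t;            /* one atomic, fixed-width character */

    int main(void)
    {
        sign_t text[] = { 'H', 'i', '!', 0 };   /* old ASCII values keep their numbers */

        printf("sizeof(sign_t)  = %zu bytes\n", sizeof(sign_t));   /* always 4 */
        printf("available signs = %lu\n", 1UL << 28);              /* 7 bits per byte -> ~268 million */

        /* Length is trivial again: one array element is one character,
           regardless of which language it comes from. */
        size_t len = 0;
        while (text[len]) len++;
        printf("length of text  = %zu characters\n", len);
        return 0;
    }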

If you think this will blow up your filesystem, you are most likely wrong. Take the sizes of your text files, multiply them by 4 (or 2 if you are using CJK-encoded text files), and compare them with your wav, MP3, or DivX files' sizes. Those files will not get bigger. The 7bit style is for old Internet routing hardware, but I think that in another 100 years, it won't still be there. Then, these domains may also be used. The encoding scheme is clear: the character's number is exactly what gets saved and loaded. No conversion. Hi-low byte order is preferred and seems more logical. We could do what Unicode did for 16bit, seamlessly integrating domains while leaving enough room between language domains. Unlike UTF-8, we don't want different sizes for Western and Eastern characters; that makes programmers unhappy and software difficult to control. Also, UTF-8 emphasizes the historical Western domination of computing science, which is not very friendly. No start and end sequences -- that's it.
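A sketch of that storage idea, assuming a hypothetical helper that writes each sign hi byte first (nothing here is an existing API):

    /* A sketch of the "no conversion" storage: each sign goes to disk as four
     * bytes, hi byte first, exactly as the number. File name and helper are
     * made up for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t sign_t;

    static void write_sign(FILE *f, sign_t s)
    {
        /* hi-low (big-endian) byte order, independent of the host CPU */
        fputc((s >> 24) & 0xFF, f);
        fputc((s >> 16) & 0xFF, f);
        fputc((s >>  8) & 0xFF, f);
        fputc( s        & 0xFF, f);
    }

    int main(void)
    {
        sign_t line[] = { 'H', 'e', 'l', 'l', 'o', '\n' };
        FILE *f = fopen("hello.txt32", "wb");
        if (!f) return 1;
        for (size_t i = 0; i < sizeof line / sizeof line[0]; i++)
            write_sign(f, line[i]);
        fclose(f);
        /* The file is exactly 4x the old size: 24 bytes for 6 characters.
           No start sequence, no end sequence, no state. */
        return 0;
    }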

Let's go on. There will still be a mapping between keyboard-sent codes and the 32bit chars attached to them, as a phase of preprocessing. The next step will be checking the user's input choice and sending the data to the parser. This parser will build buffers for input, syllable buffers, chosen readings, etc. The buffers will belong to the system. They will be cleared when switching between parsers, but we need the ability to preview what we are typing. Under X, window managers may have a small buffer dock app in which you could see the language symbol of what you're typing. Take a look at IME, and you'll know what I mean. It might be more difficult in the console, but with libs like ncurses, there might be a way to give a better view of what is being written.
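To make the idea concrete, here is a rough sketch of the per-user parser state the system might keep; every name in it is hypothetical, meant only to show what "buffers belonging to the system" could look like:

    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t sign_t;

    enum input_language { LANG_LATIN, LANG_JAPANESE, LANG_KOREAN /* ... */ };

    struct input_parser_state {
        enum input_language lang;      /* what the user switched to (realtime)          */
        sign_t raw[64];                /* keystrokes after the keymap, before parsing   */
        size_t raw_len;
        sign_t syllable[16];           /* current syllable being composed (e.g. kana)   */
        size_t syllable_len;
        sign_t candidates[8][32];      /* possible readings/conversions to choose from  */
        size_t candidate_count;
    };

    /* Switching parsers clears the state, as proposed above. */
    void parser_reset(struct input_parser_state *st, enum input_language lang)
    {
        st->lang = lang;
        st->raw_len = st->syllable_len = st->candidate_count = 0;
    }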

Also, shells might stop character echoing and write buffer contents instead, then clear back to the last breakpoint, write the new sign, change buffers, and write them again after the new sign. I did this once with a small Japanese console learning app, and it's fast enough that you don't see it happen. When pressing the enter key, I propose that we use only fully-composed input; otherwise, some shells would handle it one way and some another, giving no standard behavior.
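A minimal sketch of that echo-and-redraw cycle, using nothing but a carriage return and an ANSI erase sequence (purely illustrative; no existing shell works this way):

    #include <stdio.h>
    #include <string.h>

    static void redraw_line(const char *prompt, const char *buffer)
    {
        /* go back to the start of the line, wipe it, and rewrite prompt + buffer */
        printf("\r\033[K%s%s", prompt, buffer);   /* CR + ANSI erase-to-end-of-line */
        fflush(stdout);
    }

    int main(void)
    {
        char buffer[256] = "";
        const char *prompt = "$ ";

        /* after every processed sign the whole edited line is repainted */
        strcat(buffer, "ka");  redraw_line(prompt, buffer);
        strcat(buffer, "na");  redraw_line(prompt, buffer);  /* a parser could swap "kana" for kana here */
        printf("\n");
        return 0;
    }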

Now I will think about one of the most-feared things in the computer world: the change of data structures and backward compatibility. First, a single machine holds its data -- text files, binaries, etc. It is most likely connected to other machines or the Internet, and that's where I begin. These new-generation operating systems that fully process 32bit data from hard disk to whatever will be compiled and booted on machines, but they will receive data through FTP or other services, and that data will be tons of chars of the old type. The service's buffer (char[]) will be a 32bit[] array, provided by the system (because sizeof(char) returns 4!). The service now writes it to disk; fortunately, the system does that for us, because the system is always afraid that some applications might damage the hardware. For the newly-installed machine, there is no difference in behavior except the file sizes for texts. When the new machine provides services, there will be the problem of sending too much information: if the client wants the next byte (or char), there might be a problem when it gets a value above 255 (or 127 signed) and dumps an error message or disconnects.
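At that boundary, the conversion is trivial in one direction and lossy in the other; here is a small sketch of both (helper names are mine, not any existing API):

    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t sign_t;

    /* incoming: every legacy byte becomes one sign (values 0..255 keep their meaning) */
    void widen(const unsigned char *in, size_t n, sign_t *out)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = in[i];
    }

    /* outgoing: returns 0 on success, -1 if a sign cannot be squeezed into the
     * old 8-bit wire format -- the "value above 255" problem described above */
    int narrow(const sign_t *in, size_t n, unsigned char *out)
    {
        for (size_t i = 0; i < n; i++) {
            if (in[i] > 0xFF)
                return -1;          /* an old client would choke; refuse or transcode */
            out[i] = (unsigned char)in[i];
        }
        return 0;
    }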

In the end, there will be no progress without backward-compatibility problems. Network connections are a big advantage here. We should try to use them and finally throw away our fear, because the gain will be a clear character processing solution that works world-wide, with no more hassles with encoding schemes and browser display problems, and a user-friendly, simple-to-use, speedy, and secure multi-language interface. Also, encoding information can be left out, which cleans up email and XML files. The first big step will be to change low-level system routines, and for that I wish us all some more courage towards a change of thinking.

Recent comments

26 Jan 2003 14:23 Avatar kjkhyperion

Re: It's not that bad, actually

> Also, it's just not true that Unicode
> text files start with a special
> character sequence. That might be a bad
> Windows habit, but it's not required by
> any standard.


I fail to see how it could be "bad". After all, it's just a matter of putting a zero-width space at the beginning of the file: even if programs don't recognize it, it will just show up on screen as... nothing, since a zero-width space has neither a glyph nor any width.

You don't want to support storage in anything but UTF-8? Well, who are you to decide? Someone will probably have to, and the Byte Order Mark (a Unicode standard, just not a requirement, as any definition of "beginning of the stream" can't cover all the possible cases) is the smartest way it could have been done.

04 Sep 2002 11:14 Avatar barrett9h

Re: a whole lifestyle
That's true, but I only use this kind of writing when I'm "talking" to my friends on email, irc, icq.


When writing a letter or something more important, I like to write the "right" way, and I still know how to write correctly when I need to... :)


The language is always evolving through history, and this (net talk) is just one more adaptation.

03 Sep 2002 17:07 Avatar LodeRunner

Re: a whole lifestyle

> i just use some adaptation, when
> apropriated.. (for ex. im portuguese "e"
> and "é" are two different words,
> and i type them as "e" and "eh".
> everybody understands..) and don't care
> about being gramatically correct
> anyway.


Man, this is ugly. This is just like using
"naum" instead of "não". It is, like somebody
else said in this thread, a lack of self-respect
towards one's own language.


> the important thing is to comunicate,
> and i type much like the way i talk..


Yes, this is happening more and more. And it
is very unfortunate. The quality of written text is
decreasing. A friend of mine even wrote "aí eu
peguei e fui lá"... :(

30 Aug 2002 00:41 Avatar acli

Re: ASCII
If you read comp.fonts (through Google Groups, I suppose, to dig out the old, old articles), you may realize that ASCII does not even support English. Why? Because there is no code point for the two kinds of dashes required by grammar, and no distinction between opening and closing quotation marks. Worse (and contrary to what some people think, and try to make other people think), some code points (e.g., the apostrophe and grave accent) have valid alternative meanings (e.g., closing and opening single quotation marks). Some English words require accent marks. And the "Icelandic" thorn and eth letters used to be English letters a very long time ago.


IMHO, the decreasing quality of English punctuation use is directly attributable to the spread of computers.


The charset conversion problem is not Unix-specific. Users of Chinese or Japanese Windows / Macintosh see it all the time. English-speaking people were just not used to seeing this.
