Today, it is quite common for people from different countries to work together. Also, though every employee may have his or her own terminal, it's likely that common applications are provided by a single file server. This creates a need for multi-language operating systems. It sounds critical, but it isn't (yet). The commands are exactly the same (Chinese sysadmins also type "mount", though I can imagine language-dependent symlinks) and answers may vary only a little (e.g., "y/n" in German is "j/n"). Until now, all other tasks of internationalization have been avoided by the system, and applications have to take up the slack. If they don't (if, for example, the administrator forgot to install the appropriate .po files), you can be lost on a terminal with pictograms that, to you, mean nothing.
The i18n movement which started some years ago solves a lot, but not everything. With it, only output is guaranteed to match the best gettext will find. What about the input? Multibyte strings, produced by input parsers like kinput2 or ami in an 8-bit or 7-bit environment, are hard to handle and break easily (if you press the delete button, it removes only half a character). kinput2 and ami cannot run together in one terminal, because code pages intersect. Start and end sequences are one solution, but a bad one and one especially not meant for the long run. Imagine a document full of different languages; if I want a function that gives a line length for this doc, it will be hell, and I haven't even mentioned what will happen when new languages with new start and end sequences are implemented.
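The "half a character" problem can be sketched in a few lines. This is only an illustration: it assumes EUC-JP as the multibyte encoding (the article does not say which one is in use), and the two helper functions are hypothetical names, not part of kinput2 or any terminal.

```python
# Sketch: a byte-oriented "delete" cuts a multibyte character in half,
# while a character-oriented one removes it whole.
# Assumption: EUC-JP, where each kanji below occupies two bytes.

def delete_last_byte(raw: bytes) -> bytes:
    """What a byte-oriented terminal does on backspace."""
    return raw[:-1]

def delete_last_char(raw: bytes, encoding: str) -> bytes:
    """Decode first, drop one whole character, then re-encode."""
    return raw.decode(encoding)[:-1].encode(encoding)

text = "\u65e5\u672c\u8a9e"            # three kanji
raw = text.encode("euc_jp")            # six bytes in EUC-JP

broken = delete_last_byte(raw)
try:
    broken.decode("euc_jp")
    half_char_left = False
except UnicodeDecodeError:
    half_char_left = True              # a dangling lead byte remains

clean = delete_last_char(raw, "euc_jp").decode("euc_jp")
```

The byte-wise version leaves a stray lead byte that no longer decodes; the character-wise version leaves two intact kanji.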
Also, we have so many applications which handle text and formatting. Integrating multiple language parsers into them may take five times as long as implementing the problem-specific algorithms. I think something like Microsoft's IME, a central (system-wide) solution, is needed here. Unfortunately, IME is not Open Source, and is therefore un(sup)portable.
Next problem: Character encoding. Oops, this discussion is as old as computers are. Every nation had its own coding scheme, using the same domains! What a crappy idea! How could somebody let this happen?! OK, you say we have Unicode. Unicode was a good idea, until they found that 16bit is too little. Also, look at Yudit's encoding list; there's not one single Unicode, but many: UTF-7, UTF-8, UTF-16, etc. Furthermore, Unicode text files have a starting sequence, and Windows saves Unicode with low-hi byte order, but Posix systems don't. Java uses wide characters (16bit) internally. Wow! Now it means nothing. 16bit is just too little; it was only for the short run.
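To make the point about several "Unicodes" and byte order concrete, here is a small sketch using Python's standard codecs. The sample string and the helper detect_utf16_order are illustrative assumptions; note also that the byte-order mark is an optional prefix in Unicode, not something every file carries.

```python
import codecs

# One ASCII letter, one accented Latin letter, one CJK character.
text = "A\u00e9\u65e5"

# The same three characters occupy different numbers of bytes
# depending on the encoding form.
sizes = {name: len(text.encode(name))
         for name in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-be")}

# The same 16-bit unit written in both byte orders.
little = "A".encode("utf-16-le")       # low byte first
big = "A".encode("utf-16-be")          # high byte first

def detect_utf16_order(data: bytes) -> str:
    """Guess the byte order from an optional leading byte-order mark."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    return "unknown"
```

Without the mark, a reader of 16-bit text has to be told the byte order by some other means, which is the ambiguity complained about above.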
Next problem: The console. Its fonts and behavior differ from those of X (which, ten years after the invention of TrueType fonts, still lacks correct handling; take a look at Abiword, and you will understand). If I were Chinese, I would want to also see Chinese on my console, but this is even harder than under X, not to mention input routines. But what's the difference for an input parser between X and the console?
Next problem: Somebody better stop me from complaining. We have to move on. We still use the old stuff, but are now saving in XML. This is not very revolutionary. I will try to take a step forward. I'd like to present a solution. It's time to think about an all-inclusive, simple, and working system design.
But first, again, a collection of the problems mentioned above:
Fortunately, there are now these advantages which we can use:
Especially when I think about points 5 and 6 of the advantages, I say: Why bother? Let's give it a try. What I propose is first a new char type that is 32 bits wide. This will give us the security that in the future, no characters of any language will be outlawed. The most low-level routines (that write to the buses) will have to be changed. Upper-level APIs may stay the same (as user programs do), as long as they do not rely on 8-bit overflow arithmetic (255 + 2 = 1). And, for heaven's sake, I propose to use only 7 bits of each of the 4 bytes. Still, we would have 28 bits, or around 270 million signs, available. You might say "That's way too much; 3 bytes, like for my display, is enough!" Well, there are sound cards that process 24-bit samples, but the processor has to pack them into 32-bit packages to enhance speed, so in the end, there's no real advantage to 24 bits. Also, in another 100 years, there might be the need for more. Please throw away the idea that you will see 4 bytes when you open a terminal! A char (you could call it sign or foo or bar if you like) will be an atomic piece of data. This view also fits in the modern multimedia processing arena, where sound data consists of 2-byte 2-channel or, for studio work, 24/32-bit multi-channel data structures. Binaries consist of 32-bit words, as do video streams, the most complex data we know today. The text file just hindered us from taking this revolutionary step.
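The arithmetic behind this proposal can be checked with a short sketch. The septet layout below (most significant 7 bits first, top bit of every byte clear) is an assumption made for illustration; the article does not fix a layout, and the function names are hypothetical.

```python
def wrap_8bit(value: int) -> int:
    """8-bit unsigned arithmetic wraps around at 256."""
    return value % 256

# Four bytes with 7 usable bits each give 28 bits of character numbers.
available_signs = 2 ** 28

def pack_septets(index: int) -> bytes:
    """Spread a 28-bit index over four bytes, 7 bits per byte."""
    if not 0 <= index < available_signs:
        raise ValueError("index does not fit in 28 bits")
    return bytes((index >> shift) & 0x7F for shift in (21, 14, 7, 0))

def unpack_septets(data: bytes) -> int:
    """Inverse of pack_septets; rejects bytes with the top bit set."""
    if len(data) != 4 or any(b & 0x80 for b in data):
        raise ValueError("expected four 7-bit-clean bytes")
    index = 0
    for b in data:
        index = (index << 7) | b
    return index
```

Note that the packed form is no longer the plain binary number, so a program has to shift and mask to get the index back.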
If you think this will blow up your filesystem, you are most likely wrong. Take the sizes of your text files, multiply them by 4 (or 2 if you are using CJK encoding text files), and compare them with your wav, MP3, or DivX files' sizes. Those files will not get bigger. The 7bit style is for old Internet routing hardware, but I think that in another 100 years, it won't still be there. Then, these domains may also be used. The encoding scheme is clear: As the number, so the saving and loading. No conversion. Hi-low byte order is preferred and seems more logical. We could use what Unicode did for 16bit, seamlessly integrating domains, allowing enough room between language domains. Unlike UTF-8, we don't want different sizes for Western and Eastern characters; that makes programmers unhappy and software difficult to control. Also, UTF-8 emphasizes historical Western domination of computing science, which is not very friendly. No start and end sequences -- that's it.
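"As the number, so the saving and loading" could look roughly like the sketch below: every character number is written as one 32-bit word in hi-low (big-endian) order. The names save_chars and load_chars are hypothetical, and for simplicity the sketch ignores the 7-bit restriction proposed earlier.

```python
import struct

def save_chars(text: str) -> bytes:
    """Write each character number as one 32-bit big-endian word."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)

def load_chars(data: bytes) -> str:
    """Read the words back; no table lookup or conversion is involved."""
    if len(data) % 4:
        raise ValueError("length is not a multiple of 4 bytes")
    count = len(data) // 4
    return "".join(chr(n) for n in struct.unpack(f">{count}I", data))

ascii_text = "hello"
stored = save_chars(ascii_text)
growth = len(stored) / len(ascii_text.encode("ascii"))   # factor of 4
```

For text in the Unicode range, this layout happens to coincide with what Python calls "utf-32-be".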
Let's go on. There will still be a mapping between keyboard-sent codes and the 32bit chars attached to them, as a phase of preprocessing. The next step will be checking the user's input choice and sending the data to the parser. This parser will build buffers for input, syllable buffers, chosen readings, etc. The buffers will belong to the system. They will be cleared when switching between parsers, but we need the ability to foresee what we type. Under X, window managers may have a small buffer dock app in which you could see the language symbol of what you're typing. Take a look at IME, and you'll know what I mean. It might be more difficult in the console, but with libs like ncurses, there might be a way to give a better view on the writing.
Also, shells might stop character echoing and write buffer contents instead, then clear back to the last breakpoint, write the new sign, change buffers, and write them again after the new sign. I did this once with a small Japanese console learning app, and it's fast enough that you don't see it. When pressing the enter key, I propose that we use only ready-typed input; otherwise, some shells might want to do so and some do otherwise, not giving a standard behavior.
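A minimal sketch of such a buffer, assuming a toy romaji-to-kana table (four hypothetical entries; a real parser needs far more) and assuming that Enter discards anything not yet converted, as proposed above:

```python
# Hypothetical, tiny conversion table for illustration only.
KANA = {"ka": "\u304b", "ni": "\u306b", "ho": "\u307b", "go": "\u3054"}

class PreeditBuffer:
    def __init__(self):
        self.ready = ""      # converted, "ready-typed" input
        self.pending = ""    # keystrokes still waiting for conversion

    def key(self, ch):
        """Add one keystroke and convert as soon as a syllable is complete."""
        self.pending += ch
        kana = KANA.get(self.pending)
        if kana is not None:
            self.ready += kana
            self.pending = ""

    def display(self):
        """What the shell would redraw instead of echoing raw keys."""
        return self.ready + ("[" + self.pending + "]" if self.pending else "")

    def enter(self):
        """Hand over only ready-typed input and clear both buffers."""
        line, self.ready, self.pending = self.ready, "", ""
        return line

buf = PreeditBuffer()
for ch in "nihon":
    buf.key(ch)
shown = buf.display()    # two kana followed by the unfinished "n"
line = buf.enter()       # only the two kana are delivered
```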
Now I will think about one of the most-feared things in the computer world: The change of data structures and backward compatibility. First, a single machine holds its data -- text files, binaries, etc. It is most likely connected to other machines or the Internet, and that's where I begin. These new generation operating systems that fully process 32bit data from hard disk to whatever will be compiled and booted on machines, but will get data through FTP or other services that will get tons of chars of the old type. Their buffer (char[]) will be a 32bit[] array, provided by the system (because sizeof(char) returns 4!). The service now writes it to disk, but, fortunately, the system does it for us, because the system always has a fear that some applications might damage the hardware. For the newly-installed machine, there is no difference in behavior except the file sizes for texts. When the new machine provides services, there will be the problem of sending too much information. If the client wants the next byte (or char), there might be a problem if it gets a value above 255 (or 127 signed) and dumps an error message or disconnects.
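The two directions of this compatibility problem can be sketched as follows. The function names are hypothetical; the point is only that widening old 8-bit data is always safe, while narrowing new data for an old client can fail.

```python
def widen(legacy: bytes):
    """Old 8-bit data arriving on a new machine: every byte becomes one
    32-bit unit with the same value, so nothing is lost."""
    return list(legacy)

def narrow(units):
    """New 32-bit data sent to an old client: only values that fit in
    one byte can be passed on unchanged."""
    too_big = [u for u in units if not 0 <= u <= 0xFF]
    if too_big:
        raise ValueError("old client cannot accept values: %r" % too_big)
    return bytes(units)

incoming = widen(b"ls\n")            # [108, 115, 10]
outgoing = narrow(incoming)          # round-trips to b"ls\n"

try:
    narrow([0x65E5])                 # a CJK character number
    rejected = False
except ValueError:
    rejected = True
```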
In the end, there will be no progression without backward compatibility problems. Network connections are a big advantage here. We should try to use it and finally throw away our fear, because the gain will be a clear character processing solution that works world-wide, with no more hassles with encoding schemes and browser displaying problems, and a user-friendly, simple-to-use, speedy and secure multi-language interface. Also, encoding information can be left out, which cleans up email and XML files. The first big step will be to change low-level system routines, and for that I wish us all some more courage towards a change of thinking.
You are wrong!
You want to throw out ASCII, which has been the standard for
at least 20 years, and start using some big, graphical-only-make-text-mode-vanish-i-need-boldface which would make all the files
written in ASCII unreadable. And, no more textmode?
Look, ASCII was a standard that was built to last, I mean, it
displays every character in the alphabet, plus all the important
symbols we use everyday. So, you want to throw out the
standard for the last 20 years, make text-based terminals
worthless, forcing everybody to use big, graphical (proprietary?)
systems. Text mode can't do any characters other than those
included in ASCII, so you're willing to eliminate these, just for
boldface & character-sizing? As for your argument about japs,
It's ASCII, "American" Standard Code for Information Interchange. So, we have to stop using ASCII just for people
who write in non-european languages? And, ASCII includes
symbols for european languages (not just english), like german, french, spanish, etc. Which are what half the world
speaks. So, why can't people in the middle east, asia, etc.
Use their own standard, and let us keep ours? I mean, the
languages use completely different character sets, so why
use the same standard? For somebody who wants to
translate some thing from these languages, it would be a
different character set anyway. Why contain 5 charsets in
1 code? And besides, their languages use 500+ chars to do
what we accomplish in 104. Good luck trying to make a jap
language keyboard, kanji is 500 chars! Want to fit that on a
keyboard? I didn't think so. And, arabic uses 500+ chars as
well, and it's not even specific enough to describe a hard
drive! Well, they could, by stealing the word from our language
or making a new word. The point? these languages are
grossly inefficient. And why use the same standard? Okay,
i'm done, sorry if i've offended anyone, this is just my opinion.
It's not that bad, actually
What's wrong with UTF-8? It's an 8-bit multi-byte encoding, and therefore independent of endianness. It's absolutely compatible with plain 7-bit ASCII and easy to program for. Most of the code written for ASCII continues to work with UTF-8, e.g. substring search.
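A quick sketch of both claims, using only the standard str and bytes methods (the sample strings are made up for illustration):

```python
# 1. Pure ASCII text is byte-for-byte identical in UTF-8.
command = "mount /dev/hda1"
same_bytes = command.encode("utf-8") == command.encode("ascii")

# 2. A plain byte-wise substring search works on UTF-8 data.
text = "\u6587\u66f8: \u5831\u544a.txt"
haystack = text.encode("utf-8")
needle = "\u5831\u544a".encode("utf-8")

byte_pos = haystack.find(needle)

# The match starts on a character boundary, so the prefix decodes
# cleanly and its length gives the character position.
char_pos = len(haystack[:byte_pos].decode("utf-8"))

# No byte of a multi-byte character can be mistaken for ASCII.
no_ascii_inside = all(b >= 0x80 for b in needle)
```

The search never matches in the middle of a character because lead bytes and continuation bytes use disjoint value ranges.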
Also, it's just not true that Unicode text files start with a special character sequence. That might be a bad Windows habit, but it's not required by any standard. Once everyone has moved to UTF-8, we can forget about all those character set problems.
> Unlike UTF-8, we don't want different sizes for Western and Eastern characters; that makes programmers unhappy and software difficult to control. Also, UTF-8 emphasizes historical Western domination of computing science, which is not very friendly. No start and end sequences -- that's it.
That's just nonsense. UTF-8 is not hard to program for. And if you don't want to do it yourself, there are a whole lot of libraries out there that deal with it. Trust me, it's going to become the standard on GNU/Linux systems in the near future.
Regarding "Western domination": Please try to view this more pragmatically. The worst that can happen with UTF-8 over UCS-4 (the 32bit Unicode encoding), is that a full-length character needs 6 bytes rather than 4. But in practice, most characters won't need full 6 bytes -- for instance, Japanese fits just fine into UCS-2 (16 bit), right? Why should it need more than 4 bytes per character in UTF-8 encoding?
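The byte counts are easy to check. One caveat on the figures: the 6-byte maximum comes from the original UTF-8 definition covering 31 bits; the current definition (RFC 3629) stops at U+10FFFF and therefore at 4 bytes per character. The helper name below is hypothetical.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

lengths = {
    "ascii letter": utf8_length("A"),            # U+0041
    "accented letter": utf8_length("\u00e9"),    # U+00E9
    "hiragana": utf8_length("\u304b"),           # U+304B
    "kanji": utf8_length("\u65e5"),              # U+65E5
    "beyond 16 bits": utf8_length("\U00020000")  # U+20000
}
```

So Japanese text in the 16-bit range costs 3 bytes per character in UTF-8, compared with 2 in UCS-2 and 4 in UCS-4.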
And you can't just declare ASCII obsolete. It's just fine for English, and face it: Any serious programming has to be done in English nowadays (at least Open Source programming), and I doubt this will change in the foreseeable future.
Maybe you should have a look at GTK+ and Pango. GTK+ 2.0 uses UTF-8 for all text now.
Re: It's not that bad, actually
> Unlike UTF-8, we don't want different
> sizes for Western and Eastern
> characters; that makes programmers
> unhappy and software difficult to
> control. Also, UTF-8 emphasizes
> historical Western domination of
> computing science, which is not very
> friendly.
I don't think it's unfair to give greater importance to alphabetic, or even syllabic, writing systems. We want to support those other complex character-per-word character sets, but they are purely "legacy" languages. That's not cultural bias - that's system analysis. This opinion is supported by the popularity of romanized Japanese as an input system even among native speakers of Japanese.
Re: It's not that bad, actually
> Also, it's just not true that Unicode
> text files start with a special
> character sequence. That might be a bad
> Windows habit, but it's not required by
At least with 16-bit Unicode, little- and big-endian byte order is marked. I don't know the exact sequence; it's a 16-bit (2-byte) scheme that
differs between Windows and Posix.
> That's just nonsense. UTF-8 is not hard
> to program for. And if you don't want to
> Trust me, it's going to become the
> standard on GNU/Linux systems in the
> near future.
> And you can't just declare ASCII
> obsolete. It's just fine for English,
That's all?
> and face it: Any serious programming has
> to be done in English nowadays (at least
That's not a serious argument. My article is about
global thinking, and all you talk about is English
in every direction.
> Open Source programming), and I doubt
> this will change in the foreseeable
> future.
That's exactly the point. If we are not willing
to leave ASCII behind us, there will never be
a clear encoding scheme, and my point was that,
when character sizes differ from one language
to another, there will also be more difficulties than
to use a standardized 32bit character width.
And, if you have an array containing characters
(a string), then every data unit is of the same
size - or not? So you have to convert it internally
to same bit width anyway to work with...
> Maybe you should have a look at GTK+ and
> Pango. GTK+ 2.0 uses UTF-8 for all
> text now.
Maybe we use UTF-8 in 2 years everywhere.
And after 2 more years, we will again ask ourselves
why there are different sizes for characters, and
then we come up with this "historical" stuff once
again and again and again...
And - who wants to use a library to write simple
strings to a file? We put so much energy in these sophisticated libraries but have no power to overcome the past mistakes to avoid the whole workaround.
What about the central parsing engine idea?
32-bit text format
(Of course, this appears right as I'm uploading
a UTF-8-native version of Yeemp...)
Problems - first off, everything has to be
audited and recompiled - malloc(strlen(src)+1)
needs to become malloc(strlen(src)+sizeof(char))
everywhere it appears.
Second: Confining things to 7bit seems wasteful.
The only reason to use 7bit is to transfer data
cleanly over 7bit-only links. 7-bit protocols
will require that the commands be in 8-bit chars,
even if the data are in 32-bit chars. To deal
cleanly with many 7-bit protocols, you'll need to
avoid using a large number of control and ASCII
glyphs in the 32-bit chars. Worse, the glyphs
to avoid differ from protocol to protocol.
Embedded nulls and CRs or LFs will break almost
any 7-bit protocol; @ signs in the wrong place
will choke SMTP; . will confuse domain resolvers;
space will confuse webservers. The characters
remaining for your encoding (and that's just after
chopping the ones that I think'll cause problems)
will probably make Base64 look pleasant. Further,
parsing a glyph index composed of discontiguous
septets in a 32-bit word will be a nuisance to
any program which has to deal with them. If
you're changing the char size, it breaks enough
stuff as-is that there's no point in trying to
get it 7-bit-clean *too*.
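A small sketch of why raw 32-bit characters upset line-oriented protocols. It assumes big-endian 32-bit words and checks only three troublesome byte values; the function names and the HAZARDS table are made up for illustration.

```python
import struct

def as_32bit(text: str) -> bytes:
    """Each character number as one big-endian 32-bit word."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)

# Byte values that commonly end or split a line in text protocols.
HAZARDS = {0x00: "NUL", 0x0A: "LF", 0x0D: "CR"}

def hazards(data: bytes):
    """Names of the troublesome byte values that occur in data."""
    return sorted({HAZARDS[b] for b in data if b in HAZARDS})

plain_a = hazards(as_32bit("A"))             # padding bytes are NULs
with_cr = hazards(as_32bit(chr(0x4E0D)))     # low byte is 0x0D
with_lf = hazards(as_32bit(chr(0x4E0A)))     # low byte is 0x0A
in_utf8 = hazards("A".encode("utf-8"))       # nothing to trip over
```

Even a single ASCII letter drags three NUL bytes along once it is widened to 32 bits.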
However, I do want one single giant character set, whether
it's 16-bit, 32-bit, or something else. Having
to tag every bit of content with encodings is
annoying (when it's text files), infuriating
(when it comes to files with multiple chunks of
data in different encodings), unfeasible (how
am I supposed to indicate the encoding for
user@host.co.uk, where 'host' is in Devanagari
and 'user' in Hangul?), and unreliable (when every
web browser comes with a list of every encoding
that any other web browser ever claimed to support...).
IMO, display, character sets, and input are
things that should be semi-independent - I don't
want to put the BFF on my emergency boot
disk, and input methods that are great for one
language may be suboptimal or terrible for
another.
Re: You are wrong!
> You want to throw out ASCII, which has
> been the standard for
> at least 20 years, and start using some
> big,
> graphical-only-make-text-mode-vanish-i-need-boldface
> which would make all the files
> written in ASCII unreadable. And, no
> more textmode?
No, I don't want to throw textmode out. I like
textmode, and, as I wrote, I want those parsers
and fonts available for both X and console.
> systems. Text mode can't do any
> characters other than those
> included in ASCII, so you're willing to
> eliminate these, just for
You mean there can never be a terminal
that handles 32bit? Why not?
> your argument about japs,
> It's ASCII, "American"
> Standard Code for Information
> Interchange. So, we have to stop using
Oh man, please no racism here.
> trying to make a jap
> language keyboard, kanji is 500 chars!
You have no clue, have you? There is a
syllable system in Japanese, and Korean
has radicals to build up characters, and
before you go on hurting the East, better
change your attitude towards other people.
Japanese parsed their language before, too.
Re: It's not that bad, actually
> At least with 16bit Unicode, there
> little and big endian is marked. I don't
> know exactly the sequence, it's a 16bit
> (2 bytes) scheme that
> differs between Windows and Posix.
I'm talking about UTF-8 text files.
> > And you can't just declare ASCII
> > obsolete. It's just fine for English,
>
> That's all?
Yes, that's all.
> > and face it: Any serious programming has
> > to be done in English nowadays (at least
> That's not a serious argument. My article is about
> global thinking, and all you talk about is English
> in every direction.
I'm talking about source code, not the user interface. All pilots use English to communicate with each other, for good reasons. The same is true for programmers. You need a least common denominator.
> > Open Source programming), and I doubt
> > this will change in the foreseeable
> > future.
>
> That's exactly the point. If we are not willing
> to leave ASCII behind us, there will never be
> a clear encoding scheme, and my point was that,
> when character sizes differ from one language
> to another, there will also be more difficulties than
> to use a standardized 32bit character width.
> And, if you have an array containing characters
> (a string), then every data unit is of the same
> size - or not? So you have to convert it internally
> to same bit width anyway to work with...
No. Quite the contrary: UTF-8 is meant for text, not single characters. Perhaps you should make yourself familiar with UTF-8. See www.cl.cam.ac.uk/~mgk2...
> Maybe we use UTF-8 in 2 years everywhere.
> And after 2 more years, we will again ask ourselves
> why there are different sizes for characters, and
> then we come up with this "historical"
> stuff once again and again and again...
If we go for fixed-length characters, maybe we'll ask ourselves then why we only have 32 bit...
> And - who wants to use a library to
> write simple strings to a file? We put so much energy
> in these sophisticated libraries but
> have no power to overcome the past
> mistakes to avoid the whole workaround.
You won't need a library to write simple strings to a file.
Re: You are wrong!
> And, ASCII includes
> symbols for european languages (not just
> english), like german, french, spanish,
> etc. Which are what half the world
> speaks.
You really have no clue. ASCII is a 7 bit character set. It doesn't contain the characters needed for most European languages, e.g. German umlauts or French accented characters.
Re: It's not that bad, actually
> I'm talking about source code, not the
> user interface. All pilots use English
Why, is there a difference? Languages should
be available for the console/terminal, imho. So
one can program in non-ASCII. I don't see any
difficulties besides the recoding of given sources.
The second thing I want to point out is, that a GUI needs, for example, the next character of a string. In a standardized width, it takes the next data unit, with UTF-8, it passes it to a library which has to parse it. It takes time and is not very secure.
When (by whatever) a byte gets lost, then further parts of the string are handled incorrectly, because
following bytes are misinterpreted. Or take functions on strings - it gets more complicated than it needs to be.
Re: It's not that bad, actually
> Why, is there a difference? Languages should
> be available for the console/terminal, imho.
Correct.
> So one can program in non-ASCII. I don't see any
> difficulties besides the recoding of given sources.
No, you can't. I'm talking about Open Source, and source code not written in English is not open.
> The second thing I want to point out is,
> that a GUI needs, for example, the next
> character of a string. In a standardized
> width, it takes the next data unit, with
> UTF-8, it passes it to a library which
> has to parse it. It takes time and is
> not very secure.
Time is less of a problem than memory nowadays. Also, I'm not sure what you mean by "security".
> When (by whatever) a byte gets lost,
> then further parts of the string are
> handled incorrectly, because
> following bytes are misinterpreted. Or
> take functions on strings - it gets more
> complicated than it needs to be.
UTF-8 is designed in a way that allows recovery after encoding errors without synchronization sequences. Basically, all bytes of a multi-byte sequence have bit 7 set, i.e. are 0x80 or above.
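Resynchronization can be sketched in a few lines. In UTF-8, continuation bytes always have the bit pattern 10xxxxxx, so a reader that lands in the middle of a character can skip forward to the next byte that is not a continuation byte. The function names are hypothetical.

```python
def is_continuation(byte: int) -> bool:
    """Continuation bytes in UTF-8 look like 10xxxxxx."""
    return byte & 0xC0 == 0x80

def resync(data: bytes, pos: int) -> int:
    """Advance pos to the start of the next complete character."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos

good = "\u65e5\u672c\u8a9e".encode("utf-8")    # three kanji, nine bytes
damaged = good[1:]                             # the first byte got lost

start = resync(damaged, 0)                     # skips two orphaned bytes
recovered = damaged[start:].decode("utf-8")    # the last two kanji

# The standard decoder recovers the same way when asked to replace errors.
lenient = damaged.decode("utf-8", errors="replace")
```

Only the character that lost a byte is affected; everything after it decodes normally.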
And GUI apps will rarely need to process a string character-wise, except perhaps word processors. Most of the time, you'll get a string via gettext(), display it, and forget about it. Anything else requires thinking about internationalization, no matter whether you use UTF-8 or UCS-4 or whatever. For instance, the next character in a string might not be a character on its own, but part of a combining sequence. With internationalized text, you can't just arbitrarily pick out single characters or substrings and interpret them out of context.
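The combining-sequence point holds regardless of the encoding, as this sketch with the standard unicodedata module shows. The example uses "e" followed by U+0301 (combining acute accent).

```python
import unicodedata

decomposed = "e\u0301"          # "e" plus a combining acute accent
composed = unicodedata.normalize("NFC", decomposed)

units = len(decomposed)         # two code points, one visible character
is_mark = unicodedata.combining("\u0301") > 0

# Taking "the next character" as one fixed-width unit strips the accent.
first_unit = decomposed[:1]
```

Even with 32 bits per unit, one unit is not always one character on screen.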
Re: You are wrong!
> 1 code? And besides, their languages use
> 500+ chars to do
> what we accomplish in 104. Good luck
> trying to make a jap
> language keyboard, kanji is 500 chars!
> Want to fit that on a
> keyboard? I didn't think so. And, arabic
> uses 500+ chars as
> well, and it's not even specific
> enough to describe a hard
> drive! Well, they could, by stealing the
> word from our language
> or making a new word. The point? these
> languages are
> grossly inefficient. And why use the same
> standard? Okay,
The Western alphabet as we know it now is obsolete. The binary system is the new character set.
I.e. why have the Western alphabet when we have binary? I mean, we can encode every letter into binary. That means we only need a 2-button keyboard!
Do you get the point? People like to communicate in their native language. Maybe yours is not the same as mine, and maybe mine is not as "efficient" as yours, but I still like to use it.
Re: You are wrong!
> I.e. why have the Western alphabet when
> we have binary? I mean, we can encode
> every letter into binary. That means we
> only need a 2-button keyboard!
Bah, we only need 1 button and a frequency slider :-P
sizeof(char)
> Problems - first off, everything has to
> be
> audited and recompiled -
> malloc(strlen(src)+1)
> needs to become
> malloc(strlen(src)+sizeof(char))
> everywhere it appears.
Do you mean malloc(strlen(src)+1) should become malloc((strlen(src)+1)*sizeof(char)) everywhere? In any case it doesn't matter since the C standard requires that sizeof(char) be 1. Any compiler which doesn't make sizeof(char) 1 is non-conforming. This makes sense since the sizeof operator returns the size of its operand in chars. How many chars are there in a char? Well 1 of course.
Of course this does not preclude 32-bit chars. There are lots of implementations out there where CHAR_BIT is 32.
Using 32-bit wchar_ts would make more sense than making 32-bit chars though.
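The sizes can be inspected from Python through ctypes, which exposes the C types of the platform it runs on. This is a sketch: the width of wchar_t depends on the platform (commonly 2 or 4 bytes), and buffer_bytes is a hypothetical helper that mirrors the corrected malloc expression.

```python
import ctypes

char_size = ctypes.sizeof(ctypes.c_char)      # 1, as the C standard requires
wchar_size = ctypes.sizeof(ctypes.c_wchar)    # platform-dependent

def buffer_bytes(length: int, unit_size: int) -> int:
    """Bytes needed for `length` characters plus one terminator:
    the equivalent of (strlen(src) + 1) * sizeof(unit)."""
    return (length + 1) * unit_size

# ctypes allocates the terminator as well, so the sizes line up.
narrow_buf = ctypes.create_string_buffer(b"abc")   # like char[4]
wide_buf = ctypes.create_unicode_buffer("abc")     # like wchar_t[4]
```

With char, multiplying by the unit size changes nothing; with wchar_t, leaving it out under-allocates.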
> However, I do want one single giant
> character set, whether
> it's 16-bit, 32-bit, or something else.
I agree. We already have such a thing, though: UCS-2 and UCS-4. Strings of Unicode characters that are uniformly 16 bits or 32 bits (respectively) in size. I didn't see the author proposing anything that UCS-4 wouldn't fix.
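A sketch of the difference in practice. Python has no codec named UCS-2, so "utf-16-be" stands in for it here; the two agree for every character that fits in 16 bits. The helper names are hypothetical.

```python
def units16(text: str) -> int:
    """Number of 16-bit units the text needs."""
    return len(text.encode("utf-16-be")) // 2

def units32(text: str) -> int:
    """Number of 32-bit units the text needs."""
    return len(text.encode("utf-32-be")) // 4

def nth_char_utf32(data: bytes, n: int) -> str:
    """With uniform 32-bit units, character n sits at byte offset 4 * n."""
    return data[4 * n:4 * n + 4].decode("utf-32-be")

bmp = "\u65e5\u672c\u8a9e"       # three characters within 16 bits
beyond = "\U00020000"            # one character outside 16 bits

second = nth_char_utf32(bmp.encode("utf-32-be"), 1)
```

A character beyond 16 bits takes two 16-bit units, so only the 32-bit form is uniformly sized for everything.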
I'd rather use UTF-8 than UCS-4, though.
What about typesetting?
This article only covers the characters and fonts involved in outputting text written in different languages. That's a problem in multilanguage environments like the internet, but it's not the only one.
The output has to be displayed correctly as well. Western languages are written and read from left to right. Most text printing routines can only handle that way of outputting text. But it's not the only one. Arabic text goes from right to left (just the opposite of Western languages). And if I remember correctly, some Eastern languages are even written from top to bottom, that is, in columns rather than lines.
So you have to think about a way to handle output as well. A graphical system can print single characters correctly, but they need to be aligned correctly to form a text that makes esnes - sorry, I meant "sense".
I think that this becomes a serious problem if you build a system that is capable of printing all those characters at the same time. How do you typeset a text that contains passages in English, Arabic and Chinese?
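The character database already records a direction class for every character, which is the first thing a typesetter needs. The sketch below only splits a string into runs of equal class using the standard unicodedata module; actually reordering the runs for display, and vertical layout, are separate and much harder steps. The function name is hypothetical.

```python
import unicodedata
from itertools import groupby

def direction_runs(text: str):
    """Split text into runs of characters with the same Unicode
    bidirectional class ("L" is left-to-right, "AL" is Arabic letter)."""
    return [(cls, "".join(chars))
            for cls, chars in groupby(text, key=unicodedata.bidirectional)]

english = "abc"
arabic = "\u0633\u0644\u0627\u0645"      # four Arabic letters
chinese = "\u4e2d\u6587"                 # two Chinese characters

runs = direction_runs(english + arabic + chinese)
```

The English and Chinese runs are laid out left to right, while the Arabic run in between has to be drawn right to left.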
16 bits is enough
It's really a myth that 16 bits isn't enough. People make statements about how there's some huge number of Chinese characters, too many for 16 bits. But the reality is that most of these characters are names that parents have had experts create for their children in order to be able to name them something special. Nobody can read these characters. (Heck, as an American, I could name my kid with some weird symbol, and then complain that it wasn't implemented on computers.) The typical Chinese person knows a relatively small number of characters, and even highly educated people know far less than 2^16.
Re: You are wrong!
> You want to throw out ASCII, which has
> been the standard for
> at least 20 years, and start using some
> big,
> graphical-only-make-text-mode-vanish-i-need-boldface
> which would make all the files
> written in ASCII unreadable. And, no
> more textmode?
There was no mention of eradicating textmode. In fact, making textmode international WAS a main issue of the article.
And also, I don't think that this h4x0r-ish paranoid fear of development is appropriate for a community which develops (hopefully) widely used applications. It IS alright for h4x0rs and self-centered geek-bullies who rejoice in picking on beginners and showing their elitism off, since development may make their skills less arcane to the average human...
> As for your argument about japs,
> It's ASCII, "American"
> Standard Code for Information
> Interchange. So, we have to stop using
> ASCII just for people
> who write in non-european languages?
Well, the whole issue is communication BETWEEN nations. I understand that learning languages is not very popular in the US, but then again, it IS a country (empire?) of many nationalities, so I don't think an 'American' standard should stop with English. There are huge communities of Germans, Chinese, Japanese and so on living in the US.
And also, face it, not all communication around the world happens in English. I study Japanese also, and know that the English skill of the average Japanese SUCKS. It is certainly the schools to blame, but still, if I want to communicate in an intelligent way, it's Japanese or nothing.
So, suppose there is a US citizen of German heritage who speaks Japanese. He has netpals from Japan, who don't speak English too well. Now, he is faced with the problem of having to have one plus two different charsets and input methods: English for programming and work, German for keeping in touch with his relatives, and Japanese. I guess you know the workings of Linux i18n, and know that this is a massive sucker, since everything is done via environment variables... If I want to type in Japanese, I set the vars, and start a compatible text processor (well, not to mention the input parser daemons). On a long-term basis, this is a totally unusable construction for multilanguage platforms (that is, not English + Other, but Other + Other platforms).
See, I say i18n IS necessary, on a level as low as possible - like, say, system-wide I/O daemons with a unified encoding on kernel-level.
Of course, still, we can go off in two directions: the American way, and the 'Terran' way. But this just wouldn't lead to anything nice. For the USA, mainly.
> So, why can't people in the
> middle east, asia, etc.
> Use their own standard, and let us keep
> ours? I mean, the
> languages use completely different
> character sets, so why
> use the same standard? For somebody
> who wants to
> translate some thing from these
> languages, it would be a
> different character set anyway.
The answer was already provided. One person can speak more than one language. One terminal can be used by people of different nationalities. One server can be accessed by clients from different countries.
That's why.
> Good luck trying to make a jap
> language keyboard, kanji is 500 chars!
> Want to fit that on a
> keyboard? I didn't think so.
Kanji (the widely used ones, that is) is not 500, but approximately 2000 chars, plus 2*46 for the kana, not counting the nigorization.
Still, it CAN be input into a computer, with a classic roman alphabet keyboard, using semi-intelligent input methods which convert the phonetically entered text into the correct characters.
And also, there ARE 'real' Japanese keyboards, but they use kana, which is, in turn, converted into kanji on-line by input methods.
By the way, your comment was not funny, and taking it a step further, it was downright stupid. Did you actually think, that Japan, being THE most advanced nation in terms of digital technology and communication, uses English on its computers? :)
> The point? these languages are
> grossly inefficient. And why use the same
> standard? Okay,
> i'm done, sorry if i've offended anyone,
> this is just my opinion.
Well, I think measuring natural languages for efficiency is a dead effort. I am Hungarian, and we can express things in a single word which would take a whole sentence in English. A text which would take up an entire page in English can be put down in a few lines in Japanese, thanks to the compact writing. Well, learning the English language IS much simpler than Hungarian, and mastering the Roman alphabet IS way easier than mastering the Kanji. See, efficiency is such a complex idea that rational measurement is a bit out of reach. Let's not judge ANY language 'inefficient', 'obsolete', or 'optimal'. All have their pros and cons. It's a matter of situation and personal preference.
Re: It's not that bad, actually
>
> No, you can't. I'm talking about Open
> Source, and source code not written in
> English is not open.
>
Really? I believe Open Source is when sources are available.
If you think so, I could say programs written in Perl cannot be Open Source for me because I don't know that programming language.
I think having a common language to communicate is good, but your argument is wrong.
Alessandro.
Re: You are wrong!
Obviously he has no clue -- but you do not seem to have one either. Talking about
new 32-bit encodings all the time, and not even mentioning the existing UCS-4. Sorry,
but maybe you should inform yourself first before writing such articles...
BTW, though I don't speak English natively, I still believe that it deserves its
position in the computer world. (It's a nice pragmatic language.) Everyone
wins when there is less confusion about languages -- no place for nationalism
here! But that's rather OT...
Re: You are wrong!
And, by the way, making a system with no INHERENT ASCII compatibility is not nearly the same as making a system with ABSOLUTELY no ASCII compatibility. ASCII is a very simple code, and writing a conversion module is more than easy. I think dropping the inherent compatibility wouldn't even be apparent to the ordinary user, who could still load and save ASCII files like before; all the work could be done by transparent filters at the kernel level.
Whining over the possibility of 'losing' ASCII just doesn't make sense. Nothing would be REALLY lost, and much would be gained.
Re: It's not that bad, actually
> Really? I believe Open Source is when
> sources are available.
> If you think so, I could say programs
> written in Perl cannot be Open Source
> for me because I don't know that
> programming language.
It's much easier to learn Perl than, say, Japanese. At least for a programmer. Also, there are far fewer programming languages out there than spoken languages. Besides, I think it's hard to express something technical in most spoken languages anyway.
> It think having a common language to
> communicate is good, but your argument
> is wrong.
Why do you think it's wrong? Be realistic. If someone sends a code snippet to a mailing list, asking for help, (s)he had better write it in English or (s)he will be ignored. The GNU coding standards advise you to write your source code in English for a reason.
Re: What about typesetting?
> The output has to be displayed correctly
> as well. Western languages are written
> and read from left to right. Most text
> printing routines can only handle that
> way of outputting text. But it's not the
> only one. Arabic text goes from right to
> left (just the opposite of western
> languages). And if I remember corretly,
> some eastern languages are even written
> from top to bottom, that is, in colums
> rather than lines.
>
> So you have to think about a way to
> handle output as well. A graphical
> system can print single characters
> correctly, but they need to be alligned
> correctly to form a text that makes
> esnes - sorry, I meant
> "sense".
>
> I think that this becomes a serious
> problem if you build a system that is
> capable of printing all those characters
> at the same time. How do you typeset a
> text that contains passages in English,
> Arabic and Chinese?
Umm, I think these problems have in large parts already been solved by systems like Pango etc. Are you familiar with Pango?
Re: It's not that bad, actually
>
>
> % Really? I believe Open Source is when
> % sources are available.
> % If you think so, I could say programs
> % written in Perl cannot be Open Source
> % for me because I don't know that
> % programming language.
> [snip]
> % It think having a common language to
> % communicate is good, but your
> argument
> % is wrong.
>
>
> Why do you think it's wrong? Be
> realistic. If someone sends a code
> snippet to a mailing list, asking for
> help, (s)he should better write it in
> English or (s)he will be ignored. The
> GNU coding standards advise you to write
> your source code in English for a
> reason.
>
The point he's making is that the language the code is written in has nothing to do with whether the code is Open Source or not. That's a licensing issue. Your original statement about code not being open if it's not written in English is factually incorrect. It's true that it's customary for people to write open source in English, but that's a completely different issue.
Re: It's not that bad, actually
> The point he's making is that the
> language the code is written in has
> nothing to do with whether the code is
> Open Source or not. That's a licensing
> issue. Your original statement about
> code not being open if it's not written
> in English is factually incorrect. It's
> true that it's customary for people to
> write open source in English, but that's
> a completely different issue.
OK, I agree. There is no law that demands Open Source code has to be written in English. But still, even though I think it should be possible to write source code in other languages, it just doesn't fit into my idea of open.
Re: You are wrong!
> > So, why can't people in the
> > middle east, asia, etc.
> > Use their own standard, and let us
> > keep ours? I mean, the
> > languages use completely differant
> > character sets, so why
> > use the the same standard? For
> > somebody who wants to
> > translate some thing from these
> > languages, it would be a
> > different character set anyway.
%
> The answer was already provided. One
> person can speak more than one language.
> One terminal can be used by people of
> different nationalities. One server can
> be accessed by clients from different
> countries.
> That's why.
Exactly.
I just want to add: You (the guy who started this thread) obviously never did any serious programming. Otherwise you'd know what it means to make a program work with different languages, locale settings, character sets and so on.
Sorry, UTF-8 *is* the way to go
Rather than rewriting virtually all existing source code, it makes infinitely more sense to go with UTF-8. In fact, this decision has already been made, and disparate operating systems from Windows to Linux (Linux in a big way) are slowly standardizing on it.
UTF-8 gives you compatibility with ASCII, access to the full 31-bit Unicode range (Unicode saves the one extra bit for error codes, sign bits, etc.; very smart!), an error-recoverable byte stream, stateless and computationally trivial conversion, very low overhead for most existing text, overwhelming compatibility with existing software (no code changes for most software!!), and relatively trivial string width computation;
see for yourself: "www.cl.cam.ac.uk/~mgk2...;
Using UCS-4 would be a huge headache with few benefits. It would also introduce whole new kinds of bugs, for example assuming that the number of UCS-4 chars equals the display width of the string (not true; see combining chars, zero-width glyphs, etc.).
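A few of the properties listed above are easy to check for yourself. This is only a small sketch using Python's built-in codecs; the sample strings are my own and nothing here comes from the linked page.

```python
# Illustrative only: checks some of the UTF-8 properties claimed above.

ascii_text = "mount /dev/hda1"
# ASCII text is byte-for-byte identical when encoded as UTF-8.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Characters outside ASCII take 2 to 4 bytes each in UTF-8.
samples = ["a", "\u00e9", "\u65e5", "\U00020000"]  # a, e-acute, a kanji, a supplementary-plane Han char
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

# Error recovery: chop the first byte off a 3-byte character.
# A decoder can resynchronise at the next lead byte instead of
# garbling the rest of the stream.
damaged = "\u65e5\u672c\u8a9e".encode("utf-8")[1:]
recovered = damaged.decode("utf-8", errors="replace")

print(same_bytes, byte_lengths, recovered)
```

Only the damaged character is lost; the two characters after it still decode correctly, which is what "error-recoverable" means in practice.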
Two different points of view
I think the character set issue is mostly related to documents and graphical user interfaces rather than to coding and consoles.
For the latter, ASCII may be enough (does it make sense to translate shell commands and language keywords? MS Excel says yes, but I'm not sure it is so useful).
Low-level access to the system can still be ASCII, while for higher-level interfaces we can use i18n and libraries to handle it.
Re: You are wrong!
> And, by the way, making a system with no
> INHERENT ASCII compatibility doesn't
> nearly mean making a system with
> ABSOLUTELY no ASCII compatibility. ASCII
> is a very simple code, and writing a
> conversion module is more than easy. I
> think dropping the inherent
> compatibility wouldn't even be apparent
> to the simple user, he could still load
> and save ASCII files like before, all
> the work could be done by transparent
> filters on a kernel level.
User-transparent character set conversion is a Bad
Thing - it requires that *all* data carry encoding
markers, and allows plenty of room for seamless
data corruption (what happens when you save a file
after pasting data in a different encoding in? Does
the UCS-4 Han supplemental plane glyph in
someone's name get saved in /etc/passwd as UTF-8,
or does /etc/passwd become a UCS-4 or UTF-16
document when you're not looking? Or does it get
translated into ASCII or UTF-7? What if the translation
includes ':'?)
Character set conversion in the kernel is a Terrible
Thing. It opens up room for all sorts of problems
(like those IIS path traversal bugs that use overlong
UTF-8 sequences to take advantage of someone's
broken UTF-8/UCS-2 converter.), wastes resources,
and will probably slow down many operations that
have no business knowing or caring what character
set you use.
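To make the overlong-sequence point concrete, here is a minimal sketch in Python. It does not reproduce the IIS bug itself; it only shows that the two bytes 0xC0 0xAF are an overlong (and therefore illegal) spelling of "/", and that a strict decoder refuses them. The helper name is my own.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # Python's decoder rejects overlong forms
        return True
    except UnicodeDecodeError:
        return False


plain_slash = b"../"             # ordinary ASCII, also valid UTF-8
overlong_slash = b"..\xc0\xaf"   # 0xC0 0xAF: overlong encoding of "/"

print(is_strict_utf8(plain_slash))     # well-formed
print(is_strict_utf8(overlong_slash))  # rejected
```

A lenient converter that accepts the second form turns it into "/" only after any path check has already looked at the raw bytes, which is the kind of hole described above.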
Why is this taking the form of a religious debate?
UTF-8 should be used for a text storage format. Why? Since the vast majority of documents in the world, when saved, will take up less space on the hard drive. Let's face it, as has been mentioned elsewhere, most of the text files out there are compatible with ASCII. This is not racism or imperialism. It's pragmatism.
String manipulation within programs (in-memory) is a whole different story, however. Here, a fixed-size character makes more sense in most cases. While the case of the program that simply reads data in and spits it out again verbatim is a fairly common one, the vast majority of real-world programs manipulate character strings during the course of processing.
To be more clear about this, getting the Nth character of a fixed-char-size string is a constant-time operation, O(1), and takes the same amount of time whether N is equal to 5 or 500. On the other hand, getting that same character in a variable-width-char string is a linear operation, O(n), and takes approximately one hundred times as long for the 500th character as for the 5th. The same holds true for substring operations. Character replacement gets a lot more tricky as well. If all characters are treated the same, what happens when someone tries to replace the Nth character (which for the sake of this exercise, we'll call 'c') with a Thai character? In a fixed-width character string, this works just as in an ASCII string. In a variable-width character string, a lot of extra processing and data movement is necessary, or subsequent characters will get overwritten due to the size difference between the two characters.
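The difference can be sketched in a few lines of Python working on raw bytes. The function names are hypothetical and the code assumes well-formed input; it is an illustration of the argument, not a benchmark.

```python
def _seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_char_at(data: bytes, n: int) -> str:
    """n-th character (0-based) of UTF-8 data: must walk over n characters first."""
    i = 0
    for _ in range(n):
        i += _seq_len(data[i])
    return data[i:i + _seq_len(data[i])].decode("utf-8")


def utf32_char_at(data: bytes, n: int) -> str:
    """n-th character of UTF-32-LE data: a single slice at a computed offset."""
    return data[4 * n:4 * n + 4].decode("utf-32-le")


text = "a\u00e9\u65e5\U00020000z"
as_utf8 = text.encode("utf-8")
as_utf32 = text.encode("utf-32-le")
print([utf8_char_at(as_utf8, i) == utf32_char_at(as_utf32, i) for i in range(len(text))])
```

Both functions return the same characters; the first one just has to step over every earlier character to find out where the requested one starts.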
To ignore the existing base of data is a recipe for the existing base of programmers and writers to ignore you.
The first step is UTF-8 compatibility. This is a minor change to most programs. Without some tie to a universal character encoding, i18n is impractical for all intents and purposes.
The second step is to impress upon programmers the algorithmic efficiency advantages of using fixed-width characters in their programs instead of variable-width.
Finally, as time goes on and new programs are created -- especially with the advent of newer languages that encourage better i18n behaviour -- people may find it easiest to save their data in UCS-2 or UCS-4 because that is the serialized form of their in-memory data structure. Then and only then will you see the widespread changeover. Anything else is pissing into the wind.
Will it happen overnight? No. Will it happen in our lifetimes? Probably. Is it worth it to rip apart all of our existing infrastructure, effectively stop all new development, effectively halt all existing development, and recode everything right now with 4-byte characters? I certainly hope that you say 'no'. Otherwise you will be advocating the betterment of society by making it a horrible place to live.
Re: 16 bits is enough
You're almost right that you don't need more than 2^16 chars for daily life,
but we Japanese have historical documents and classic literature.
We can't change the names of people and places, living or historical,
just to fit them into a 2^16 charset.
Unless you suggest that we abandon our history, 2^16 is not enough.
(If you mean surrogate pairs in UTF-16, then yes, that's enough so far to
represent those rare characters in a stream of 2^16 codes.)
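For readers who have not met surrogate pairs: a character beyond 2^16 is split into two 16-bit codes taken from reserved ranges. A rough sketch in Python follows; the function name is my own, and U+20000 is just an example of a rare Han character from the supplementary plane.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x20000)
# Cross-check against Python's own UTF-16 encoder.
encoded = "\U00020000".encode("utf-16-be")
print(hex(high), hex(low), encoded.hex())
```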
Re: 16 bits is enough
> the reality is that most of these
> characters are names that parents have
> had experts create for their children in
> order to be able to name them something
> special.
What have you been smoking? I want some! I both speak and write Chinese fluently and this is the first time I have heard of such nonsense.
Charles
Your title, "True Internationalization" holds the key to the requirements.
Good and very relevant topic. (Stepping up on soapbox)
If it is true internationalization you are after, then the only good existing answer is UTF-8, which allows multiple languages in the same document without context sensitivity for characters based on position in the document. If you really require a 4-byte representation within your program, you can always convert to UCS-4 while you do your private magic.
Please read up on all the existing choices. If you do, I suspect that you will see the many advantages of UTF-8.
This also provides the advantage of being usable for multiple languages on machines that were designed for 8-bit ASCII characters without even requiring Unicode conversion routines (as long as you only use ASCII and UTF-8). This is absolutely brilliant for embedded devices where space is still an issue.
Unfortunately, 32-bit chars do require conversion libs and will be context sensitive because you cannot fit all the possible characters for all languages into a single 32-bit code space. - Or perhaps you are proposing that some people's languages are not important enough to include?
(This requires special codes to switch to a new code space. The resulting context problems are a much more severe programming problem than different byte lengths for various characters.) Also, you can program your Open (or closed) source programs in UTF-8 today with more than one language used in the source, and it works fine (even on my ancient systems).
You are exactly right about needing full international support in computers today. From what I have seen the people doing real work in this area go to UTF-8. You can probably tell that it is my choice as well.
(A question for Unicode wizards: why is there a common practice to be converting utf-8 to ucs-4 for storage? To me utf-8 seems to be ideal for both storage and use within programs.)
Regards,
Curtis
Re: It's not that bad, actually
%
> OK, I agree. There is no law that
> demands Open Source code has to be
> written in English. But still, even
> though I think it should be possible to
> write source code in other languages, it
> just doesn't fit into my idea of
> open.
>
Oh really? Doesn't it rather depend on who will be using and writing the program? Wouldn't it make more sense for, say, an application that computed German telephone rates, to be in German? If everyone writing and using the app are German speakers, why should they use English?
Re: It's not that bad, actually
> Oh really? Doesn't it rather depend on
> who will be using and writing the
> program? Wouldn't it make more sense
> for, say, an application that computed
> German telephone rates, to be in German?
> If everyone writing and using the app
> are German speakers, why should they use
> English?
That's a typical example of an in-house application. And if it's not, there is a good chance that it is applicable to other countries too.
Please don't get me wrong, I don't mind if someone writes non-English source code. But if you do so, you just can't expect any help from the world-wide software community. And that's what Open Source is about, isn't it?
Also, I happen to be a native German speaker, and I can tell you some purely pragmatic reasons to write your code in English:
a) programming languages use English keywords
b) all openly available libraries (I know of) use English identifiers
c) it's just easier to express technical problems in English
d) if you're into programming, you have to know English anyway
e) it trains your English skills
f) you're just used to it
Re: You are wrong!
> You really have no clue. ASCII is a 7
> bit character set. It doesn't contain
> the characters needed for most European
> languages, e.g. German umlauts or French
> accented characters.
>
Wrong. There is an 8-bit ASCII, which supports these.
Re: You are wrong!
> > You really have no clue. ASCII is a 7
> > bit character set. It doesn't contain
> > the characters needed for most European
> > languages, e.g. German umlauts or
> > French accented characters.
%
> Wrong. There is an 8-bit ASCII, which
> supports these.
No, there is no 8 bit ASCII. Period.
There are ASCII-based 8 bit character sets. For instance, ISO 8859-1, ISO 8859-2, and so on, also the recently introduced ISO 8859-15. You get the idea?
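If you want to see the difference for yourself, here is a small sketch using Python's built-in codecs. The byte values chosen are only examples.

```python
# ASCII proper stops at 0x7F, so an umlaut cannot be encoded in it.
try:
    "\u00e4".encode("ascii")        # a-umlaut
    umlaut_fits_in_ascii = True
except UnicodeEncodeError:
    umlaut_fits_in_ascii = False

# The same 8-bit value means different things in different ISO 8859 parts.
byte = b"\xa4"
in_latin1 = byte.decode("iso8859-1")    # generic currency sign
in_latin9 = byte.decode("iso8859-15")   # euro sign

# The lower half (0x00-0x7F) is plain ASCII in both.
lower_half_same = b"mount".decode("iso8859-1") == b"mount".decode("iso8859-15")

print(umlaut_fits_in_ascii, in_latin1, in_latin9, lower_half_same)
```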
Re: You are wrong!
> The Western alphabet as we know now is
> obsolete. The binairy system is the new
> character set.
>
> I.e. why have the Western alphabet when
> we have binary? I mean, we can encode
> every letter into binary. That means we
> only need a 2-button keyboard!
0010001111110000011100000011010101010100011
11000001100100101010010101010010010100101001
01001010100101010100101010000011111000110101
100010010100110010101010000011101000100011101
Re: You are wrong!
> No, there is no 8 bit ASCII. Period.
>
> There are ASCII-based 8 bit character
> sets. For instance, ISO 8859-1, ISO
> 8859-2, and so on, also the recently
> introduced ISO 8859-15. You get the
> idea?
>
Oops, you're right, sorry for any confusion this may have
caused.
Re: You are wrong!
> Oh man, please no rassism here.
That's "racism", and none was intended. I admire many
Japanese people. I was referring to the Japanese language,
not the people.
Re: You are wrong!
> Exactly.
>
> I just want to add: You (the guy who
> started this thread) obviously never did
> any serious programming. Otherwise you'd
> know what I means to make a program work
> with different languages, locale
> settings, character sets and so on.
>
Why would I want to make my program work with
other languages? If my program is open source, they can
make their own language version.
Re: You are wrong!
> Why would I want to make my program work
> with other languages? If my program is open
> source, they can make their own language version.
Now, really, come off it. You aren't seriously proposing to write different versions of an application for every language? Frankly, I really hope you're just kidding, perhaps laughing at the irritated replies.
So, in case you aren't just trolling: The idea is to make it as easy as possible for you to internationalize your applications. Most of the time, you won't need much more than gettext(). And of course being a bit less ignorant would be helpful, too.
Re: Why is this taking the form of a religious debate?
> To be more clear about this, getting the
> Nth character of a fixed-char-size
> string is a constant-time operation (O1)
> and takes the same amount of time
> whether N is equal to 5 or 500. On the
> other hand, getting that same character
> in a variable-width-char string is a
> linear operation (On) and takes
> approximately one hundred times as long
> to get the 500th character versus the
> 5th character.
> The second step is to impress upon
> programmers the algorithmic efficiency
> advantages of using fixed-width
> characters in their programs instead of
> variable-width.
This is not necessarily true: with combining characters, bidirectional text, and other Unicode features, you will still need to do the same amount of work with wide characters as you do with multibyte characters.
An additional benefit of using UTF-8 internally is byte-order independence, which bypasses a perennial problem faced when making code portable.
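Both points can be illustrated with a short Python sketch. The helper name is my own, and counting non-combining code points is only a crude stand-in for real grapheme segmentation.

```python
import unicodedata

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# Even with fixed-width code points, one visible character may be
# one or two array slots, so "the Nth character" still needs a scan.
lengths = (len(composed), len(decomposed))


def count_base_chars(s: str) -> int:
    """Count code points that are not combining marks (rough approximation)."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


visible = (count_base_chars(composed), count_base_chars(decomposed))

# Byte order: UTF-16 has two serialisations, UTF-8 has exactly one.
utf16_forms = ("A".encode("utf-16-le"), "A".encode("utf-16-be"))
utf8_form = "A".encode("utf-8")

print(lengths, visible, utf16_forms, utf8_form)
```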
Re: Your title, "True Internationalization" holds the key to the requirements.
>
> Unfortunatly 32 bit chars do require
> conversion libs and will be context
> sensative because you cannot fit all the
> possible characters for all languages
> into a single 32 bit code space. -
> Or perhaps you are proposing that some
> people's languages are not important
> enough to include?
> (This requires special codes to switch
> to a new code space. The resulting
> context problems are a much more severe
> programming problem than different byte
> lengths for various characters.)
>
How many languages do you think there are? 32 bits would support about 4 million languages with 1000 characters each (2^32 is roughly 4.3 billion code points)! Admittedly, there are some languages with more than 1000 characters, but it's also true that many languages share a character set.
What I could use!
I started my computing in the world of Macs.
That was nice because I write (to date mainly in roman alphabets) in 4
languages and the Mac let me add optional characters and the text
looked good.
Now on the Windoze side - life is miserable - I can even find a neat
way to switch so that I can use the diacritical characters from
German, French or Italian and when it comes to Russian, which I have
just started working on, I am totally confused as to how to switch and
by that I mean switch back and forth.
This proposal and even some of the suggested alternatives still do NOT
address the ability to add characters or blocks of text in another
language easily.
If there is some way of doing this (easily and intuitively!!!!!)
please let me know. If what you know is a work around involving
changing keyboard layout for each language - I am not really
interested for as a fast touch typist I don't want to learn a large
number of new keyboard layouts!!
Getting off my soapbox now.
machinisttenor
Re: What I could use!
sorry just correcting a spelling mistake from the post!!
> I started my computing in the world of
> Macs.
>
> That was nice because I write (to date
> mainly in roman alphabets) in 4
> languages and the Mac let me add
> optional characters and the text
> looked good.
>
> Now on the Windoze side - life is
> miserable - I can't even find a neat
> way to switch so that I can use the
> diacritical characters from
> German, French or Italian and when it
> comes to Russian, which I have
> just started working on, I am totally
> confused as to how to switch and
> by that I mean switch back and forth.
>
> This proposal and even some of the
> suggested alternatives still do NOT
> address the ability to add characters or
> blocks of text in another
> language easily.
>
> If there is some way of doing this
> (easily and intuitively!!!!!)
> please let me know. If what you know is
> a work around involving
> changing keyboard layout for each
> language - I am not really
> interested for as a fast touch typist I
> don't want to learn a large
> number of new keyboard layouts!!
>
> Getting off my soapbox now.
>
> machinisttenor
>
Re: Why is this taking the form of a religious debate?
Combining characters are not as much an issue if you enlarge your character size. Assuming a 32-bit character with one bit excluded for internal use (effectively a 31-bit character) you have 2.15 billion characters from which to choose. You will forgive me if I don't see an immediate and urgent limitation.
"Planning for the future" is not really viable in that the future tends to draw out possibilities previously unseen. You have to work with what you've got. What happens if we have visitors from outer space and must incorporate their characters (and the other billion species')? No, not very likely. Not likely at all. But it would totally invalidate any "perfect" solution that we might come up with today. I imagine there are other possibilities not as remote as the "alien contact" mentioned above that would still put a monkey-wrench in a "perfect" character-encoding solution.
As for bi-directional text (misleading when you consider that text in the world has more than two possible directions -- think up/down), the display order of the text is not necessarily the logical order. Just because the display of the text is right-to-left (for example) doesn't mean that the characters must be kept in memory in contrary order to left-to-right characters. It just means that the first character in a block is rendered on the right instead of the left. It does not have to be a BiDi text issue.
You're right that byte order could be an issue on some platforms when serializing strings. Explicit serialization to UCS-2 or UCS-4 is needed. Point taken.
I almost forgot...
Just in moving people away from ASCII and the assumption that characters should be treated as a single byte, you will have solved 90% of the problem. Honestly, who cares what universal encodings are out there as long as you recognize that other encodings exist. Once a programmer recognizes that encodings (besides ASCII) exist and are worth supporting, the simple/dumb routines for character input will start to fade into the background.
Once the data's in memory, who cares what format it's in? The only ones who must interoperate with in-memory strings are the developers and the maintainers, and they should be using whatever encoding best fits the task at hand. In some cases, it'll be UTF-8. In others, it may be UCS-4. In others, Shift-JIS may fit the bill. As long as there is a conversion routine to and from Unicode for whatever you are using in your program, you can get from any encoding in the world to any other.
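The "Unicode as the hub" idea fits in one line. Here is a minimal sketch in Python; the function name is hypothetical, and the codec names are the ones Python's standard library ships.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert between two legacy encodings by going through Unicode."""
    return data.decode(source).encode(target)


text = "\u65e5\u672c\u8a9e"              # the word "Japanese", in kanji
sjis = text.encode("shift_jis")

as_utf8 = transcode(sjis, "shift_jis", "utf-8")
as_eucjp = transcode(sjis, "shift_jis", "euc_jp")
round_trip = transcode(as_eucjp, "euc_jp", "shift_jis")

print(sjis.hex(), as_utf8.hex(), as_eucjp.hex())
```

With N encodings you need N converters to and from Unicode rather than one for every pair.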
But of course, the hard part so far is getting C coders to accept a non-ASCII world. This is not spite toward C for spite's sake. C and its derivatives are some of the last popular holdouts that (a) have little i18n and l10n support in the standard language and (b) make little distinction between the concept of a character/string and a byte array that represents a character/string. If strings were more of an abstraction -- putting in a different back-end and letting the compiler deal with these details -- the world would be a better place (most likely with fewer buffer overflow vulnerabilities as well).
Japanese is proprietary
> Now, really, come off it. You aren't
> seriously proposing to write different
> versions of an application for every
> language? Frankly, I really hope you're
> just kidding, perhaps laughing at the
> irritated replies.
>
> So, in case you aren't just trolling:
> The idea is to make it as easy as
> possible for you to internationalize
> your applications. Most of the time, you
> won't need much more than gettext(). And
> of course being a bit less ignorant
> would be helpful, too.
Japanese does not adhere to standards; it is proprietary.
It does not use the roman alphabet, is read from right to left,
and uses both kana and kanji, having two sets of chars for
the same thing. It's fine to make your own language, just
use the standards. Japanese is only spoken in one place,
Japan, and is not the official language anywhere else.
It uses its own proprietary charset, and a proprietary method
of reading (right to left), so Japanese is not open! The
roman alphabet is an open standard. Japanese is as closed
as Microsoft Windows.
Re: Japanese is proprietary
> Japanese does not adhere to standards, it is propritary,
> not using the roman alphabet, is read from right to left.
> uses both kana and kanji, having two sets of char for
> the same thing. It's fine to make your own language, just
> use the standards. japanese is only spoken in one place,
> japan, and is not the offical language anywhere else.
> It uses it's own propritary charset, and a propritary method
> of reading (right to left), so japanese is not open! The
> roman alphabet is an open standard.
> Japanese is as closed as microsoft windows.
Well, the world's most popular language is probably Chinese...
Re: What I could use!
I'm not too much into this, but I believe Pango (www.pango.org/) addresses this particular problem. If you right-click on an entry box in GTK+ 2.0 (www.gtk.org/), a context menu appears which allows you to switch input methods on the fly. I haven't really tried it yet though (you probably need special hardware for it, dunno), but could that be what you're looking for?
Re: Why is this taking the form of a religious debate?
I do agree that a fixed-length character string representation is more efficient than a variable-length one.
But I think it's not as bad as some claim.
> To be more clear about this, getting the
> Nth character of a fixed-char-size
> string is a constant-time operation (O1)
> and takes the same amount of time
> whether N is equal to 5 or 500. On the
> other hand, getting that same character
> in a variable-width-char string is a
> linear operation (On) and takes
> approximately one hundred times as long
> to get the 500th character versus the
> 5th character. The same holds true for
> substring operations.
I wonder if this is really an issue. When I process text, what I mostly do is either scan the text sequentially or use some search operation (like a substring match or a regexp match). Indices are returned as the result of the search operation so that I can extract the matched region from the string, but they need not be character indices; any kind of pointer does the job, and it's possible to create a pointer object that accesses any part of the string in constant time.
I can hardly think of a case where I have to apply a predetermined character index that is not the result of a search operation. Maybe my experience is limited.
The search operation for variable-length characters is slower than the one for fixed-length characters. That's an issue. But it's not as bad as O(N) versus O(1).
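Here is roughly what I mean, sketched in Python on raw UTF-8 bytes. The sample text is made up; the point is only that the search hands back a byte offset, and slicing at a byte offset needs no character counting.

```python
text = "Gr\u00f6\u00dfe: \u65e5\u672c\u8a9e text"
haystack = text.encode("utf-8")
needle = "text".encode("utf-8")

# The search returns a byte offset, which acts as the "pointer".
start = haystack.find(needle)

# Extracting the match is a plain slice at that offset.
match = haystack[start:start + len(needle)].decode("utf-8")

# The byte offset is not the character index, and it does not need to be.
char_index = text.find("text")
print(start, char_index, match)
```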
> Character replacement gets a lot more tricky as well.
Yes, it is tricky. But again, how common is the operation of substituting characters in place? When I use programming languages that don't have automatic memory management, I tend to do in-place replacement as much as possible.
But in many situations the size of replacement string differs from the size of original region and I end up reallocating whole string.
If the language supports garbage collection, I even think that prohibiting in-place replacement is more beneficial,
because it allows me to share substrings without worrying that the shared storage will be modified inadvertently.
(I assume I use some kind of "string object" that has the pointer to the storage of actual string).
I don't have any concrete performance comparison between fixed- vs variable-length string representation.
Excuse me if my idea is irrelevant.
Why must you kill ASCII?
I tried to defend ASCII. I love ASCII, all my files are in it.
I don't like these ideas about unicode, etc. Because not only
would we have to translate everything from ASCII, It would
require everything to be 2 bytes per char. When I said 8-bit
ASCII, I meant the IBM extended character graphics set.
It works in terminal mode. Also, Japanese characters
are not part of the characters in the fixed-mode terminal
graphics. I suppose someone could start requiring those
non-fixed computers only, the current character mode is
out-of-date, but it is still a great mode, and I would hate to
see it or ASCII lost. ASCII uses 1-byte per character to
encode. And it doesn't use all 255 different states. So, why
don't we just fill in the empty states with the rest of the roman
alphabet? That would solve the problem with all european
languages. I know that the roman alphabet is supported by
text mode. I'm sure I saw german characters in lynx.
Japanese characters are not supported on the console, so
if a standard that included japanese became the standard,
no more console, at least as we know it! So, why can't all the
eastern languages define their own standards? I mean,
if we include all 2000 japanese characters, plus arabic, plus
chinese, etc. in one standard, all the chars would need 4-byte
encoding! I know you're going to say: Our 80-gb computers
can do that now. But that's just a waste of space, i mean,
everybody wants to make more wasteful program formats
that are wasteful, just so they can make all the updates in
speed worthless. Text is the one standard that has not
bloated, and now you want to mess it up? Well, before you
do, consider this: All the text files would have to be translated,
and that would be a nightmare. We could never get all of
them translated into UCF-4 or UTF-8. Remember before
ASCII was defined? All the word processors used their own,
incompatible formats, and when one word processor went
out, all those files had to be translated into the new word
processor's code. But then, somebody decided that there
should be a lasting standard, and that somebody created ASCII.
And then, all the files that were written in ASCII never had
to become unreadable. ASCII was meant to last forever.
So, if you destroy ASCII, those files are worthless. And the
dream of having a lasting format is dead, sending us back
into the anarchy of no clear standard. Destroy ASCII, and you
incite people to keep creating new standards, until no clear
standard exists. And are all those ASCII text editors, Vi, Emacs,
Nedit, Xedit, etc., to become worthless as well? And why has
no one mentioned HTML as a new standard? It has what
you're looking for. Please, don't take ASCII away.
Re: Why must you kill ASCII?
Hi,
> I tried to defend ASCII. I love ASCII, all my files are in it.
And I liked my abacus. Though no-one destroyed it, I don't get
to use it anymore ;-)
> I don't like these ideas about unicode, etc. Because not only
would we have to translate everything from ASCII,
If *all* your data is US-ASCII, why are you worried? A simple
conversion tool would do the job.
[SNIP]
> So, why can't all the eastern languages define their own
standards?
See below, towards the end, for your own contradiction...
> I mean, if we include all 2000 japanese characters,
plus arabic, plus chinese, etc. in one standard, all the chars
would need 4-byte encoding! I know you're going to say: Our
80-gb computers can do that now. But that's just a waste of
space, i mean, everybody wants to make more wasteful program
formats that are wasteful, just so they can make all the
updates in speed worthless.
Real life seems to disagree with you; if we went by your token
we definitely would not need anything better than 4-bit CPUs
(hell, maybe 2-bit would be simpler --a toggle switch ;-)
Have you noticed how compiled code size increased over the
years --take for example Linux kernel, MS Office, or any such
thing which basically does the same thing that it did years ago.
Why? Well, unfortunately we are not all the same, and hence
we have differing needs that need to be taken care of.
> Text is the one standard that has not bloated, and now you
want to mess it up?
'Text', as you call it, *is* already messed up --unless your
world is very constrained or confined to one single codepage
you will find that you will not be able to do much with it.
> Well, before you do, consider this: All the text files would
have to be translated, and that would be a nightmare. We
could never get all of them translated into UCF-4 or UTF-8.
Why is it such a nightmare? I do this almost every time I
open a text that was saved in a different codepage..
> Remember before ASCII was defined? All the word processors
used their own, incompatible formats, and when one word
processor went out, all those files had to be translated
into the new word processor's code.
Do you listen to what you're saying? Only a couple of parags
above you were requesting every nation (lang community) to
define *their own* standard. How is it different from your
gloom definition of each word processor having their own
format?
> But then, somebody decided that there should be a lasting
standard, that somebody created ASCII. And then, all the files
that were written in ASCII never had to become unreadable.
Luckily, that someone was a lot more farsighted, for his/her
time, than you are now; or else we'd be stuck with millions
of applications using what each thought best and nothing in
common.
> ASCII was meant to last forever.
So were Roman numerals. Get over it.
> So, if you destroy ASCII, those files are worthless.
No. Data will live on. Only the apps that use it have to
be revised.
> And the dream of having a lasting format is dead, sending
us back into the anarchy of no clear standard.
Dreams... Just that, a pipe dream, and a very short-sighted one
at that.
> Destroy ASCII, and you incite people to keep creating new
standards, until no clear standard exists.
You seem to have missed the point: what prompted the original author was the fact that, as things stand, there is no true standard that takes care of things.
> And, all those ASCII text editors, Vi, Emacs, Nedit, Xedit,
etc., to become worthless as well?
emacs... about 50 megs of code for a text editor... was it you
who was complaining about code bloat?
> And why has no one mentioned HTML as a new standard? It has what
you're looking for.
HTML has its uses, but is not what you think it is.
> Please, don't take ASCII away.
No one is going to. Rest assured. Much like the --by now-- proverbial
Roman Numerals, you can have it. But you will want to move
on, the moment you want to do anything universally useful.
Cheers,
Adem
Re: Why must you kill ASCII?
> Do you listen to what you're saying?
> Only a couple of parags
> above you were requesting every nation
> (lang community) to
> define *their own* standard. How is it
> different from your
> gloom definition of each word processor
> having their own
> format?
Very different. I'm saying that the languages that don't use
the roman alphabet should define their own standards.
Like, arabic, japanese, korean, chinese, hebrew, they should
define their own standards. Because they don't use the
roman alphabet, or any characters from the roman alphabet,
there would be no advantage to adding too many charsets.
roman alphabet text and those langs above do not share any
letters, so nobody would be reading roman alphabet text and
those langs above text together. Saying that we need to put
roman alphabet text in the same standard as those langs
above text is like saying that graphics & text need to share
the same format. Would you like to save all your text in XPM
format? So, like I said before, don't take ASCII away.
Re: It's not that bad, actually
>
>
> % Oh really? Doesn't it rather depend
> on
> % who will be using and writing the
> % program? Wouldn't it make more sense
> % for, say, an application that
> computed
> % German telephone rates, to be in
> German?
> % If everyone writing and using the app
> % are German speakers, why should they
> use
> % English?
>
>
> That's a typical example of an in-house
> application. And if it's not, there is a
> good chance of it to be applicable to
> other countries too.
> Please don't get me wrong, I don't mind
> if someone writes non-English source
> code. But if you do so, you just can't
> expect any help from the world-wide
> software community. And that's what Open
> Source is about, isn't it?
>
> Also, I happen to be a native German
> speaker, and I can tell you some purely
> pragmatic reasons to write your code in
> English:
> a) programming languages use English
> keywords
> b) all openly available libraries (I
> know of) use English identifiers
> c) it's just easier to express technical
> problems in English
> d) if you're into programming, you have
> to know English anyway
> e) it trains your English skills
> f) you're just used to it
>
I don't agree with point c) ... I find it harder to express a problem or its solution in English than in my mother tongue.
And all the other points only apply to code, not comments...
Re: What I could use!
>
> I'm not too much into this, but I
> believe Pango addresses this particular
> problem. If you right-click on an entry
> box in GTK+ 2.0, a context menu appears
> which allows you to switch input methods
> on the fly. I haven't really tried it
> yet though (you probably need special
> hardware for it, dunno), but could that
> be what you're looking for?
>
I thought he was looking for an efficient way to
include foreign characters in a text.
The Pango idea is an upper-level
thing that excludes textmode users. As I wrote
in my article, I propose a system-wide (kernel
module?) language input parser, that, by pressing a
key-combination, can be switched to another
input. And about the keyboard layout, I think
there should be an additional translation table
where one can define how the keys on a particular
keyboard produce keycodes that are handled
as input to a foreign language parser. For example, Microsoft's
IME for Japanese uses QWERTY, my keyboard
has QWERTZ (German), and I cannot simply say:
All my z's are y's from now on. I have to change
the whole syllable mapping or adapt to typing QWERTY
blindly. Of course, there must be an overview of
the code tables and also the possibility to type
in character code directly if somebody is not aware
of the reading. Did you mean that by "efficient"?
My hope is that we can start a development group
that works on this low-level input handling. As a
result, we could provide a consistent way of entering
multi-language text, and GUIs could turn their focus
to the graphics.
Re: Why must you kill ASCII?
> Very different. I'm saying that the
> languages that don't use
> the roman alphabet should define their
> own standards.
> Like, arabic, japanese, korean, chinese,
> hebrew, they should
They have their standards, there's no
"character anarchy". :-)))
But (take a look under "dictionaries"
in your favourite bookstore) sometimes
text of different languages must be mixed.
And when standards exist whose
codepages intersect (every country would
have its own system ranging
from 0 to what they need) it gets difficult...
> the same format. Would you like to save
> all your text in XPM
Exactly, no one would do that! So we still
want to handle text (e.g., comment on it in an email),
even if it is from a person with another mother tongue.
So the computer must understand it...
Re: It's not that bad, actually
> % So one can program in non-ASCII. I
> % don't see any
> % difficulties besides the recoding of
> % given sources.
> No, you can't. I'm talking about Open
> Source, and source code not written in
> English is not open.
I wasn't talking about non-english source code,
(though there are people that do that and like
the way they do so), I thought of malloc's and
sizeof's.
> % UTF-8, it passes it to a library
> which
> % has to parse it. It takes time and is
> % not very secure.
>
> Time is less of a problem than memory
> nowadays. Also, I'm not sure what you
> mean with "security".
Text is still the least mass of data. We have
audio and video streams that really need
gigabytes. But anyway, security can turn into
a problem when we pass UTF-8 into functions
that replace characters or substrings. Then
the strlen of the new character (it sounds crazy
but with different encoding sizes this is the way)
might differ from what it replaces. Especially
in critical situations (ftp/http daemons, passwd
and the whole security stuff), "evil hackers" might
want to use this vulnerability.
In my eyes, this leads to a lot of problems.
A clearly defined char size would make things easier.
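
The size concern above can be made concrete with a small Python sketch. The name, the replacement, and the 6-byte buffer are made-up assumptions for illustration, not taken from any real daemon:

```python
# Sketch: replacing a single character can change the byte length in UTF-8,
# while a fixed-width encoding (UCS-4 / UTF-32) keeps the size unchanged.
name = "Muller"
fixed = name.replace("u", "ü")            # one character swapped for another

old_size = len(name.encode("utf-8"))      # every ASCII char takes 1 byte
new_size = len(fixed.encode("utf-8"))     # "ü" takes 2 bytes in UTF-8

BUFFER_SIZE = 6                           # hypothetical fixed-size buffer
fits = new_size <= BUFFER_SIZE            # the replaced string no longer fits

old_wide = len(name.encode("utf-32-le"))  # 4 bytes per character
new_wide = len(fixed.encode("utf-32-le")) # still 4 bytes per character

print(old_size, new_size, fits, old_wide, new_wide)
```

Code that sizes its buffers in bytes before the replacement is where this would bite; code that re-measures after each change is not affected.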
Re: Why must you kill ASCII?
Oh my god, you're driving me crazy. Let me yell at you to make sure you understand it:
UTF-8 is entirely ASCII compatible! None of your f*cking ASCII text files will require any kind of f*cking conversion!
Sorry.
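
For anyone who wants to check the claim, here is a minimal Python sketch. The sample string is just an assumption for illustration:

```python
# Sketch: an ASCII text has exactly the same bytes when encoded as UTF-8.
ascii_text = "Please, don't take ASCII away."

as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")

# Identical byte strings: an existing ASCII file is already valid UTF-8.
same_bytes = as_ascii == as_utf8

# Non-ASCII characters use only bytes >= 0x80 in UTF-8, so they can never
# be mistaken for ASCII bytes inside the same stream.
umlaut_bytes = "ü".encode("utf-8")
no_ascii_collision = all(b >= 0x80 for b in umlaut_bytes)

print(same_bytes, umlaut_bytes, no_ascii_collision)
```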
Re: It's not that bad, actually
> Text is still the least mass of data. We have
> audio and video streams that really need
> gigabytes. But anyway, security can turn
> into a problem when we pass UTF-8 into
> functions that replace characters or substrings.
> Then the strlen of the new character (it
> sounds crazy but with different encoding sizes this
> is the way) might differ from what it replaces.
a) This applies only to replacements of single characters, not substrings.
b) Please tell me a realistic example where replacing a single character is of any use, without conflicting with i18n issues.
> Especially in critical situations (ftp/http
> daemons, passwd and the whole security stuff),
> "evil hackers" might want to use this vulnerability.
> In my eyes, this leads to a lot of problems.
> A clear defined char size would make things easier.
I don't think rewriting all apps to use a different char size would lead to fewer bugs. And anyway, I think some of the other postings covered that topic very well: single character operations are of somewhat limited use, regardless of whether you use UCS-4 or UTF-8.
And don't forget the endianness problem.
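
A rough Python illustration of the endianness point, assuming the UCS-4 data is written without a byte order mark (Python calls the encoding UTF-32):

```python
# Sketch: the same character has two different UCS-4 byte layouts,
# while UTF-8 is a plain byte sequence with no byte order at all.
ch = "ü"                           # U+00FC

le = ch.encode("utf-32-le")        # little-endian: low byte first
be = ch.encode("utf-32-be")        # big-endian: high byte first
utf8 = ch.encode("utf-8")          # same bytes on every machine

# Reading little-endian data as big-endian yields 0xFC000000, which is
# outside the Unicode range, so the decoder rejects it.
try:
    misread = le.decode("utf-32-be")
except UnicodeDecodeError:
    misread = None

print(le, be, utf8, misread)
```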
Re: What I could use!
I don't believe you will get anywhere by waiting for a "group of developers" to
do it... If you want to see your ideas carried out, just get going! If it
proves useful, people will adopt it, and maybe some day it will become a
standard -- that's the way development takes place in the "Bazaar" world...
BTW, on a technical note: When conversion routines reside in the C library
anyways (the kernel is probably *not* the best place for them), then where is
the point in storing int chars to disk? I believe it would be perfectly OK to
keep these for the applications, and store UTF-8 to disk... This should allow a
much smoother transition, as "old" applications could just carry on as
always, while adapted ones would use the new conversion routines. (This idea
isn't really new; I considered using ints for application-level string handling
a while ago...)
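
The split described here (UTF-8 on disk, integers inside the application) could look roughly like the Python sketch below. The two helper functions are hypothetical names, and Python's built-in codec stands in for the C library routines:

```python
from array import array


def load_code_points(data: bytes) -> array:
    """Decode UTF-8 bytes (as stored on disk) into an array of integers."""
    # "I" is an unsigned int, usually 4 bytes wide; that width is an
    # assumption about the platform, not a guarantee.
    return array("I", (ord(c) for c in data.decode("utf-8")))


def store_code_points(points: array) -> bytes:
    """Encode an array of code points back into UTF-8 for storage."""
    return "".join(map(chr, points)).encode("utf-8")


on_disk = "Grüße, 世界".encode("utf-8")   # variable-length bytes
in_memory = load_code_points(on_disk)      # one integer per character

third = chr(in_memory[2])                  # constant-time index into the text
round_trip = store_code_points(in_memory) == on_disk

print(len(on_disk), len(in_memory), third, round_trip)
```

Old applications keep reading the bytes as before; only programs that want per-character indexing pay for the conversion.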
Re: Why must you kill ASCII?
>
> Oh my god, you're driving me crazy. Let
> me yell at you to make sure you
> understand it:
>
> UTF-8 is entirely ASCII compatible! None
> of your f*cking ASCII text files will
> require any kind of f*cking conversion!
>
> Sorry.
>
>
So, UTF-8 is ASCII with the rest of the roman
alphabet filled in? That's what I said we should
do. (When I say the Roman Alphabet, I mean all
european languages that are based on that charset,
e.g, german, french, spanish, english, etc.)
Re: Why must you kill ASCII?
> So, UTF-8 is ASCII with the rest of the roman
> alphabet filled in? That's what I said we should
> do. (When I say the Roman Alphabet, I mean all
> european languages that are based on that charset,
> e.g, german, french, spanish, english, etc.)
UTF-8 is ASCII with the rest of Unicode filled in...
Hell, yeah, you see? No reason left to keep non-Roman languages out. Get yourself educated.
Re: You are wrong!
>
>
> % And, ASCII includes
> % symbols for european languages (not
> just
> % english), like german, french,
> spanish,
> % etc. Which are what half the world
> % speaks.
>
>
> You really have no clue. ASCII is a 7
> bit character set. It doesn't contain
> the characters needed for most European
> languages, e.g. German umlauts or French
> accented characters.
>
Actually, ASCII does contain the characters
needed for most European languages.
Ever heard of backspace? Overprinting was
the way to go, but somewhere along the line
glass teletypes lost that capability...
Two steps approach
1. Codepages *SUCKS*
2. Moving all Linux code to UTF-8 as a first step toward getting rid of codepages is good. I expect all important Linux code will be UTF-8 compatible in two years at most.
3. After we have everything translated to UTF-8, we can move to UCS-4 in no time. Moving there from the point where we are now would be much more complicated because of codepages. This step (UCS-4) will make string operations both easier to write and faster, while only a UTF-8 -> UCS-4 conversion will be needed in the system, assuming all text is UTF-8 already.
Re: It's not that bad, actually
> b) Please tell me a realistical example
> where replacing a single character is of
> any use, without conflicting with i18n
> issues.
strtolower, strtoupper, and of course
parsing stuff...but I got your point.
Yes, it's rather seldom. And you're right,
substrings cause problems anyway.
> I don't think rewriting all apps to use
> a different char size would lead to less
> bugs. And anyway, I think some of the
If the compiler returned a different
char size, there would not even be a need to rewrite
code.
Re: Your title, "True Internationalization" holds the key to the requirements.
> many languages share a character set.
Yeah, and as soon as we try to combine them we get some of the existing CJK Unicode problems. The only Right Way to do it is delineate by language, not by typography.
Re: It's not that bad, actually
> strtolower, strtoupper, and of course
> parsing stuff...but I got your point.
> Yes, it's rather seldom. And you're
> right, substrings cause problems anyway.
strlower() and strupper() are not i18n safe. For instance, strupper() would have to convert the German "ß" to "SS" -- which it doesn't. The g_utf8_strup() function in GLib 2.0 gets that right because it returns a newly-allocated string.
> In case the compiler would return a different
> char size, there would be even no need
> to rewrite code.
I daresay most C programs out there would break if you did that. I bet most programs contain some piece of code that assumes sizeof(char) == 1. And there's nothing wrong with that, IMHO -- char is a synonym for byte anyway.
You gotta make people think about i18n -- fooling around with the char size won't help.
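
The "ß" case is easy to try out. This sketch uses Python's own `str.upper()`, which follows the Unicode case mappings; it is not the GLib function, only an illustration of the same effect:

```python
# Sketch: upper-casing can change the number of characters, so it cannot
# be done by overwriting each character in place.
word = "straße"
upper = word.upper()            # "ß" has no single-character uppercase here

grew = len(upper) - len(word)   # how many characters the string gained

print(upper, grew)
```

This is why a routine that returns a newly allocated string can get the result right, while one that modifies the buffer character by character cannot.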
Re: Two steps approach
> 3. After we have everything translated
> to UTF-8, we can move to UCS-4 in no
> time. Moving there from the point where we
> are now would be much more complicated
> because of codepages. This step (UCS-4)
> will make string operations both easier
> to write and faster while Only UTF-8
> -> UCS-4 conversion will be needed in
> the system assuming all text will be
> UTF-8 already.
Yes, this is the way to go. And I suspect a lot of apps won't even need to convert to UCS-4 later. It should definitely be a per-app decision to do so -- if you just want to display strings, as most GUI apps do, there is no need to deal with character-conversion at all.
ASCII
Well, I've still got one point: ASCII is capable of supporting
the roman alphabet. For example, I went to a german web site,
found some german characters, pasted them into an xterm running
vi, saved it, closed vi, and opened vi with the file. The german
characters were still there, ü for example, was still there. So, if
ASCII doesn't support that character, that would be
impossible. So, ASCII must support that character, right?
Even though your point about people who speak some
language like arabic, japanese, etc. and a roman language
may be true, but, there are a lot of people who speak only
roman-based languages. So, people who only speak roman
based languages should not have to make their files 4 times
bigger for the others, should they? So, UCF-4, etc. already
exist. So, if you plan to speak non-roman languages, use it.
I dislike non-roman languages.
Re: ASCII
You were just lucky. ASCII chars are 0-127, and the "u umlaut" (ü) you mentioned is number 129 in many codepages. So it happens that in several codepages it is ü, but in some countries it can be anything else, depending on the codepage used. The codepage determines how these chars (128-255) are interpreted. In the Czech Republic we have many codepages, among others the old "Soviet" style KOI codepage, Kamenicky (also obsolete), Latin-2, ISO, Windows-1250, to name a few :-). So we even have Czech -> Czech text file conversion programs.
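
To see the effect, here is a short Python sketch that decodes the same byte, 129, under a few codepages. The selection of codecs is just an assumption for illustration; Kamenicky is not among the codecs Python ships:

```python
# Sketch: one byte value, several meanings, depending on the codepage.
raw = bytes([129])

decoded = {}
for codec in ("cp437", "cp850", "cp866", "koi8_r"):
    decoded[codec] = raw.decode(codec)

# In the ISO 8859 family and Windows-1250, "ü" sits at a different value.
latin1_value = "ü".encode("latin-1")[0]
latin2_value = "ü".encode("iso8859_2")[0]
cp1250_value = "ü".encode("cp1250")[0]

print(decoded, latin1_value, latin2_value, cp1250_value)
```

Without knowing which codepage a file was written in, the byte alone does not say which character was meant.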
Re: What I could use!
> I don't believe you will get anywhere by
> waiting for a "group of developers" to
> do it... If you want to see your ideas
> carried out, just get going! If it
Well, I said "we", didn't I?...I'm not waiting
for anybody else. I want to start, yes, but
if I do it alone then the result will probably
be a poorly designed, inflexible and therefore
unsupported piece of code. So I would like
to see more enthusiasm for designing this central
thing rather than debating UCS-4 versus
UTF-8. What I have learned: People hang on to
their codepages!!!!!!!!! So let's forget this encoding
discussion, I have some basic ideas to think about:
1. If the parser buffers were system-wide and static, it
would hurt multi-user ability. Because of that,
input buffers must be available for each user (and
probably for each open input channel; quite often
I have an X session with multiple terminals, and I
would go crazy if there were already input content
when switching from one to another. On the other
hand, we could also clear the buffers when switching).
2. There must be several layers to control input:
a - keyboard code - physically raised code
b - key mapping ("virtual" keys can be emulated)
c - globally unique encoding (whatever UTF-8/UCS-4)
d - input buffer
e - pre-parsed (displayable) content, where following context may require changes
f - output (displayable) context, which should be taken over
by the terminal or GUI
That is my first idea, maybe it's quite bullshit.
Please let me know if you have other ideas.
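
The layers listed above can be sketched in a few lines of Python. Everything here is hypothetical: the keycodes, the two tables, and the `InputChannel` class are made up to illustrate the idea and are not the API of kinput2, IME, or any existing input method:

```python
from collections import defaultdict

# (b) Key mapping for one physical layout. Hypothetical table: on a German
# QWERTZ keyboard, key 21 yields "z" and key 44 yields "y".
KEYMAP_QWERTZ = {21: "z", 44: "y", 30: "a", 37: "k"}

# Tiny made-up syllable table for a romaji-to-kana parser.
SYLLABLES = {"ka": "か", "ya": "や", "za": "ざ"}


class InputChannel:
    """One input buffer per user and open channel (idea 1 in the list)."""

    def __init__(self, keymap, syllables):
        self.keymap = keymap
        self.syllables = syllables
        self.buffer = ""   # (d) raw input that is not yet parsed
        self.output = ""   # (f) text handed over to the terminal or GUI

    def press(self, keycode):
        # (a) -> (b): translate the physical keycode through the layout table.
        self.buffer += self.keymap[keycode]
        # (d) -> (f): once the buffer forms a known syllable, emit it.
        if self.buffer in self.syllables:
            self.output += self.syllables[self.buffer]
            self.buffer = ""

    def clear(self):
        # Alternative policy from the list: drop pending input on switching.
        self.buffer = ""


# A separate channel is created on first use for each (user, terminal) pair.
channels = defaultdict(lambda: InputChannel(KEYMAP_QWERTZ, SYLLABLES))

tty1 = channels[("alice", "tty1")]
tty1.press(44)   # "y"
tty1.press(30)   # "a", which completes the syllable "ya"

tty2 = channels[("alice", "tty2")]
tty2.press(21)   # "z" stays pending in this channel only

print(tty1.output, repr(tty1.buffer), repr(tty2.buffer))
```

Swapping the layout table changes layer (b) without touching the syllable parser, which is the point of keeping the layers apart. A real parser would also need to handle input that never forms a syllable, which this sketch ignores.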
Re: Japanese is propritary
Man... Are you ok? :D
This comment was the laugh of the week for me, but still, it's a bit obnoxious. :)
> Japanese does not adhere to standards,
> it is propritary,
> not using the roman alphabet, is read
> from right to left.
> uses both kana and kanji, having two
> sets of char for
> the same thing. It's fine to make your
> own language, just
> use the standards. japanese is only
> spoken in one place,
> japan, and is not the offical language
> anywhere else.
> It uses it's own propritary charset, and
> a propritary method
> of reading (right to left), so japanese
> is not open! The
> roman alphabet is an open standard.
> Japanese is as closed
> as microsoft windows.
First, I think you twist the meaning of 'open'. Anyone can get the rules to Japanese writing free of charge. Therefore, it IS open, it just doesn't conform to 'standards'. But it IS the traditional language in Japan, and there is no choice but for the software to adapt. And don't come with the 'own standard' stuff again, since the whole thing is about OPENNESS and connectivity. 'Open world' doesn't mean everyone using English. It means being able to read Japanese, Korean, Chinese, Indian, Turkish, Russian, etc. on ANY modern computer ANYWHERE around the world. :)
Second, Japanese is read from left to right. No, really. There IS a traditional way, where it is read vertically, and the columns run from right to left, but nobody will complain if software supports horizontal left-to-right only, since that IS the official way. And Kana and Kanji don't serve the 'same' purpose. Kanji make writing more compact and reading much faster (for the experienced), help the reader understand the text, and form a beautiful system which I wouldn't throw away if I were Japanese.
Third, nobody asked YOU to work on Japanese input methods and character mappings. This is a job mainly for Japanese developers, since they know their own language best. But it IS the duty of every single open source programmer to write his applications in a highly modularized and generalized way, so that they may be used with third-party locales, input methods and character sets, without changing the source!
Re: ASCII
> You were just lucky. ASCII chars
> are 0-127 and the "u umlaut" you mentioned
> is number 129 in many codepages. So, it
> happens that in several codepages it is
> ü but in some countries it can be
> anything else depending on the codepage
> used. Codepage is the way how these
> chars (128-255) are interpreted. In
> Czech Republic we have many codepages,
> among others old "Soviet" style KOI
> codepage, Kamenicky (also obsolete),
> Latin-2, ISO, Windows-1250 to name a few
> :-). So we even have czech -> czech text
> file conversion programs.
Is Czech a european language? Well, anyway, I included
support for every codepage in my kernel. Isn't there a
codepage that supports all of the western alphabet?
I think we have space in ASCII to add all of the extra
characters in european languages. And, why would ASCII be
127 chars? Those extra chars in german, french, spanish, etc.
that were not included in normal ASCII could be added to the
128-255 gap. I think they would fit. Most of the chars used in
european languages are already in ASCII, so all we have to
do is add the extra chars from european languages. That
would be possible.
P.S Did you know that the german word for cat is katze?
Re: Two steps approach
> 1. Codepages *SUCKS*
That's not valid english. It's not "Codepages sucks",
It would be "Codepages suck", and they don't.
Re: Why is this taking the form of a religious debate?
> most of the text files out
> there are compatible with ASCII. This is
> not racism or imperialism. It's
> pragmatism.
I agree with that.
Re: ASCII
% Is Czech a european language?
%
???
> codepage that supports all of the
> western alphabet?
It is hard to say what the "Western" alphabet is. The Czech alphabet is as western as English, and so are Swedish, Hungarian, Polish, French, Turkish... even Bulgarian or Greek, which have totally different letters. So it is much more than 256 chars, and that's a problem. You cannot skip any language, because there are institutions like the EU which consider all these languages equal and want to use several languages inside a single document.
Re: Why is this taking the form of a religious debate?
> I wonder if this is really an issue.
> When I process the text, what I do
> mostly is either scan the text
> sequentially, or use some search
> operation (like substring match or
> regexp match). Indices are returned as
> the result of the search operation so
> that I can extract the matched region
> from the string, but they're not
> necessary to be a character index---any
> kind of pointer does the job, and it's
> possible to create such pointer object
> that access any part of string in
> constant time.
True enough. In this case, you are right (for C). I've been spending a lot of time in higher-level languages and forgot some of my C idioms.
> > Character replacement gets a lot more tricky as well.
>
> *** edited for brevity ***
%
> But in many situations the size of replacement string
> differs from the size of original region and I end up
> reallocating whole string.
And here is the crux of the matter. You are experienced. You probably remember to do this every time. However, you do not write the vast majority of software out there. I would venture a guess that the vast majority of software is written by someone who is not as proficient as you are. There is nothing stopping a good coder from doing as you say. However, a lot of coders aren't that good and haven't yet learned good practices or, more to the point, good practices with regard to non-ASCII character encodings. Far more likely, even if someone knows better, after a many-hour coding session, programmers make stupid mistakes. When all of your unit testing is with standard ASCII strings (for example), the compiler won't catch the error. There are far more programmers out there that have written code in C for fewer than two years than programmers who have written C for more than five years. While this is true for every language, many other languages have built-in abstractions for character strings that simply don't exist for standard C.
> I don't have any concrete performance comparison
> between fixed- vs variable-length string representation.
> Excuse me if my idea is irrelevant.
Not at all irrelevant. For most of the situations you describe, you are correct in that the speed difference would be negligible (in C). I was more concerned with maintainability -- especially in situations where you are not the maintainer. But for the most part, you were more correct than I with regard to the common cases.
Re: ASCII
> Even though your point about people who
> speak some
> language like arabic, japanese, etc. and
> a roman language
> may be true, but, there are a lot of
> people who speak only
> roman-based languages. So, people who
> only speak roman
> based languages should not have to make
> their files 4 times
> bigger for the others, should they? So,
> UCF-4, etc. already
> exist. So, if you plan to speak
> non-roman languages, use it.
> I dislike non-roman languages.
A couple of things. First of all, the most spoken language is not English. It's Mandarin (Chinese). There is obviously a place for non-ASCII character encodings.
Second, text files (as mentioned in the main article) are usually in the minority for space used on personal computer systems.
Third, there is nothing stopping you from running something like gzip on the text files which will not only get rid of the multi-byte tax, but they'll end up smaller than the equivalent ASCII file.
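
A quick way to try the compression argument is the Python sketch below. The sample text is an assumption and is deliberately repetitive, so the ratio it shows is far better than real prose would give:

```python
import gzip

# Sketch: compare a text stored as raw ASCII with the same text stored as
# gzip-compressed UCS-4 (4 bytes per character).
text = "Text is the one standard that has not bloated. " * 200

ascii_bytes = text.encode("ascii")
ucs4_bytes = text.encode("utf-32-le")

ascii_gz = gzip.compress(ascii_bytes)
ucs4_gz = gzip.compress(ucs4_bytes)

print(len(ascii_bytes), len(ucs4_bytes), len(ascii_gz), len(ucs4_gz))
```

The three zero bytes that pad every ASCII character in UCS-4 are highly predictable, which is why a compressor removes most of that overhead.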
Fourth, I don't think it's really an issue for most folks when 100GB hard drives are becoming normal; that's a whole hell of a lot of text, multi-byte or not.
Fifth, it's UCS-4. Heh heh... nitpicking. ISO/IEC 10646 encoding form: Universal Character Set coded in 4 octets. UTF-8: Unicode (or UCS) Transformation Format, 8-bit encoding form.
Sixth, this discussion is from a programmer perspective and not really a user perspective. When you say that you dislike non-roman languages, I assume that you have no real experience with non-roman languages. That's fine I guess, but are you willing to state that none of the users of the programs you write like non-roman languages? This is the real issue.
Finally, if you use UTF-8 on your system, you will see no appreciable amount of wasted space over ASCII, but the potential to hold most other characters is still available. And as an added bonus, all of your standard ASCII documents are still valid. There is very little excuse to only support western characters when such an easy alternative is available.
Re: Two steps approach
OK, let's play the game...
> > 1. Codepages *SUCKS*
>
> That's not valid english. It's not
> "Codepages sucks",
That's not valid English either. a) You should use a capital letter, and b) a sentence ends with a period, not a comma.
> It would be "Codepages suck",
> and they don't.
They do suck, as well as you. Now go home to mom and let us do something more productive than replying to this ridiculous shit you're typing.
(Yeah, I know, I shouldn't feed the trolls. I just couldn't bear it, sorry everyone.)
ASCII
Come on, why don't you admit that ASCII has its
uses? I know we need unicode to allow people who
speak non-roman languages to speak in those
languages, but ASCII is good for the people who
don't speak those languages. I will admit ASCII
is not all we need, but I do think it has uses.
Like, if you use english for your documents, ASCII
contains the characters you need. But it also
includes german characters. I saw a web page with
german characters in lynx. I pasted the characters
into vi. It worked. All the chars were still
there. All of them. If german characters appear in
lynx, wouldn't that prove that german characters
were supported by ASCII? Not only that, but I
saved that document. Reopened it. All the german
chars were still there. If you make a new
standard, that's fine. Just make sure you keep
it compatible with text mode. So that people can
still use lynx.
Re: ASCII
locale -k charmap
I pretty much doubt the output is charmap="ANSI_X3.4-1968" (that would be pure ASCII).
Re: ASCII
>
> locale -k charmap
>
> I pretty much doubt the output is
> charmap="ANSI_X3.4-1968" (that
> would be pure ASCII).
>
>
The output was:
charmap="ANSI_X3.4-1968"
The two german characters were:
Wait a minute! When I typed out
the cat command on the console to open the file that had
the german chars, they displayed on the console, but when I
tried to paste them into this message, it only showed up as
two question marks! But, when I pasted them into another
xterm running vi, the chars appeared. And, when I opened
the file using Nedit, the chars appeared, and I was able
to paste them into this message, but when I tried to paste them
from Nedit to the Xterm, they appeared as two question
marks! It seems that the german chars can only be pasted
from one Xterm to another, or from one X application to
another. But, the german chars showed up in the file and lynx
outside X. What happened?
Re: ASCII
> issue for most folks with 100GB hard drives
My computer doesn't support 100GB hard disks,
I only have a 20GB hard drive. Am I supposed
to pay $6000 for a computer with a 100GB hard
drive?
Re: ASCII
> The output was:
> charmap="ANSI_X3.4-1968"
You're running in the C locale. It only worked because the characters weren't checked for correctness and sent directly to the console, which probably understands ISO 8859-1 by default.
> But, the German chars showed up
> in the file and lynx outside X. What happened?
That's exactly the kind of problem we're trying to fix -- I don't know what happened. This whole charset nonsense is way too complicated. LC_ALL=en_US might help you, dunno.
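To make the point about the C locale concrete, here is a minimal Python sketch. It assumes the two German characters were something like "ä" and "ß" (the originals were lost in the post) and that the console interprets bytes as ISO 8859-1. It only shows why the bytes are not ASCII even though they display fine.

```python
# Assumption: the two lost characters were similar to these.
text = "ä ß"

# ISO 8859-1 stores each of them in a single byte above 0x7F.
latin1_bytes = text.encode("iso-8859-1")
print(latin1_bytes)

# Pure ASCII only defines bytes 0x00-0x7F, so a strict ASCII check rejects them.
try:
    latin1_bytes.decode("ascii")
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False
print("valid ASCII:", is_ascii)

# A console that assumes ISO 8859-1 renders the very same bytes correctly,
# which is why the characters appeared even though the charmap said ASCII.
print(latin1_bytes.decode("iso-8859-1"))
```

A program that passes bytes through unchecked never notices the difference. Only something that validates or converts the text (such as a paste into another application) trips over it.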
Re: ASCII
>
>
> % The output was:
> % charmap="ANSI_X3.4-1968"
>
>
> You're running in the C locale. It only
> worked because the characters weren't
> checked for correctness and sent
> directly to the console, which probably
> understands ISO 8859-1 by default.
>
>
> % But, the German chars showed up
> % in the file and lynx outside X. What
> % happened?
>
>
> That's exactly the kind of problem we're
> trying to fix -- I don't know what
> happened. This whole charset nonsense is
> way too complicated. LC_ALL=en_US might
> help you, dunno.
>
>
I'm sorry, I REALLY thought German characters were
supported by ASCII. What is ISO-8859-1? Is it a subset
of ASCII that supports German chars? But I was sure
ASCII supported German chars; I saw German chars
on the console. So, the console can
display chars outside ASCII? Well, I thought the console
could display only chars in ASCII. My console can display
chars from all European languages, but not from any
non-Roman languages. So, we don't need ASCII to keep
console mode? Since consoles can't display any Japanese
chars, what should we do? Get rid of the console and only
use X? But you can't get into X without the console. That
means we wouldn't be able to use any OS based on a
console anymore. Does this mean Linux is dead? Are we
doomed to use Windows XP? If you include Japanese chars
in a new standard, the console is dead, and Linux is dead.
But if you exclude non-European characters from a new
standard, even if it were 8 bytes per char, Linux & the console
will still work. In other words, we need a new standard, but
we need to keep it compatible with the existing consoles.
Re: ASCII
> supported by ASCII. What is ISO-8859-1?
> Is it a subset of ASCII that supports german chars?
It's the other way around: ASCII is a subset of ISO 8859-1, which is the most commonly used 8-bit extension of ASCII in Europe. But don't think it covers all European languages.
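A small Python sketch of both halves of that statement. The sample characters are my own picks for illustration, not taken from the thread.

```python
# Every ASCII byte (0x00-0x7F) decodes to the same character in ISO 8859-1,
# which is what "ASCII is a subset of ISO 8859-1" means in practice.
ascii_bytes = bytes(range(128))
same = ascii_bytes.decode("ascii") == ascii_bytes.decode("iso-8859-1")
print("ASCII range identical:", same)

# German umlauts fit into ISO 8859-1 as single bytes ...
umlaut = "ü".encode("iso-8859-1")
print(umlaut)

# ... but other European letters, e.g. "č", are not in the set at all.
try:
    "č".encode("iso-8859-1")
    covered = True
except UnicodeEncodeError:
    covered = False
print("č covered by ISO 8859-1:", covered)
```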
> In other words, we need a new standard, but
> we need to keep it compatible with the
> existing consoles.
Believe it or not, I'm running UTF-8 in my console:
locale -k charmap
charmap="UTF-8"
Re: ASCII
> Believe it or not, I'm running UTF-8 in
> my console:
>
> locale -k charmap
> charmap="UTF-8"
>
>
UTF-8 is a good idea! But how did you change your
console to use UTF-8?
Re: Japanese is proprietary
> Man... Are you ok? :D
> This comment was the laugh of the week
> for me, but still, it's a bit obnoxious.
> :)
>
>
> % Japanese does not adhere to standards,
> % it is proprietary,
> % not using the Roman alphabet, is read
> % from right to left.
> % uses both kana and kanji, having two
> % sets of char for
> % the same thing. It's fine to make your
> % own language, just
> % use the standards. Japanese is only
> % spoken in one place,
> % Japan, and is not the official language
> % anywhere else.
> % It uses its own proprietary charset, and
> % a proprietary method
> % of reading (right to left), so Japanese
> % is not open! The
> % Roman alphabet is an open standard.
> % Japanese is as closed
> % as Microsoft Windows.
>
>
> First, I think you twist the meaning of
> 'open'. Anyone can get the rules to
> japanese writing free of charge.
> Therefore, it IS open, just doesn't
> conform to 'standards'. But it IS the
> traditional language in Japan, and there
> is no choice but for the software to
> adapt. And don't come with the 'own
> standard' stuff again, since the whole
> thing is about OPENNESS and
> connectivity. 'Open world' doesn't mean
> everyone using English. It means being
> able to read Japanese, Korean, Chinese,
> Indian, Turkish, Russian, etc. on ANY
> modern computer ANYWHERE around the
> world.:)
>
> Second, Japanese is read from the left
> to the right. No, really. There IS a
> traditional way, where it is read
> vertically, and the columns come from
> right to left, but nobody will complain
> if software supports horizontal
> left-to-right only, since that IS the
> official way. And Kana and Kanji don't
> serve the 'same' purpose. The Kanji make
> writing more compact and reading much
> faster (for the experienced), help the
> reader understand the text, and are also
> a beautiful system which I wouldn't
> throw away if I were Japanese.
>
> Third, nobody asked YOU to work on
> Japanese input methods and character
> mappings. This is a job mainly for
> Japanese developers, since they know
> their own language best. But it IS the
> duty of every single open source
> programmer to write his applications in
> a highly modularized and generalized
> way, so that it may be used with third
> party locales, input methods and
> character sets, without changing the
> source!
Well, I don't think Japanese is a good language
for computing. It was created 3000 years ago,
and is not suitable for computing. The concept
doesn't fit into Japanese. It has 2000 chars,
and most Japanese people can't read or write
until they reach 13. Japanese is not suited for
computing. Japanese was more for odd Japanese
rituals, and that's what their language is built
for. The way Japanese was built is very silly
for computing. It was really built for a bunch of
people running around with dragon costumes.
Also, I didn't think Japanese people used English
on their computers, I thought they used 2000 char
keyboards. A language with 2000 chars, mostly
based on paper, is very hard to put in computers.
Japanese is more a historical, ritualistic,
poetic language, than a logical, ordered,
modern-computer language. Japanese is not a good
language. Use HTML, C++, English, German, French,
Perl, Java, etc. Japanese just won't work well in
computers. I am not defending ASCII anymore, but
I don't think Japanese is a computerish language.
I do think Japanese is good for expressing
feelings, and I do think it might be good for
poetic expression, but not a computer language.
It's more of a people-oriented language. Not good
for computing and logical thinking.
Re: ASCII
www.tldp.org/HOWTO/Uni...
Be sure to apply the bash patch, it's a nightmare otherwise. Here's one for bash-2.0.5: www.li18nux.org/subgro...
Re: ASCII
Where do you live that a new computer costs $6000? A new motherboard here costs anywhere from $80-$200. A whole new (very fast) computer can be purchased for less than $1000.
But you're right. A lot of people still have 20GB. That's only about 20 billion* characters of text (ASCII or UTF-8 in a western country) give or take a couple of billion for program data. Now let's move to UCS-4: still 5 billion (give or take) characters. Now let's compress that with something like gzip: getting closer to 150 billion characters.
I fail to see your point.
* U.S. billion -- thousand million in some other locales.
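The back-of-the-envelope numbers above can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes a decimal 20GB (20 * 10^9 bytes), ignores filesystem overhead, and leaves the gzip estimate out because compression ratios depend entirely on the text.

```python
DISK_BYTES = 20 * 10**9  # a 20GB drive, counted in decimal gigabytes

# Bytes per character: 1 for ASCII (and for UTF-8 text that stays in the
# ASCII range), 4 for UCS-4.
BYTES_PER_CHAR = {"ASCII": 1, "UCS-4": 4}

capacity = {name: DISK_BYTES // size for name, size in BYTES_PER_CHAR.items()}
for name, chars in capacity.items():
    print(f"{name}: about {chars / 10**9:.0f} billion characters")

# The same ratio shows up on a real string (UTF-32 is used here as the
# fixed-width 4-byte encoding; the "-be" variant avoids a byte order mark).
sample = "plain western text"
utf8_size = len(sample.encode("utf-8"))
ucs4_size = len(sample.encode("utf-32-be"))
print(utf8_size, ucs4_size)
```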
Re: ASCII
> Where do you live that a new computer
> costs $6000? A new motherboard here
> costs anywhere from $80-$200. A whole
> new (very fast) computer can be
> purchased for less than $1000.
> But you're right. A lot of people still
> have 20GB. That's only about 20
> billion* characters of text (ASCII or
> UTF-8 in a western country) give or take
> a couple of billion for program data.
> Now let's move to UCS-4: still 5 billion
> (give or take) characters. Now let's
> compress that with something like gzip:
> getting closer to 150 billion
> characters.
> I fail to see your point.
> * U.S. billion -- thousand million in
> some other locales.
>
I didn't mean 20GB wasn't enough for UTF-8. I've started
using UTF-8 now. UTF-8 is a good idea, because all ASCII
chars will still be ASCII, but other chars would be encoded as
2-4 bytes. Good deal. But, you shouldn't assume everyone
has 100GB hard drives. I only got a 20GB drive 1 year ago.
Before that, I only had a 3GB drive. The only reason I got a
20GB hard drive was because the 3GB one wore out. My old
486DX still uses a 1GB hard drive. As for your question about
where I live: Dallas, Texas, USA
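For anyone who wants to see the "ASCII stays ASCII, everything else takes 2-4 bytes" property for themselves, a short Python sketch follows. The sample characters are arbitrary picks, not something from this thread.

```python
# One character from each UTF-8 length class.
samples = ["A", "ü", "€", "日", "\U0001D11E"]  # last one is a musical G clef

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} takes {len(encoded)} byte(s) in UTF-8")

lengths = [len(ch.encode("utf-8")) for ch in samples]

# Text that only uses ASCII characters is byte-for-byte identical in UTF-8.
ascii_compatible = "lynx".encode("utf-8") == "lynx".encode("ascii")
print("ASCII text unchanged:", ascii_compatible)
```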
Re: ASCII
>
> www.tldp.org/HOWTO/Uni...
>
> Be sure to apply the bash patch, it's a
> nightmare otherwise. Here's one for
> bash-2.0.5:
> www.li18nux.org/subgro...
>
>
Thank you for your help.
UTF-8 is the way to solve all the problems.
So, for anybody who wants "true internationalization", UTF-8 is
the way to go.
Re: Japanese is proprietary
Japanese Kanji (if memory serves) is based on the traditional Chinese characters, much like many other East Asian pictographs. However, I believe you have a fundamental misunderstanding of the use of extended alphabets (for lack of a better term).
In China, there are many different dialects. Many of these dialects are different enough from each other as to make verbal communication difficult. This is natural. Anytime you have a large number of people spread out over a wide area, speech and customs diverge. That said, for all intents and purposes, any literate person in China can read what any other literate person in China has written. This is possible because, while the Western alphabet is based upon phonetics, which vary as a language evolves, pictographs are not dependent upon how they are spoken. In fact, even though a great deal of time has passed, many Japanese citizens can get by reading Chinese glyphs even though the languages are noticeably different at this point.
On a side note, I really would love to know where you get the notion that Japanese children are effectively illiterate until they reach the age of thirteen. Fascinating theory.
More people-oriented? Not computer-oriented? Who exactly do you think uses computers? Computers? The computers don't care. The computers just want opcodes and data streams. Correction: the computers don't want anything. Without computers, people still have a reason to be. Without people, computers cease to have any reason to exist. Most text isn't source code. Most text on a computer system is personal data or localized messages for the reading pleasure of people. For millions of people on the planet, that personal data and those localized messages are in Japanese. This is the information that people want to retrieve from lynx. This is the information that people want to get from their computer no matter where they are. This information is people-oriented. And this information is loaded and saved from programs that must understand other character sets and must be displayed on computer screens (even in text mode).
No one in this discussion -- absolutely no one -- has stated that they believe that we should rewrite C so that all of the if-else/while/for is translated into a different language. I know of no person who honestly thinks that's a worthwhile goal.
We have not been talking about switching source code and programming language syntax so that it reads like Japanese or Korean. We have been talking about writing that source code so that Japanese data files (and other languages on the planet) can be efficiently and transparently processed and displayed.
Please go back and read the last paragraph over and over again until it finally sinks in. I am happy that you have accepted that UTF-8 is a valid substitute on most users' systems. That removes any disc storage issues that you have previously brought forth. What we are talking about is making sure that programs can read those UTF-8 (or UCS-4 depending on the program's goals) files. No one is trying to make you learn how to read Chinese or Japanese or Thai. By all means, please ignore these languages. Let someone else deal with bi-directional text display issues. Just use UTF-8 much as you previously used ASCII, and live a long and prosperous life with your western alphabet. Now everyone is happy.
Re: Japanese is proprietary
What I'm saying is that Japanese is a very picture-based
language. It's really more based on ancient customs. It's not
computerish. Too much human silliness in this language. Over
2000 chars in this language, and a new one is added
every time a new concept is created. It's not logical and ordered like Western languages. Their language is very stuck in the past. Western languages are better. It's good for keeping the ancient history of that nation, but most Asian lands are not very geared toward advanced ways. Most of them were content to do things the same way they did in the 8th century. By the way, do you speak Japanese natively?
Re: Japanese is proprietary
Japanese is not a completely picture-based language; it is possible for someone to write Japanese with only the base set of around 50 characters. In fact, it is possible to write Japanese using the Roman alphabet, which is why the vast majority of Japanese people (Asian people, for that matter) use QWERTY keyboards with computers. Also, coders all across the world code in English and will continue to code in English.
However, if "true internationalization" is realized, developers across the globe will be able to read and modify each other's code without worrying about character corruption (though they may not be able to understand the strings that are in a different language). Coding languages will not be changed; people who only speak or write English will continue to code in English. In fact, from their perspective nothing will have changed.
Japanese is as esoteric and illogical as English. As people who speak English as a native language (as I do) or have studied it know, English grammar invariably contains exceptions to every rule, something that can be attributed to its rich history and background.
Spoken Japanese is the same as English: Japanese children learn to speak the language quickly enough and, if driven to do so, can learn all the intricacies of the spoken language. The same can be said about the written language: at a very young age, Japanese children have mastered the fundamentals of the written language. As they progress through the educational system they learn new characters, very similar to the process that native English speakers experience as they learn the meanings and spellings of new words. The difference is that in English it is not always easy to learn how words are spelled from how they are pronounced, but by knowing the meaning, one can guess the spelling of a word by knowing its prefix, suffix, etc. By hearing a word, a Japanese person immediately knows how the word is spelled and written in the fundamental characters. What the Japanese student learns when learning new characters is akin to learning a new prefix or suffix. As the student picks up these new characters, he or she will learn to write sentences or phrases in a more efficient way.
These kanji (Chinese) characters have been used for a long time and continue to be used natively by close to two billion people in countries such as the People's Republic of China, Taiwan, North and South Korea, Singapore and Japan. Yes, every student will at times find learning their own native language difficult, but I believe this is something that you can relate to, as can be seen in your frequent mistakes in grammar and spelling (though they could just be consistent typos). As a native speaker of English and Japanese, and having also learned to speak Chinese and Spanish, I have had my struggles with languages, but have found that each language can be found equally logical and brilliant. In fact, as my Korean friends have pointed out and have impressed upon me, Korean might have the most logical and easy-to-learn written language.
However, at the end of the day, the beauty of internationalization does not lie in the fact that you, as a native speaker of English, can continue to use English and only English in your coding and other uses. It lies in the fact that any language in the world can be written and displayed anywhere.
Re: Japanese is proprietary
I believe you're mistaken on several major issues.
What the hell do you mean by 'computerish'? Japanese in itself is a crystal clear language with a simple grammar and almost NO exceptions. It's much more of a 'computerish' language than English in the sense of making Japanese-speaking artificial intelligences.
Also, it is a modern language. Saying that it hasn't changed for 3000 years is sheer idiocy, since Japan - and the Japanese language - didn't exist in today's sense in that age. It went through many mutations and reforms, grammar and writing system alike (the latter, in turn, was adopted from the Chinese, who DID create it several millennia ago - but even they kept on continuously refining it). The last big reform was after World War II.
To dissolve another huge misconception, new characters are NOT created for new ideas. In China perhaps, but the Japanese writing system is as static as the roman alphabet. Names for new ideas are often taken from English and written in katakana, or in the case of more general things, expressed with word joints.
And even so, a language being complicated and artistic shouldn't be an argument for excluding it from computing standards. As it was stated, computers are for people to use. They CAN take the 2000 characters, even more. They just don't mind. The nasty part of programming the support will be done by people who are motivated by the love of their own language. You only have to ACCEPT the demand and write GENERAL code. :)
What happened to rxvt?
If you go to rxvt.org, you notice that their web site hasn't been
updated since 2000, and they say the last release was 2.7.3.
But if you go to ftp.rxvt.org, you see rxvt 2.7.8 was released in
2001. And if you go to their CVS on SourceForge, there were
updates until 2 months ago. That's weird. There was no
indication on their web site that the project was dead, but at
first I thought: maybe the project died without notice. But when
I saw the FTP site had been updated until 2001, that no longer
made sense. What happened to rxvt?
Re: Japanese is proprietary
> I believe you're mistaken on several
> major issues.
> What the hell do you mean by
> 'computerish'? Japanese in itself is a
> crystal clear language with a simple
> grammar and almost NO exceptions. It's
> much more of a 'computerish' language
> than English in the sense of making
> Japanese-speaking artificial
> intelligences.
> Also, it is a modern language. Saying
> that it hasn't changed for 3000 years is
> sheer idiocy, since Japan - and the
> Japanese language - didn't exist in
> today's sense in that age. It went
> through many mutations and reforms,
> grammar and writing system alike (which,
> in turn was adopted from the chinese,
> who DID create it several millennia ago
> - but even they kept on continuously
> refining it). The last big reform was
> after World War II.
> To dissolve another huge misconception,
> new characters are NOT created for new
> ideas. In China perhaps, but the
> Japanese writing system is as static as
> the roman alphabet. Names for new ideas
> are often taken from English and written
> in katakana, or in the case of more
> general things, expressed with word
> joints.
>
> And even so, a language being
> complicated and artistic shouldn't be an
> argument for excluding it from computing
> standards. As it was stated, computers
> are for people to use. They CAN take the
> 2000 characters, even more. They just
> don't mind. The nasty part of
> programming the support will be done by
> people who are motivated by the love of
> their own language. You only have to
> ACCEPT the demand and write GENERAL
> code. :)
You're just saying that because you speak Japanese.
The characters are way too complex. It's too dependent on
being able to draw the characters by hand, so you can't
use Japanese with a static charset. And you said that there
were differences between Japanese and Chinese. You only
know that because you speak Japanese. You know that
language is very weird and not very ordered. You know
Japanese is an obsolete language. You just don't want to
switch to a Roman-based language, and it's understandable
that you would want to hang on to your native language,
even though it's not effective for computing.
Re: Your title, "True Internationalization" holds the key to the requirements.
> Yeah, and as soon as we try to combine
> them we get some of the existing CJK
> Unicode problems. The only Right Way to
> do it is delineate by language, not by
> typography.
Let me guess; your native language is either C, J or K? There are two main problems with your solution. First, there are about 5,000 languages by some counts, and boundaries are very ill-defined in some cases; worse yet, computers have to handle historical texts, which add a whole new dimension to the problem.
Secondly, while Chinese and Japanese speakers may believe their character sets are entirely disjoint, Europeans usually perceive the Latin character set as one connected whole. "valet" comes out of my keyboard just fine, and few would argue that it needs to be stored or displayed differently if it's French or English. Likewise, saying that "Ulrich Drepper, Robert M&#00fc;ller and Ri&#010d;ard &#010c;epas worked on a project" (freshmeat won't let me include the actual characters) comes naturally. Note that the only mono-lingual European character sets are ISO646-*, which had only a few characters to work with. Once eight-bit sets became common, all major European character sets covered multiple languages.
Re: 16 bits is enough
> It's really a myth that 16 bits isn't
> enough. People make statements about how
> there's some huge number of Chinese
> characters, too many for 16 bits. [...]
It's really a moot point. There's no competing 16 bit standard. To support Cantonese, Hong Kong Chinese and modern musical notations (all important things), one must support full 32-bit Unicode. (For Cantonese, I'm told one of the characters is the equivalent of the English -ing suffix, necessary for almost any written Cantonese.)
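A minimal Python sketch of what "beyond 16 bits" looks like in practice. The musical G clef (U+1D11E) is my own choice of example character, picked because musical notation is mentioned above. The point is only that a single 16-bit unit cannot hold it.

```python
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the 16-bit range

code_point = ord(clef)
print(hex(code_point))
print("fits in 16 bits:", code_point <= 0xFFFF)

# UTF-16 has to split it into two 16-bit units (a surrogate pair) ...
utf16 = clef.encode("utf-16-be")
print(len(utf16), "bytes in UTF-16")

# ... while a fixed-width 32-bit encoding stores it in one unit.
utf32 = clef.encode("utf-32-be")
print(len(utf32), "bytes in UTF-32")
```

Code that assumes "one 16-bit unit is one character" will count, cut or delete such characters incorrectly.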
Re: It's not that bad, actually
> Why do you think it's wrong? Be
> realistic. If someone sends a code
> snippet to a mailing list, asking for
> help, (s)he should better write it in
> English or (s)he will be ignored.
>
Wouldn't it matter which mailing list? I doubt debian-devel-french would have any problem with a code snippet in French.
Re: You are wrong!
> And, arabic
> uses 500+ chars as
> well,
Nope. Arabic uses about the same number of characters as English does.
> Well, they could, by stealing the
> word from our language
English may have given Arabic "hard drive", but it got "algebra" and "algorithm". I think English got a better deal.
Honestly, take a look at the history of English and English words some time. The majority of English's vocabulary came from French, Latin or Greek, with borrowings from just about every other language in the world.
> these
> languages are
> grossly inefficient.
I take it you speak these languages? Because it seems pretty arrogant to condemn them without knowing them that well. Also, what's wrong with Loglan? I assume since you're pushing English as a logical language, you've analyzed Loglan and found it wanting. Right?
Re: Japanese is proprietary
> Without computers, people
> still have a reason to be.
Where did you get such a silly idea? Where have you been for
the past 5 years? Have you any idea how silly that sounds to
computer people?
please, jeff
please, wipe out the whole comments board, ban access to fm-comments to this idiot and let's start the discussion again.
it's causing me a headache reading these outputs of some brain-dead child...
thanks...
Robert
PS: read it through and you'll soon find out who I am talking about
Re: please, jeff
Trolls only stick around as long as they're fed.
Re: Japanese is proprietary
>
> You're just saying that because you
> speak Japanese.
> The characters are way too complex. It's
> too dependent on
> being able to draw the characters by
> hand, so you can't
> use Japanese with a static charset. And
> you said that there
> were differences between Japanese and
> Chinese. You only
> know that because you speak Japanese.
> You know that
> language is very weird and not very
> ordered. You know
> Japanese is an obsolete language. You
> just don't want to
> switch to a Roman-based language, and
> it's understandable
> that you would want to hang on to your
> native language,
> even though it's not effective for
> computing.
>
Well, for one thing, I am not a native Japanese speaker, I am Hungarian. Our language does resemble Japanese a bit in terms of grammar, but uses the Roman alphabet (the latin-2 charset), and has a LOT more exceptions, 'stupid' and 'obsolete' rules than Japanese. I also speak English, and know that it ALSO has more of the aforementioned problems than Japanese. :)
Knowing that there is a difference between Japanese and Chinese is not a secret, actually it is common knowledge on MY planet, totally obvious to any grade schooler... (And we're not even in Asia...) But I admit, whether this is different in the USA, I know not... :D
Japanese CAN be used with a static charset, I have one here on my Linux system... The problem is NOT getting the computer to display Japanese. It already can. The problem is bringing this whole charset-madhouse to an end, and forget all the nightmare incidents of Kanji being displayed as rubbish, 'o doubleacute' as 'o tilde', and such... :)
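The 'o doubleacute' shown as 'o tilde' incident is easy to reproduce. Here is a minimal Python sketch, assuming the text was written in ISO 8859-2 (latin-2) and read by something that assumes ISO 8859-1 (latin-1):

```python
original = "ő"  # o with double acute, as used in Hungarian

# Written out in latin-2, the letter is a single byte.
raw = original.encode("iso-8859-2")
print(raw)

# A reader that assumes latin-1 maps the same byte to a different letter.
misread = raw.decode("iso-8859-1")
print(misread)

# With UTF-8 on both sides there is no charset to guess, so nothing is lost.
roundtrip = original.encode("utf-8").decode("utf-8")
print(roundtrip == original)
```

The byte itself never changes. Only the table used to interpret it does, and a plain 8-bit file carries no hint about which table was meant.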
The language is not weird, and IS ordered. I think, speaking both Japanese and English (and German and Hungarian as well, the latter as my native tongue), I can say this with certainty - while you, speaking apparently English only, have no basis to fight my arguments.
It's like starting to argue with an Australian on whether a Kangaroo has six legs or four... :D
Still, the Roman alphabet does seem to be an optimal choice for inputting text into a computer. Programming languages, scripts, control codes should use that and no other. BUT being able to use different, more complicated charsets for personal text editing, desktop publishing, and such is a basic, reasonable and undeniable demand.
It is not the user who should be subservient to the developer, but the developer who should work to create functions the user wants. :)
Many users want to use different charsets and different languages on ONE computer, without rebooting, restarting programs, or fiddling with environment variables.
Reasonable enough. It's time to realize it. You don't HAVE TO, but if you write self-contained, inflexible code, please include a warning text: 'WARNING! Due to obsolete coding methodology, interoperability with third party software is uncertain.'
Talk about 'proprietary'... heh...
Re: ASCII
> A couple of things. First of all, the
> most spoken language is not English.
> It's Mandarin (Chinese). There is
> obviously a place for non-ASCII
> character encodings.
Mandarin Chinese may have the largest number of persons who speak it as their first language; however, it is debatable whether Chinese is the "most used" language in the world.
See
As a reasonably well-traveled individual, I never cease to be amazed at the number of countries I have visited where English (many times with an accent) is available to the locals - and that applies to the Middle East, South Asia, Japan, Korea, Viet Nam, the Philippines, Holland, Scandinavia, parts of Africa, and other places. And in none of these locations would English be classified as the "native" tongue. Yet in some, it is preferred to the local language(s) or dialect(s) because of its universality.
In the computer software discipline, the employment of English (or a language with primitive elements derived from English/Romance words, such as Fortran, Pascal, Ada, C, C++, and so on) is pronounced. I know of no computer programming language wherein the base language (spoken or written) is Mandarin Chinese. Indeed, I believe that the Chinese ideographic language forms would prove difficult to use as the basic "alphabet" or symbology of a computer programming language.
My experience is that even the Japanese, who type into word processors and personal computers, use an anglicization/romanization at the keyboard called "romaji" to enter the sounds of the Japanese language, which are then converted to Katakana/Hiragana, and/or Kanji, for output to the display. I have done it while in Japan, but it isn't easy for one whose Japanese is limited. However, it is second nature to PC-aware Japanese (and they are becoming intensely PC-aware).
Although I have long since forgotten the details, I remember reading perhaps 40 years ago about an English person who invented a romanization system for the Chinese a long time ago, which sounded similar to the Japanese method of getting from Romaji to Kanji. IIRC, he did it to help little children learn sounds of the language, and enable them to bridge to the concepts in the ideograms. BTW, the Chinese ideograms, while extremely similar to the ones used in Japan and possibly similar in meaning, rarely have even remotely similar pronunciation. So, a Japanese person can probably get the gist of a written Chinese ideogram, but would have to know (one of) the Chinese dialects to be able to verbalize the ideogram.
I guess the point of this is that even in the Orient, where the ideogram reigns supreme with some of the literati and illuminati, they are forced to revert to the primitive 26-character English alphabet to enter computer input. So, when you think about i18n, I suggest thinking about the input issues as well as the output (display) issues, and I suggest that there are basic communication issues to be addressed - not just between peoples, but also between the human and the machine. For example, in Japan, romaji is universally taught, as is English as a second language, and this has been so for a great many decades. It is necessary, since romaji is taken as a given.
Re: ASCII - WOOPS my ref. dropped out...
> See
iteslj.org/Articles/Ki...
Re: ASCII
> Mandarin Chinese may have the largest
> number of persons who speak it as their
> first language, however, It is
> debatable as to whether Chinese is the
> "most used" language in the world.
On the contrary, I do believe that it is most used. English, however, is most likely the most widely known and understood. English is by far the most widely used for commerce, but other than that, more often than not, people in different countries converse in their native language (they *use* a non-English language).
But this is missing my point. For some curious reason, some folks have come to believe that I think that all English-based programming syntax (if/while/for) should be changed to Chinese pictographs. For the last time, hear this: I did not say this! I did not imply this. I implied that there is a sufficiently large number of people to warrant the processing and display of non-romanized text.
I am aware of the use of romaji for input purposes in Japan. I am also aware that many of those uses of romaji also include a kanji/hiragana/katakana translation step. In other words, after the romaji is input, a list of applicable pictograph substitutes is presented for selection. This is not every case, but it is a very common case in my experience. To put it into context, it would be the equivalent of an English speaker always writing 'to' in their writings whenever 'to', 'two', or 'too' was intended. Sure the reader could figure it out, and they all sound the same, but wouldn't it be better in many cases to take the time and select the correct one? "I have to presents to give to my to brothers to." See what I mean?
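The two-step input described here (romaji to kana, then a candidate list to pick from) can be sketched in a few lines of Python. This is a toy under loud assumptions: the lookup tables are hypothetical and tiny, and the function names are made up for illustration. Real input methods use large dictionaries and grammar-aware ranking.

```python
# Hypothetical, deliberately tiny tables for illustration only.
ROMAJI_TO_KANA = {"ha": "は", "shi": "し", "ka": "か", "mi": "み"}
CANDIDATES = {
    "はし": ["橋", "箸", "端"],  # bridge, chopsticks, edge
    "かみ": ["紙", "神", "髪"],  # paper, god, hair
}


def to_kana(romaji: str) -> str:
    """Convert romaji to hiragana by greedy longest-match lookup."""
    out = []
    i = 0
    while i < len(romaji):
        for size in (3, 2, 1):
            chunk = romaji[i:i + size]
            if chunk in ROMAJI_TO_KANA:
                out.append(ROMAJI_TO_KANA[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no kana for input at position {i}: {romaji[i:]!r}")
    return "".join(out)


def candidates(romaji: str) -> list[str]:
    """Return the kana reading followed by the kanji the user could choose."""
    kana = to_kana(romaji)
    return [kana] + CANDIDATES.get(kana, [])


print(candidates("hashi"))
print(candidates("kami"))
```

Just like 'to', 'two' and 'too', the readings are identical, and only the selection step tells the reader which word was meant.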
A keyboard does not have to be pictograph-based in order for a computer to handle pictograph data. There are also advances in handwriting technology. Right now, it is not uncommon for reporters in Japan to hand-write their notes and transcribe them later instead of using a laptop and romaji translation because the laptop is slower for them. Handwriting recognition would remove this barrier. Of course, it would require that an i18n capable OS and editor is available -- hence the point of this discussion.
Note: the point of this discussion is not that we should scrap all keyboards in current use either.
And yes, I am aware that while the characters of China and Japan are very similar, their speech, pronunciation, and cultures are very distinct. In fact, other posts of mine in this discussion have pointed out this fact; however, as not all computers have text-to-speech engines and most are accessed visually through a screen, the importance of being able to display and input the characters is still much more relevant.
Yes, East Asia uses the romanized alphabet extensively. Any visit to East Asia, however, will demonstrate very quickly that it is not used to the exclusion of pictographs. There is demand out there to handle both.
Re: What about typesetting?
Why not let all this internationalization be the burden of typesetting and wordprocessing programs? The ``source'' can be stored in ASCII without much trouble ... I did this in LaTeX with English, Romanian and Polytonic Greek in one document ...
Then think of the costs of internationalization, the trouble of standardization and coping with programmers' hubris ...
Emil
P.S.: I have no claim to correct English in my reply
Re: ASCII
> % issue for most folks with 100GB hard drives
>
> My computer doesn't support 100GB hard disks, I only have a 20GB hard drive. Am I supposed to pay $6000 for a computer with a 100GB hard drive?
My computers are mostly ancient P120's and several of them have drives larger than 100 gigs. Most computers can support these sizes of drives if you upgrade their BIOS. In cases where they still have problems, the hard drive companies typically have a utility that runs before your OS and enables support for the large drives. I've yet to find a Pentium or newer computer that couldn't support any size drive I slapped into it.
Re: ASCII
Don't divide the world like Hitler did. ASCII is for nazi-thinking people.
I support UTF-8 - that's the future for a world where half of the people speak languages which are not compatible with ASCII.
There is still another and much bigger problem yet to solve: timezones.
By the way, does anyone know anything like UTF-8 but targeting the chaos in timezones?
Re: Why is this taking the form of a religious debate?
Absolutely agree. I don't understand why people think that indexing the Nth character needs to be fast.
In any text processing I can think of, N must be calculated first by scanning all the characters before it. It is trivial to replace N by the byte count and continue with the algorithm as before. So I see no savings there.
Also all modern text processing thinks about "words" and these are variable-length. It makes no difference if the characters inside them are variable length as long as it is easy to detect the word boundaries.
I recommend, with NO EXCEPTIONS, that UTF-8 be used for every single interface in a system where text is passed. There should be no "ASCII" interface, and certainly there should be no "wide character" interface. I don't think any programs will have to store or manipulate text in any form other than UTF-8.
A huge win with going UTF-8 only is that it eliminates the need for multiple interfaces. strlen() and so on are unchanged, except they are defined to return the number of bytes in the string.
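To make the byte-count argument concrete, here is a minimal sketch in Python (not from the original post; the helper name byte_offset_of_char is made up for illustration). It shows that the character count and the byte count of a UTF-8 string differ, and that finding the Nth character is a single forward scan that yields a byte offset, which can then be used in place of N.

```python
def byte_offset_of_char(data: bytes, n: int) -> int:
    """Return the byte offset at which the n-th character (0-based) starts.

    Hypothetical helper for illustration; assumes `data` is valid UTF-8.
    Passing n equal to the number of characters returns len(data).
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes have the bit pattern 10xxxxxx;
        # any other byte starts a new character.
        if byte & 0xC0 != 0x80:
            if seen == n:
                return offset
            seen += 1
    if seen == n:
        return len(data)
    raise IndexError("character index out of range")


text = "na\u00efve 日本語 text"   # 14 characters
data = text.encode("utf-8")       # 21 bytes: what a byte-counting strlen() reports

start = byte_offset_of_char(data, 6)   # where the first kanji starts
end = byte_offset_of_char(data, 9)     # one past the last kanji
print(len(text), len(data), start, end)
print(data[start:end].decode("utf-8"))
```

Once the offsets are known, slicing and copying work on bytes exactly as they would for ASCII; only the scan itself needs to know about the encoding.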
Re: Why is this taking the form of a religious debate?
I recommend UTF-8 ONLY for precisely the reasons you say require wide characters. Maintainability and testing.
If there are two interfaces for "ASCII" and "Wide characters" then the typical programmer is only going to test the "ASCII" interface and there are going to be bugs when an i18n user tries it. However if there is only ONE interface, "UTF-8", then that interface is going to be tested!
Also no "amateur programmer" is going to successfully replace any characters in any string. The function "replace chars n-m with these m-n other characters" just is not used by anybody. Check Visual Basic if you don't believe me, that function does not exist (the replacement is allowed to be a different size). Any amateur programmers coming from that background are not going to want to do anything that you cannot do in UTF-8.
Re: It's not that bad, actually
More importantly, even real Japanese or Chinese has so many spaces, digits, control characters, punctuation, and embedded Latin text that it will be shorter in UTF-8 than in UTF-16 or any of the other proposed encodings. There is no bias whatsoever in UTF-8; it really is a crude form of Huffman encoding. I also see no reason why a Chinese word that translates to a several-character word in English must be stored in one character of space; if anything, you are presenting a reverse bias.
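Whether UTF-8 comes out shorter depends on the mix of characters, so it is worth measuring rather than assuming. The sketch below (sample strings invented for illustration, not taken from the post) compares encoded sizes for a markup-heavy line and for a run of pure kanji. In this sample the markup-heavy line is smaller in UTF-8, while the pure kanji run is smaller in UTF-16.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded size in bytes for UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" is used so that no byte order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
    }


# Invented samples: one line of markup with a little Japanese in it,
# and a run of nothing but kanji.
mixed = '<p class="note">日本語 2024-01-15</p>\n'
pure_cjk = "日本語" * 10

for label, sample in (("mixed", mixed), ("pure CJK", pure_cjk)):
    print(label, encoded_sizes(sample))
```

Real documents fall somewhere between these two extremes, which is why the claim is best checked against the actual data being stored.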
Re: It's not that bad, actually
Wrong. strlen() in UTF-8 should return the number of bytes.
To prevent security problems, low-level stuff should treat filenames and so on as a sequence of bytes, with no interpretation. This is true whether UTF-8 or UTF-16 or whatever is used. Anything else can lead to bugs. "Case independent" filesystems should also be gotten rid of due to these security concerns.
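A small sketch of why "interpretation" is risky at the low level (the filenames are invented for illustration). Two names that render identically can be different byte sequences, and two names that differ as bytes can become equal once case folding is applied. A layer that compares raw bytes and a layer that normalizes or folds case will therefore disagree about whether two names refer to the same file.

```python
import unicodedata

# Invented example names: the same visible text, built two different ways.
composed = "caf\u00e9.txt"      # e-acute as a single code point
decomposed = "cafe\u0301.txt"   # plain e followed by a combining acute accent

# Compared as raw bytes, with no interpretation, the names are different.
same_bytes = composed.encode("utf-8") == decomposed.encode("utf-8")

# After Unicode normalization they are considered the same name.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

# Case folding can even change the length of a name.
same_after_casefold = "STRASSE.TXT".casefold() == "stra\u00dfe.txt".casefold()

print(same_bytes, same_after_nfc, same_after_casefold)
```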
You're probably right but I think->
I can post this with all the crap already on here...
> The i18n movement which started some years ago solves a lot, but not everything. With it, only output is guaranteed to match the best gettext will find. What about the input? Multibyte strings, produced by input parsers like kinput2 or ami in an 8bit or 7bit environment, are hard to handle and crack easily (if you press the delete button, it removes only half a sign). kinput2 and ami cannot run together in one terminal, because code pages intersect. Start and end sequences are one solution, but a bad one and one especially not meant for the long run.
>
> Imagine a document full of different languages; if I want a function that gives a line length for this doc, it will be the hell,
I say that this is too bad. But if you do make one, make one that I can easily store like this.
Byte order be damned - everyone has to do the work.
>and I haven't even mentioned what will happen when new languages
> with new start and end sequences are implemented.
I don't see what will happen.
> Also, we have so many applications which handle text and formatting. Integration of multiple language parsers into them may take 5 times more than implementing the problem-specific algorithms. I think something like Microsoft's IME, a central (system-wide) solution, is needed here. Unfortunately, IME is not Open Source, and is therefore un(sup)portable.
Are we supposed to look at what is out there?
What does IME do?
I don't have the time to check this out, but:
Let's look at it like this:
I want to keep as much of my text file in memory at once as I can. I am writing a spell checker and it can't take up my whole machine. I want to be able to look up Western words and Eastern words. They are in my database and I want that in memory as much as I can fit.
: I want to store as Multi Byte Characters (like MBCS only 64 bit).
: most of my data can be compressed because I am a Pentium 4 on a programmer's desk and have CPU cycles to burn.
: I need to be able to traverse this with a pointer.
: I am going to traverse it only once anyway, no need to expand it.
: My data is in records (equiv. to LineFeed, '\0' at end of string, etc.)
C++ will do the job, but I have to use functions on my pointer class.
I want to write this code but once, and be able to read this stuff forwards and backwards in any programming language.
Sounds like a CORBA implementation of an interface called the characterPtr, one that everyone will use, is in order.
Things I need:
I use things like / and \ and + and ~ that everyone with a computer likes to use.
These would be nice to fit in with 1 byte characters so I can say in my compression that: the next 116 characters are going to be single byte characters.
These would also be nice to fit in with 2 byte data of the next set of 115 characters.
My program wants to recognize this character, so I will call it:
1111 1111 1111 11XX (substitute the ASCII code for '/' for XX) (8 bytes)
it can be stored as :
XX
11XX
XX11
1111XX
XX1111
And, I already know how many bytes long it is, and the Motorola or Intel storage format (my 'compression' told me this).
When I look at it using my characterPtr class, it always looks like 1111 1111 1111 11XX.
It is always kept stored compressed. I have to traverse it.
Problem : I don't use CORBA, or I can't because I program in bash.
Solution : write the equivalent of the characterPtr in your language to access the whole world of the newly defined character set that is always, wherever it is stored or transmitted, compressed using the same compression scheme:
If I have data that is 100 strings, each an average length of 16 characters, and all characters are 7 bit, each string takes up about 20 bytes encoded. The extra 4 characters tell me the length of the string and that it is all single byte and stored in Intel format (doesn't matter for single byte, but I know anyway).
Everyone in the world knows how to read those extra 4 characters and so to them (us) this data looks like sets of 64 bit characters.
Solution 2 : Write a String class that uses the base of this pointer so that it can traverse the compression backwards.
The computer can no longer define the language, we must do that part.
Re: You're probably right but I think->
Well, I wish I hadn't posted that. I still have to consider adding to the compressed data, etc.
Let's just forget I posted at all, shall we?
;-) Butuque!
a whole lifestyle
It has become a whole lifestyle; to write everything in lower-case letters and put a smiley on the end of the line seems to express global thinking.
indeed. =)
But if we are honest, we know it expresses nothing but a deficiency in modern character processing.
nope. i don't feel like that.
it's really a differentiated form of communication.
for ex, i speak portuguese, and we have many accented characters. but i don't use any, not just because it's incompatible with some systems, but also because it's harder to type. (you see, i don't even use a capital 'i' for the 'i' word..)
i just use some adaptation, when appropriate.. (for ex. in portuguese "e" and "é" are two different words, and i type them as "e" and "eh". everybody understands..) and don't care about being grammatically correct anyway.
the important thing is to communicate, and i type much like the way i talk..
i also find text mode smileys way nicer than the graphical ones..
Japanese and input methods
The author should get an update on existing projects (software and otherwise): mlterm (mlterm.sourceforge.net) can already use more than 1 input method. You can change the input method on the fly, and it will do charset translation for you (so you can, e.g., use a Japanese input method to type Chinese words, or vice versa, if Unicode thinks they are the same). If you like to be confused, you can also change the charset on the fly too. And yes, there is such a thing as a Chinese console.
As for the future direction on the interoperability of input methods, work is already started on the implementation of IIIMF, the next-generation input method framework that takes even Microsoft Windows into account.
Of course, the reality now is that most programs cannot use more than 1 input method. If waiting for IIIMF is not realistic (and it is not realistic in the short or medium term), instead of lamenting the largeness of Unicode and the CJK charsets, we should try to hack on the problem and make programs able to use more than 1 input method. Perhaps it would be possible to create "input method proxies" that can call other input methods and translate the input to the target locale; the existence of mlterm shows that it is possible for a program to call more than one input method, and I see no reason why that program cannot itself be an XIM server.
(In fact, because of mlterm, I am trying to look for XIM servers for other writing systems such as Eastern/Western European, Greek, Russian, etc. But I can't find them. Perhaps only the people who frequently have to use them know where these things are.)
We Chinese used to think that our language is not "scientific", and this caused our writing system to be despised and sabotaged by our own people. (I admire the Arabs and the Jews, that they can keep their writing systems r-to-l despite Western influence.) However, this is wrong because we can use input methods to enter Chinese, and the speed of Chinese touch typing is comparable to English touch typing. It is not productive to say a language or even a writing system is unscientific; even among alphabetic writing systems, the only truly scientific alphabet is Korean hangeul (if you don't believe this, dig through sci.lang archives).
Re: a whole lifestyle
> i also find text mode smileys way nicer than the graphical ones..
yes, text mode smileys are way nicer than graphical ones. The graphical smileys are *ugly*. And they interfere with the normal (Chinese/Japanese) smileys I use. I almost always turn graphical smileys off if a program "supports" them.
Re: What happened to rxvt?
The official web site for rxvt is at rxvt.org. Please look for updates there first.
Re: ASCII
If you read comp.fonts (through Google Groups I suppose, to dig out the old old articles), you may realize that ASCII does not even support English. Why? Because there is no code point for the two kinds of dashes that are required by grammar, and no distinction between opening and closing quotation marks. Worse (contrary to what some people think and try to make other people think the same), some code points (e.g., apostrophe and grave accents) have valid alternative meanings (e.g., closing and opening single quotation marks). Some English words require accent marks. And the "Icelandic" thorn and eth letters used to be English letters a very long time ago.
IMHO, the decreasing quality of English punctuation use is directly attributable to the spread of computers.
The charset conversion problem is not Unix-specific. Users of Chinese or Japanese Windows / Macintosh see it all the time. English-speaking people were just not used to seeing this.
Re: a whole lifestyle
> i just use some adaptation, when appropriate.. (for ex. in portuguese "e" and "é" are two different words, and i type them as "e" and "eh". everybody understands..) and don't care about being grammatically correct anyway.
Man, this is ugly. This is just like using "naum" instead of "não". It is, like somebody else said in this thread, a lack of self-respect towards one's own language.
> the important thing is to communicate, and i type much like the way i talk..
Yes, this is happening more and more. And it is very unfortunate. The quality of written text is decreasing. A friend of mine even wrote "aí eu peguei e fui lá"... :(
Re: a whole lifestyle
That's true, but I only use this kind of writing when I'm "talking" to my friends on email, irc, icq.
When writing a letter or something more important I like to write the "right" way, and I still know to write correctly when I need to... :)
Language is always evolving through history, and this (net talk) is just one more adaptation.
In the middle of secrecy…
(Sometimes, a smile can do more …)
Warning:
You can read further only if you can keep this information top secret from everybody, including your friends, and, especially from nobody. Do not copy this document and, God forbid, do not distribute it through e-mail, regular mail, by word of mouth or psychic ability, even if you do not have one.
www.tupbiosystems.com/...
Re: It's not that bad, actually
> Also, it's just not true that Unicode
> text files start with a special
> character sequence. That might be a bad
> Windows habit, but it's not required by
> any standard.
I fail to see how it could be "bad". After all, it's just a matter of putting a zero-width no-break space (U+FEFF) at the beginning of the file: even if programs don't recognize it, it will just show up on screen as... nothing, since that character has neither a glyph nor any width.
You don't want to support storage in anything but UTF-8? Well, who are you to decide? Someone will probably have to, and the Byte Order Mark (a Unicode standard, just not a requirement, as any definition of "beginning of the stream" can't cover all the possible cases) is the smartest way it could have been done.
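For what it's worth, reading the mark is cheap. The following is a minimal sketch (the function name sniff_bom and the UTF-8 default are assumptions for illustration, not part of any standard) that uses the BOM constants from Python's codecs module to guess the encoding of a byte stream and strip the mark.

```python
import codecs

# UTF-32 marks are checked before UTF-16 ones, because the UTF-32-LE mark
# (FF FE 00 00) begins with the UTF-16-LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding, payload), with any leading byte order mark removed.

    Hypothetical helper: falls back to `default` when no mark is present.
    A UTF-16-LE stream whose first real character is U+0000 would be
    mistaken for UTF-32-LE; this sketch ignores that corner case.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return default, data


# A UTF-16 file as written with a leading U+FEFF, low byte first.
raw = "\ufeffhi".encode("utf-16-le")
encoding, payload = sniff_bom(raw)
print(encoding, payload.decode(encoding))
```

Data without a mark falls through to the default, so plain UTF-8 or ASCII files are read unchanged.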