Re: Why is this taking the form of a religious debate?
> To be more clear about this, getting the
> Nth character of a fixed-char-size
> string is a constant-time operation (O(1))
> and takes the same amount of time
> whether N is equal to 5 or 500. On the
> other hand, getting that same character
> in a variable-width-char string is a
> linear operation (O(n)) and takes
> approximately one hundred times as long
> to get the 500th character versus the
> 5th character.
> The second step is to impress upon
> programmers the algorithmic efficiency
> advantages of using fixed-width
> characters in their programs instead of
This is not necessarily true: with combining characters, bidirectional text, and other Unicode features, you still need to do the same amount of work with wide characters as you do with multibyte characters.
An additional benefit of using UTF-8 internally is byte-order independence, which sidesteps a perennial problem in making code portable.
Sorry, UTF-8 *is* the way to go
Rather than rewriting virtually all existing source code, it makes infinitely more sense to go with UTF-8. In fact, this decision has already been made: disparate operating systems from Windows to Linux (Linux in a big way) are slowly standardizing on it.
UTF-8 gives you compatibility with ASCII; full access to the full 31-bit UCS range (UCS saves the one extra bit for error codes, sign bits, etc., very smart!); an error-recoverable byte stream; stateless, computationally trivial conversion; very low overhead for most existing text; overwhelming compatibility with existing software (no code changes for most programs!); and relatively trivial string-width computation.
see for yourself: "http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c"
Using UCS-4 would be a huge headache with few benefits. It would also introduce all new kinds of bugs, for example assuming that the number of UCS-4 chars equals the display width of the string (not true: see combining chars, zero-width glyphs, etc.)