On Sun, Jan 07, 2007 at 11:13:57PM +0800, David Woodhouse wrote:
Wrong. The problem is caused partly by software that doesn't understand
multi-byte character encodings, and partly by text files containing
absolutely _no_ information about their character encodings.
When a text file is stored on disk, there's no way to tell what
character set the characters in that file belong to. As a result,
ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
UTF-8 folk assume all text files are UTF-8 encoded. This leads to
utter confusion.
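A minimal sketch of that confusion, assuming nothing beyond the Python
standard library. The helper name decode_as is made up for illustration;
the point is only that identical bytes give different results depending
on which charset the reader assumes.

```python
# The same character stored under two encodings (U+00E9, e-acute).
raw_utf8 = "\u00e9".encode("utf-8")         # two bytes: 0xC3 0xA9
raw_latin1 = "\u00e9".encode("iso-8859-1")  # one byte:  0xE9


def decode_as(data: bytes, charset: str) -> str:
    """Decode data under an assumed charset; report failure instead of raising."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return "<invalid " + charset + ">"


# An ISO-8859-1 reader silently misreads the UTF-8 bytes as two
# characters (U+00C3 followed by U+00A9).
print(ascii(decode_as(raw_utf8, "iso-8859-1")))

# A UTF-8 reader cannot decode the lone ISO-8859-1 byte at all.
print(ascii(decode_as(raw_latin1, "utf-8")))
```

Neither reader is told it guessed wrong by the file itself; one gets
garbage, the other gets an error.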
To see what I mean, try the following:
$ git log | head -n 1000 > o
$ file -i o
o: text/x-c; charset=iso-8859-1
According to that, the charset of the 'git log' output (which in that
test included Leonard's entry) is iso-8859-1, and on that basis Linus'
mailer was right to include it as ISO-8859-1.
In reality, the output from git log contains an ad-hoc collection of
character sets, which makes its interpretation under any one character
set incorrect.
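As a rough illustration of such a mixed stream, here is a sketch that
labels each line separately. The function classify_lines and the sample
author names are invented for this example, not taken from any real log.
Note that "other" only means "not valid UTF-8"; the bytes alone do not
say which 8-bit charset was meant.

```python
def classify_lines(data: bytes) -> list:
    """Label each line as 'ascii', 'utf-8', or 'other' (not valid UTF-8)."""
    labels = []
    for line in data.split(b"\n"):
        if all(byte < 0x80 for byte in line):
            labels.append("ascii")
            continue
        try:
            line.decode("utf-8")
            labels.append("utf-8")
        except UnicodeDecodeError:
            labels.append("other")
    return labels


# Three made-up log lines, each written under a different assumption.
sample = (
    b"Author: Plain Name\n"
    + "Author: J\u00fcrgen".encode("utf-8") + b"\n"
    + "Author: Ren\u00e9".encode("iso-8859-1")
)
print(classify_lines(sample))
```

A tool that has to pick one charset for the whole stream will get at
least one of those lines wrong.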
In other words, the UTF-8 luddites require the entire Internet to
upgrade to UTF-8 for UTF-8 to work properly.
I _regularly_ struggle with idiotic programs that assume that the world
is UTF-8 and nothing else. UTF-8 does _not_ solve these inter-operability
problems - it only makes the entire situation worse by introducing yet
another different charset. (Yes, it's also true that there are programs
which assume the world is only another, different, character set.)
Rather than having these problems fixed properly (by looking at the LANG
environment variable) many of these programs now assume that the world
is UTF-8. It isn't.
elinks is one such program. It now assumes that every display is UTF-8
_only_.
That's no better than programs which assume ISO-8859-1 only or US-ASCII
only.
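For what looking at the environment could mean in practice, a simplified
sketch follows. The function charset_from_env is hypothetical, and it
only parses the variable text; a real program would more likely call
setlocale(LC_CTYPE, "") and then nl_langinfo(CODESET), which also knows
the system default when the locale name carries no codeset.

```python
import os
from typing import Mapping, Optional


def charset_from_env(env: Mapping[str, str]) -> Optional[str]:
    """Return the codeset named by the locale variables, or None if absent.

    POSIX precedence for character handling is LC_ALL, then LC_CTYPE,
    then LANG. An empty value counts as unset.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            break
    else:
        return None  # none of the variables is set
    # Locale names look like language[_territory][.codeset][@modifier].
    value = value.split("@", 1)[0]
    if "." not in value:
        return None  # e.g. "C" or "en_GB": codeset is left to the system
    return value.split(".", 1)[1]


print(charset_from_env(os.environ))
```

With LANG=en_GB.ISO-8859-1 this returns "ISO-8859-1", and the program
has no excuse for emitting UTF-8 to that terminal.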
So, in short, UTF-8 is all fine and dandy if your _entire_ universe
is UTF-8 enabled. If you're operating in a mixed charset environment
it's one bloody big pain in the butt.
--
Russell King
Linux kernel maintainer of: 2.6 ARM Linux - http://www.arm.linux.org.uk/