Re: OT: character encodings (was: Linux 2.6.20-rc4)

From: Russell King
Date: Sunday, January 7, 2007 - 8:38 am

On Sun, Jan 07, 2007 at 11:13:57PM +0800, David Woodhouse wrote:

Wrong.  The problem is partly caused by not everything understanding
multi-byte character encodings, and text files containing absolutely
_no_ information about their character encodings.

When a text file is stored on disk, there's no way to tell what
character set the characters in that file belong to.  As a result,
ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
UTF-8 folk assume all text files are UTF-8 encoded.  This leads to
utter confusion.
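
As a minimal sketch of that confusion (the name "Müller" and the helper
read_as are made up for illustration, not taken from the thread), the
same word stored by a Latin-1 user and by a UTF-8 user produces different
bytes on disk, and each reader mangles the other's file:

```python
# Illustrative only: the sample name and read_as() are assumptions.
NAME = "M\u00fcller"  # "Mueller" spelled with u-umlaut (U+00FC)

latin1_bytes = NAME.encode("iso-8859-1")  # b'M\xfcller'     (one byte for the umlaut)
utf8_bytes = NAME.encode("utf-8")         # b'M\xc3\xbcller' (two bytes for the umlaut)


def read_as(data: bytes, charset: str) -> str:
    """Decode data under the reader's assumed charset, never raising."""
    return data.decode(charset, errors="replace")


# An ISO-8859-1 reader sees two junk characters where the umlaut should be.
mojibake = read_as(utf8_bytes, "iso-8859-1")

# A UTF-8 reader hits 0xFC, which is never valid in UTF-8, and substitutes U+FFFD.
broken = read_as(latin1_bytes, "utf-8")
```

Nothing in either byte string says which interpretation was intended;
that information exists only in the head of whoever wrote the file.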

To see what I mean, try the following:

$ git log | head -n 1000 > o
$ file -i o
o: text/x-c; charset=iso-8859-1

According to that, the charset of the 'git log' output (which in that
test included Leonard's entry) is iso-8859-1, and by that measure Linus'
mailer was right to include it as ISO-8859-1.

In reality, the output from git log contains an ad-hoc collection of
character sets making its interpretation under any one character set
incorrect.
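
One rough way to see the mix is to classify each line separately rather
than the file as a whole. The sketch below is a heuristic under stated
assumptions, not what file(1) does: classify_line and classify_lines are
hypothetical helpers, and a line counted as "utf-8" is merely one that
happens to decode as UTF-8, which a short Latin-1 sequence can also do
by accident.

```python
from collections import Counter


def classify_line(line: bytes) -> str:
    """Guess a coarse encoding class for one line of bytes."""
    if line.isascii():
        return "ascii"          # every byte < 0x80; valid in either charset
    try:
        line.decode("utf-8")
    except UnicodeDecodeError:
        return "other-8bit"     # high bytes that are not well-formed UTF-8
    return "utf-8"              # well-formed UTF-8 (still only a guess)


def classify_lines(data: bytes) -> Counter:
    """Count how many lines fall into each class."""
    return Counter(classify_line(line) for line in data.splitlines())
```

Run over the file from the example above, for instance with
classify_lines(open("o", "rb").read()), a result that has non-zero
counts for both "utf-8" and "other-8bit" would indicate that no single
charset label can be correct for the whole file.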


In other words, the UTF-8 luddites require the entire Internet to
upgrade to UTF-8 for UTF-8 to work properly.

I _regularly_ struggle with idiotic programs that assume that the world
is UTF-8 and nothing else.  UTF-8 does _not_ solve these inter-operability
problems - it only makes the entire situation worse by introducing yet
another different charset.  (Yes, it's also true that there are programs
which assume the world is only another, different, character set.)

Rather than having these problems fixed properly (by looking at the LANG
environment variable) many of these programs now assume that the world
is UTF-8.  It isn't.
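
A minimal sketch of what "looking at the environment" could mean,
assuming the usual POSIX precedence (LC_ALL overrides LC_CTYPE, which
overrides LANG) and locale names of the form
language[_territory][.codeset][@modifier]. The function charset_from_env
is a hypothetical helper; a real program would more likely call
setlocale() and then nl_langinfo(CODESET) and let the C library do this.

```python
import os


def charset_from_env(env=None):
    """Return the codeset named by the locale environment, or None.

    Assumed precedence: LC_ALL, then LC_CTYPE, then LANG. The first
    variable that is set and non-empty decides the answer.
    """
    env = os.environ if env is None else env
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            # Drop any "@modifier", then take what follows the first ".".
            codeset = value.partition("@")[0].partition(".")[2]
            return codeset or None  # e.g. "C" or "en_GB" names no codeset
    return None
```

With LANG=en_GB.ISO-8859-1 this yields "ISO-8859-1"; with only LANG=C it
yields None, and the program has to fall back on some default rather
than silently assuming UTF-8.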

elinks is one such program.  It now assumes that every display is
UTF-8 _only_.  That's no better than programs which assume ISO-8859-1
only or US-ASCII only.

So, in short, UTF-8 is all fine and dandy if your _entire_ universe
is UTF-8 enabled.  If you're operating in a mixed charset environment
it's one bloody big pain in the butt.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of: