Re: OT: character encodings (was: Linux 2.6.20-rc4)

From: David Woodhouse
Date: Sunday, January 7, 2007 - 9:29 am

On Sun, 2007-01-07 at 15:38 +0000, Russell King wrote:

No, that's a different problem; not the one you were referring to above.
And it's a problem which is rapidly diminishing, too.


That's a real problem, yes -- but it was a problem long before UTF-8 was
added to the collection of character sets in use. Even within the UK, we
had to choose between ISO8859-1 and ISO8859-15.
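
To make that concrete, here is a small Python sketch (my own
illustration, not something from the original thread) of the best-known
difference between the two: byte 0xA4 is the generic currency sign in
ISO8859-1 and the euro sign in ISO8859-15.

```python
# The same single byte, read under two different charset assumptions.
raw = b"\xa4"

as_latin1 = raw.decode("iso8859-1")    # U+00A4, generic currency sign
as_latin9 = raw.decode("iso8859-15")   # U+20AC, euro sign

# Neither decode fails; the byte alone cannot say which one was meant.
print(repr(as_latin1), repr(as_latin9))
```

Both decodes succeed, which is exactly why an unlabelled file gives
you no way of telling which charset was intended.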


Only if you are making different assumptions about the _same_ set of
files, on the _same_ system. But that would be silly.

If I suddenly "assume" that my laptop has a Dvorak keyboard layout
despite that blatantly not being true, I'll get the same kind of
confusion. That isn't Dvorak's fault, either.

If, on the other hand, I have one system which is entirely ISO8859-1 and
a separate system which is entirely UTF-8, each of those is _fine_ and
unconfusing. Obviously I have to make sure files are properly labelled
and converted in transport between different systems -- but that's
nothing new.


Yes. When you stored it on disk, the character set information was lost.
If you were running a mixed-charset system then attempting to recreate
the lost information with heuristics and assumptions is obviously going
to be problematic.
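
A rough Python sketch of the difference between converting with the
label and guessing without it. The helper name transcode() and the
sample string are made up for illustration; only the standard codecs
are assumed.

```python
def transcode(payload: bytes, source: str, target: str) -> bytes:
    """Convert bytes whose charset is known (labelled) to another charset."""
    return payload.decode(source).encode(target)


text = "na\u00efve caf\u00e9"
on_disk = text.encode("utf-8")      # the bytes themselves carry no label

# Label kept: the conversion is exact.
converted = transcode(on_disk, "utf-8", "iso8859-1")

# Label lost and a wrong assumption made: no error is raised, the text
# is just silently wrong.
misread = on_disk.decode("iso8859-1")

print(converted.decode("iso8859-1"))
print(misread)
```

The second case is the nasty one: nothing fails, so the corruption
only shows up when a human looks at the result.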

Actually, because UTF-8 allows me to run a system which is purely based
on a single character set, I get better results when I try the same
trick:
	shinybook /shiny/git/mtd-2.6 $ git log | head -n 1000 > o
	shinybook /shiny/git/mtd-2.6 $ file -i o
	o: text/plain; charset=utf-8

Again, the problem of labelling isn't at all new to UTF-8. The only
thing that's new with UTF-8 is that it's now actually _practical_ to
have a system which only uses one character set throughout, and which
thus _can_ get its 'guess' right when you don't bother to label
everything.
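
One reason the guess tends to work is that UTF-8 has enough internal
structure to be checked. The following is a minimal Python sketch of
such a check; guess_is_utf8() is a hypothetical helper of mine and I am
not claiming it is how file(1) does it.

```python
def guess_is_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "caf\u00e9".encode("utf-8")        # b'caf\xc3\xa9'
latin1_bytes = "caf\u00e9".encode("iso8859-1")  # b'caf\xe9'

print(guess_is_utf8(utf8_bytes))     # valid UTF-8
print(guess_is_utf8(latin1_bytes))   # a lone 0xE9 is not valid UTF-8
```

Pure ASCII passes the check too, which is harmless since ASCII is a
subset of UTF-8. On a system that only ever writes UTF-8, the answer is
right by construction.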


No, the contents of the git log ought to be UTF-8, unless people have
been misusing it. Git stores its text in UTF-8 (by default), and is
capable of converting to and from legacy character sets on input
(git-commit) and output (git-log).

(Obviously, that's likely to be lossy if you convert it to any given
legacy character set, because ∀ legacy character set, ∃ characters
within UTF-8 that aren't in that legacy character set.)
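
A small Python sketch of that lossiness, under the assumption that
three charsets and four sample characters are enough to make the point
(the helper unencodable() is hypothetical):

```python
def unencodable(text: str, charset: str) -> list:
    """List the characters of text that charset cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# e-acute, euro sign, Cyrillic ZHE, and the 'for all' sign used above.
sample = "\u00e9\u20ac\u0416\u2200"

for charset in ("iso8859-1", "iso8859-15", "koi8-r"):
    print(charset, [hex(ord(ch)) for ch in unencodable(sample, charset)])

# Forcing the conversion anyway substitutes '?' for what does not fit.
lossy = sample.encode("iso8859-1", errors="replace")
```

Each legacy charset drops a different subset of the sample, while
UTF-8 drops nothing.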
 

Not at all. The problems arise when character set information is lost,
which can happen at any point during the flow of information.

Anything we can do to reduce the likelihood of charset information being
lost is an overall improvement. We already demonstrated an example
(git-log > o; file -i o) of a case where a _consistent_ system gets it
right, while an inconsistent system introduces an error.

If any individual system processes all text in a single character set,
then that system is no longer a likely source of corruption due to
labelling errors. And because UTF-8 fully covers the set of characters
which can be represented in the legacy character sets, it allows us to
deploy systems which do just that.
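
The coverage claim can be spot-checked. This Python sketch assumes a
single-byte charset in which all 256 byte values are defined (true of
the codecs named here as Python ships them) and pushes every byte
through UTF-8 and back; survives_utf8_round_trip() is a name I made up.

```python
def survives_utf8_round_trip(charset: str) -> bool:
    """Check that every byte of a single-byte charset survives UTF-8."""
    legacy = bytes(range(256))
    text = legacy.decode(charset)        # legacy bytes -> characters
    via_utf8 = text.encode("utf-8")      # characters -> UTF-8
    return via_utf8.decode("utf-8").encode(charset) == legacy


for charset in ("iso8859-1", "iso8859-15", "koi8-r"):
    print(charset, survives_utf8_round_trip(charset))
```

Nothing is lost going from the legacy charset into UTF-8 and back; the
loss only happens in the other direction.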


I don't think I've encountered such a program in my distribution of
choice. If I had, I would have filed a bug. Making assumptions about
character sets, outside of the locally-controlled environment, is
invalid. That's been true since the first 8-bit character sets, if not
longer.


A mixed charset environment was _already_ a pain in the butt, because
almost nobody got labelling right. It's wrong to blame that on UTF-8.

-- 
dwmw2
