On Sun, 2007-01-07 at 15:38 +0000, Russell King wrote:
No, that's a different problem; not the one you were referring to above.
And it's a problem which is rapidly diminishing, too.
That's a real problem, yes -- but it was a problem long before UTF-8 was
added to the collection of character sets in use. Even within the UK, we
had to choose between ISO8859-1 and ISO8859-15.
Only if you are making different assumptions about the _same_ set of
files, on the _same_ system. But that would be silly.
If I suddenly "assume" that my laptop has a Dvorak keyboard layout
despite that blatantly not being true, I'll get the same kind of
confusion. That isn't Dvorak's fault, either.
If, on the other hand, I have one system which is entirely ISO8859-1 and
a separate system which is entirely UTF-8, each of those are _fine_ and
unconfusing. Obviously I have to make sure files are properly labelled
and converted in transport between different systems -- but that's
nothing new.
Yes. When you stored it on disk, the character set information was lost.
If you were running a mixed-charset system then attempting to recreating
the lost information with heuristics and assumptions is obviously going
to be problematic.
Actually, because UTF-8 allows me to run a system which is purely based
on a single character set, I get better results when I try the same
trick:
shinybook /shiny/git/mtd-2.6 $ git log | head -n 1000 > o
shinybook /shiny/git/mtd-2.6 $ file -i o
o: text/plain; charset=utf-8
Again, the problem of labelling isn't at all new to UTF-8. The only
thing that's new with UTF-8 is that it's now actually _practical_ to
have a system which only uses one character set throughout, and which
thus _can_ get its 'guess' right when you don't bother to label
everything.
No, the contents of the git log ought to be UTF-8, unless people have
been misusing it. Git stores its text in UTF-8 (by default), and is
capable of converting to and from legacy character sets on input
(git-commit) and output (git-log).
(Obviously, that's likely to be lossy if you convert it to any given
legacy character set, because ∀ legacy character set, ∃ characters
within UTF-8 that aren't in that legacy character set.)
Not at all. The problems arise when character set information is lost,
which can happen at any point during the flow of information.
Anything we can do to reduce the likelihood of charset information being
lost is an overall improvement. We already demonstrated an example
(git-log > o; file -i o) of a case where a _consistent_ system gets it
right, while an inconsistent system introduces an error.
If any individual system processes all text in a single character set,
then that system is no longer a likely source of corruption due to
labelling errors. And because UTF-8 fully covers the set of characters
which can be represented in the legacy character sets, it allows us to
deploy systems which do just that.
I don't think I've encountered such a program in my distribution of
choice. If I had, I would have filed a bug. Making assumptions about
character sets, outside of the locally-controlled environment, is
invalid. That's been true since the first 8-bit character sets, if not
longer.
A mixed charset environment was _already_ a pain in the butt, because
almost nobody got labelling right. It's wrong to blame that on UTF-8.
--
dwmw2
-