On Jan 16, 2008, at 5:32 PM, Linus Torvalds wrote:
You're right, it doesn't actually have to store the normalized form.
And yes, it's possible to compare names without normalizing them.
Admittedly, I don't know much about the implementation details of
Unicode, but I would assume that the easiest way to compare two
strings is to normalize them first. But in the case of the filesystem,
normalization actually is important if you're thinking about filenames
in terms of characters rather than bytes. When I feed the filesystem a
given Unicode string, it has to find the file I'm talking about -
should it do a relatively expensive Unicode-sensitive comparison of
all the filenames with the one I gave it, or should it just normalize
all names and do the much cheaper lookup that way? I don't know about
you, but I'd prefer to let my filesystem normalize the name and run
faster.
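To make the two lookup strategies concrete, here is a rough Python sketch.
It is not how any real filesystem works; the names scan_lookup and
NormalizedIndex are made up for illustration, and NFC is an arbitrary
choice of normalization form. Note that the index keeps the original
name as its value, so nothing forces the stored name itself to be
normalized.

```python
import unicodedata

def canonical(name: str) -> str:
    # NFC is an arbitrary choice for this sketch; any single form works.
    return unicodedata.normalize("NFC", name)

def scan_lookup(stored_names, wanted):
    """Names kept as given: each lookup compares against every entry."""
    target = canonical(wanted)
    for name in stored_names:
        if canonical(name) == target:
            return name
    return None

class NormalizedIndex:
    """Hypothetical index keyed by canonical form: one dict probe per lookup."""

    def __init__(self):
        self._by_key = {}

    def add(self, name):
        # The key is normalized; the value keeps the name exactly as given.
        self._by_key[canonical(name)] = name

    def lookup(self, wanted):
        return self._by_key.get(canonical(wanted))

stored = ["Ma\u0308rchen", "notes.txt"]   # first name is decomposed
index = NormalizedIndex()
for n in stored:
    index.add(n)

typed = "M\u00e4rchen"                     # user typed the precomposed form
found_by_scan = scan_lookup(stored, typed)
found_by_index = index.lookup(typed)
```

Both approaches find the file; the difference is that the scan does work
proportional to the number of names on every lookup, while the index
normalizes once per name and once per query.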
There's a difference between "looks similar" as in "Polish" vs
"polish", and actually is the same string as in "Ma<UMLAUT
MODIFIER>rchen" vs "M<A WITH UMLAUT>rchen". Capitalization has a valid
semantic meaning; normalization doesn't. The only way to argue that
normalization is wrong is by providing a good reason to preserve the
exact byte sequence, and so far the only reason I've seen is to help
git. Applications in general don't care one whit about the byte
sequence of the filename; they care about the underlying file the name
represents. Additionally, it would be a terrible experience for a user
to enter "Märchen" and have the application say "sorry, I can't find
this file" simply because the application used decomposed characters
and the filename used composed characters. Unless the user is
knowledgeable about the OS, filesystems, and Unicode, they wouldn't
have a hope of figuring out what the problem was.
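As a small illustration of the distinction (a sketch using Python's
standard unicodedata module; the helper name same_string is my own),
the two spellings of "Märchen" differ as UTF-8 bytes but are equal once
normalized, whereas "Polish" and "polish" stay different:

```python
import unicodedata

composed = "M\u00e4rchen"      # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "Ma\u0308rchen"   # 'a' followed by U+0308 COMBINING DIAERESIS

def same_string(a: str, b: str) -> bool:
    # Compare under canonical equivalence by normalizing both sides.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The raw byte sequences are different...
bytes_differ = composed.encode("utf-8") != decomposed.encode("utf-8")
# ...but the strings are canonically equivalent.
forms_equivalent = same_string(composed, decomposed)
# Case is a real difference; normalization does not fold it away.
case_equivalent = same_string("Polish", "polish")

print(bytes_differ, forms_equivalent, case_equivalent)
```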
How do you figure? When I type "Märchen", I'm typing a string, not a
byte sequence. I have no control over the normalization of the
characters. Therefore, depending on what program I'm typing the name
in, I might use the same normalization as the filename, or I might
miss. It's completely out of my control. This is why the filesystem
has to step in and say "You composed that character differently, but I
know you were trying to specify this file".
There are valid reasons for case to matter, but what reason is there
for "single character" vs "two character overlay" to matter in
filenames? They're different representations of the exact same string,
and that's what a filename is - a string.
It seems like your arguments stem from the assumption that the user
cares about the byte sequence that represents the filename, which is
wrong. The user has no idea what the byte sequence is - the user cares
about the string. Normalization is meant to help computers, not users,
and claiming that different normalizations of the same string produce
different meaningful strings is complete bunk.
If you were to have two different files on your system, both of them
called "Märchen", but one precomposed and one decomposed, how would
you specify which one you wanted? Unless Linux has a special text
input system which gives the user control over the normalization of
their typed characters, you'd have to write out the UTF-8 bytes
manually.
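A toy model of that situation, assuming a directory that keys entries
on raw bytes (the dict below is only a stand-in for such a filesystem,
not a real API):

```python
import unicodedata

# Stand-in for a byte-preserving directory: keys are raw UTF-8 names.
directory = {
    "M\u00e4rchen".encode("utf-8"): "first file",    # precomposed
    "Ma\u0308rchen".encode("utf-8"): "second file",  # decomposed
}

# Both entries render as the same visible name...
rendered = [key.decode("utf-8") for key in directory]
distinct_after_nfc = {unicodedata.normalize("NFC", n) for n in rendered}

# ...so picking one means spelling out its bytes by hand.
picked = directory[b"Ma\xcc\x88rchen"]
```

The directory holds two entries, distinct_after_nfc holds only one
string, and the only unambiguous way to name the second file is the
escaped byte string.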
I just don't understand this insistence on treating the specific byte
sequence that makes up the filename as significant.
-Kevin Ballard
-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com