Re: git on MacOSX and files with decomposed utf-8 file names

From: Kevin Ballard
Date: Wednesday, January 16, 2008 - 4:11 pm

On Jan 16, 2008, at 5:32 PM, Linus Torvalds wrote:





You're right, it doesn't actually have to store the normalized form.
And yes, it's possible to compare without normalizing them.
Admittedly, I don't know much about the implementation details of
unicode, but I would assume that the easiest way to compare two
strings is to normalize them first. But in the case of the filesystem,
normalization actually is important if you're thinking about filenames
in terms of characters rather than bytes. When I feed the filesystem a
given unicode string, it has to find the file I'm talking about -
should it do a relatively expensive unicode-sensitive comparison of
all the filenames with the one I gave it, or should it just normalize
all names and do the much cheaper lookup that way? I don't know about
you, but I'd prefer to let my filesystem normalize the name and run
faster.
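
To make the normalize-then-look-up idea concrete, here is a rough Python
sketch. It is only an illustration, not how any real filesystem is
implemented: the NormalizingDirectory class and normalized_key helper are
made-up names, and NFD is picked arbitrarily as the stored form.

```python
import unicodedata


def normalized_key(name: str) -> str:
    # Hypothetical helper: reduce a name to one canonical form (NFD here,
    # chosen only for illustration) so equivalent strings share a key.
    return unicodedata.normalize("NFD", name)


class NormalizingDirectory:
    """Toy in-memory directory that normalizes names once, on the way in."""

    def __init__(self):
        self._entries = {}

    def create(self, name, contents):
        # Normalize at creation time; the lookup table only holds one form.
        self._entries[normalized_key(name)] = contents

    def lookup(self, name):
        # Normalize the query, then do a plain hash lookup. No per-entry
        # Unicode-aware comparison is needed.
        return self._entries.get(normalized_key(name))


composed = "M\u00e4rchen"      # precomposed U+00E4
decomposed = "Ma\u0308rchen"   # 'a' followed by combining diaeresis U+0308

directory = NormalizingDirectory()
directory.create(composed, "fairy tale")
found = directory.lookup(decomposed)
```

The two spellings are different Python strings, yet the lookup with the
decomposed one still finds the entry created with the composed one.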



There's a difference between "looks similar" as in "Polish" vs
"polish", and actually is the same string as in "Ma<UMLAUT
MODIFIER>rchen" vs "M<A WITH UMLAUT>rchen". Capitalization has a valid
semantic meaning, normalization doesn't. The only way to argue that
normalization is wrong is by providing a good reason to preserve the
exact byte sequence, and so far the only reason I've seen is to help
git. Applications in general don't care one whit about the byte
sequence of the filename, they care about the underlying file the name
represents. Additionally, it would be a terrible experience for a user
to enter "Märchen" and have the application say "sorry, I can't find
this file" simply because the application used decomposed characters
and the filename used composed characters. Unless the user is
knowledgeable about the OS, filesystems, and unicode, they wouldn't
have a hope of figuring out what the problem was.
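
For anyone who wants to see the mismatch directly, a small Python sketch
(the variable names are just for the example) shows that the two forms of
the same name encode to different UTF-8 byte sequences, and that
normalizing either one to a common form makes them compare equal:

```python
import unicodedata

composed = "M\u00e4rchen"                             # <A WITH UMLAUT>
decomposed = unicodedata.normalize("NFD", composed)   # 'a' + <UMLAUT MODIFIER>

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# A byte-for-byte comparison treats these as two different names.
bytes_match = composed_bytes == decomposed_bytes

# Normalizing both sides to the same form (NFC here) makes them match.
strings_match = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(composed_bytes)    # b'M\xc3\xa4rchen'
print(decomposed_bytes)  # b'Ma\xcc\x88rchen'
print(bytes_match, strings_match)  # False True
```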


How do you figure? When I type "Märchen", I'm typing a string, not a
byte sequence. I have no control over the normalization of the
characters. Therefore, depending on what program I'm typing the name
in, I might use the same normalization as the filename, or I might
miss. It's completely out of my control. This is why the filesystem
has to step in and say "You composed that character differently, but I
know you were trying to specify this file".



There are valid reasons for case to matter, but what reason is there
for "single character" vs "two character overlay" to matter in
filenames? They're different representations of the exact same string,
and that's what a filename is - a string.

It seems like your arguments stem from the assumption that the user
cares about the byte sequence that represents the filename, which is
wrong. The user has no idea what the byte sequence is - the user cares
about the string. Normalization is meant to help computers, not users,
and claiming that different normalizations of the same string produce
different meaningful strings is complete bunk.

If you were to have two different files on your system, both of them
called "Märchen", but one precomposed and one decomposed, how would
you specify which one you wanted? Unless Linux has a special text
input system which gives the user control over the normalization of
their typed characters, you'd have to write out the UTF-8 bytes
manually.

I just don't understand this insistence on treating the specific byte
sequence that makes up the filename as significant.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
