On Jan 21, 2008, at 2:41 PM, Linus Torvalds wrote:
I believe I already responded to the issue of hashing. In summary,
just re-define your hash function to convert the string to a specific
encoding. Sure, you'll lose some speed, but we're already assuming
that it's worth taking a speed hit in order to treat filenames as
strings (please don't argue this point, it's an opinion, not a factual
statement, and I'm not necessarily saying I agree with it, I'm just
saying it's valid).
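
A minimal Python sketch of that idea, for illustration only. The helper name `filename_hash` and the choice of NFC plus SHA-1 are assumptions made for this example, not what git or any filesystem actually does. The point is just that canonically equivalent names are brought to one form before hashing, so they hash the same.

```python
import hashlib
import unicodedata


def filename_hash(name: str) -> str:
    """Hash a filename after normalizing it to one canonical form.

    NFC and SHA-1 are illustrative choices (assumptions), nothing more.
    """
    canonical = unicodedata.normalize("NFC", name)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


composed = "caf\u00e9"      # 'é' as a single code point (NFC)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (NFD)

# The code point sequences differ, but the strings are canonically equivalent.
print(composed == decomposed)
print(filename_hash(composed) == filename_hash(decomposed))
```

The extra normalization pass on every hash is the speed hit mentioned above.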

Perhaps that is the reason, I don't know (neither do you, you're just
guessing). However, my point still stands - as long as the string
stays canonically equivalent, it doesn't matter to me if the
filesystem changes the encoding, since I'm working at the string level.
Someone has to look at the octets, but it doesn't have to be me. As
long as I use Unicode-aware libraries and such, I can let the
underlying system care about the byte sequence and my code will be clean.

It does? Why on earth should it do that? The filename doesn't contribute
to the listed filesize on OS X.

kevin@KBLAPTOP:~> echo foo > foo; echo foo > foobar
kevin@KBLAPTOP:~> ls -l foo*
-rw-r--r-- 1 kevin kevin 4 Jan 21 14:50 foo
-rw-r--r-- 1 kevin kevin 4 Jan 21 14:50 foobar

It would be singularly stupid for the filesize to reflect the
filename, especially since this means you would report different
filesizes for hardlinks.

Visible at some level, sure, but not visible at the level my code
works on. And thus, I don't have to care about it.

I'm not sure what you mean. The byte sequence is different from Latin1
to UTF-8 even if you use NFC, so I don't think, in this case, it makes
any difference whether you use NFC or NFD. Yes, the codepoints are the
same in Latin1 and UTF-8 if you use NFC, but that's hardly relevant.
Please correct me if I'm wrong, but I believe Latin1->UTF-8->Latin1
conversion will always produce the same Latin1 text whether you use
NFC or NFD.
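
One rough way to check this is the Python sketch below. The helper name `latin1_round_trip` is hypothetical, and the sketch assumes the converter recomposes (NFC) before encoding back to Latin1. That assumption matters: NFD text such as 'e' followed by U+0301 has no direct Latin1 encoding, so the NFD case only round-trips if that step is there.

```python
import unicodedata


def latin1_round_trip(data: bytes, form: str) -> bytes:
    """Latin1 -> UTF-8 (normalized to `form`) -> Latin1."""
    text = unicodedata.normalize(form, data.decode("latin-1"))
    utf8 = text.encode("utf-8")
    # Assumption: the converter recomposes before going back to Latin1.
    # Without this step, decomposed text cannot be encoded as Latin1.
    recomposed = unicodedata.normalize("NFC", utf8.decode("utf-8"))
    return recomposed.encode("latin-1")


# Every possible Latin1 byte value, so every Latin1 character is covered.
every_byte = bytes(range(256))
for form in ("NFC", "NFD"):
    print(form, latin1_round_trip(every_byte, form) == every_byte)
```

Under that assumption, both forms are expected to give back the original bytes.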

The only reason it's particularly inconvenient is because it's
different from what most other systems picked. And if you want to
blame someone for that, blame Unicode for having so many different
normalization forms.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com