[OpenAFS-devel] AFS vs UNICODE
Jeffrey Altman
jaltman@secure-endpoints.com
Tue, 06 May 2008 17:00:58 -0400
Roland Kuhn wrote:
> Well, certainly. But I find it very irritating that a filesystem should
> somehow interpret and _change_ a filename based on the assumption of
> UTF-8 encoding, even if the filename's byte sequence happens to conform
> to the UTF-8 rules. Why bother? It's much easier and much more portable
> to regard filenames as opaque byte sequences.
If you do not canonicalize the form that is written to the file server
directory entry, there are two problems:
(1) The same text in Unicode can be represented by different sequences
of code points. As a result, client A and client B could each create a
file whose name cannot be visually distinguished from the other by the
end user. Now which one do you open?
(2) Since the directory lookups are performed using a hash table, a file
with the name being searched for might exist but not be found, because
the input to the hash function on client B is different from the input
used to create the entry on client A.
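To make both problems concrete, here is a minimal Python sketch. The
file name, the use of a Python dict as a stand-in for the directory hash
table, and the choice of NFC as the canonical form are all assumptions
made for illustration; none of this is the actual AFS directory code.

```python
import unicodedata

# Two spellings of the same visible name (hypothetical example).
composed = "caf\u00e9.txt"     # U+00E9, precomposed "e with acute"
decomposed = "cafe\u0301.txt"  # "e" followed by U+0301 combining acute

def canonical(name: str) -> bytes:
    """Normalize to NFC (an assumed choice) and encode as UTF-8."""
    return unicodedata.normalize("NFC", name).encode("utf-8")

# Directory keyed on raw octets: client A creates, client B looks up.
raw_dir = {composed.encode("utf-8"): "entry created by client A"}
raw_hit = raw_dir.get(decomposed.encode("utf-8"))

# Directory keyed on the canonical form: both spellings hash alike.
canon_dir = {canonical(composed): "entry created by client A"}
canon_hit = canon_dir.get(canonical(decomposed))

print(composed == decomposed)   # False: different code point sequences
print(raw_hit)                  # None: the lookup misses the existing entry
print(canon_hit)                # entry created by client A
```

With the raw-octet directory, client B's lookup misses, and nothing
would stop client B from creating a second entry that looks identical on
screen. With the canonical form as the key, both clients reach the same
entry.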
Storing file names as opaque octet sequences is broken in other ways.
Depending on the character set used on the client, the file name might
or might not be representable, since the octet sequence carries no
indication of whether it is CP437, CP850, CP1252, ISO Latin-1,
ISO Latin-9, UTF-7, UTF-8, etc.
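
As a rough illustration of that ambiguity, the Python sketch below
decodes one stored octet sequence under several of those character sets.
The sample bytes and the helper name interpretations() are made up for
this example; the codec names are the ones Python's standard library
uses.

```python
from typing import Dict, Optional

# Candidate character sets a client might assume for the stored octets.
CANDIDATES = ["cp437", "cp850", "cp1252", "latin-1", "iso8859-15", "utf-8"]

def interpretations(raw: bytes) -> Dict[str, Optional[str]]:
    """Decode raw under each candidate; None means the octets are invalid."""
    out: Dict[str, Optional[str]] = {}
    for codec in CANDIDATES:
        try:
            out[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            out[codec] = None
    return out

# A hypothetical directory entry containing one octet above 0x7F.
stored = b"price\xa4"

for codec, text in interpretations(stored).items():
    print(codec, repr(text))
```

The octet 0xA4 comes out as a different character depending on the
assumed character set (a currency sign in ISO Latin-1, a euro sign in
ISO Latin-9, a lowercase n with tilde in CP437), and it is not valid
UTF-8 at all, so the same entry shows up under different names on
different clients or cannot be displayed.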