[OpenAFS-devel] AFS vs UNICODE

Jeffrey Altman jaltman@secure-endpoints.com
Tue, 06 May 2008 17:00:58 -0400


Roland Kuhn wrote:

> Well, certainly. But I find it very irritating that a filesystem should 
> somehow interpret and _change_ a filename based on the assumption of 
> UTF-8 encoding, even if the filename's byte sequence happens to conform 
> to the UTF-8 rules. Why bother? It's much easier and much more portable 
> to regard filenames as opaque byte sequences.

If you do not canonicalize the form that is written to the file server
directory entry, there are two problems:

(1) The same text in Unicode can be represented by different sequences 
of code points (for example, a precomposed accented letter versus a 
base letter followed by a combining accent).  As a result, client A and 
client B could each create a file whose name the end user cannot 
visually distinguish from the other.   Now which one do you open?

(2) Since the directory lookups are performed using a hash table, a file 
with the name being searched for might exist but cannot be found 
because the input to the hash function on client B is different from the 
input used to create the entry on client A.
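
Both problems can be reproduced in a few lines.  What follows is only a
minimal sketch in Python, not AFS code: the file name is made up, a
Python dict stands in for the hashed directory, and NFC is picked as
the canonical form purely for illustration (nothing above says which
form a client or server should use).

```python
import unicodedata

# Hypothetical file name typed on two clients.  Client A sends the
# precomposed "e with acute" (U+00E9); client B sends "e" followed by
# a combining acute accent (U+0301).  Both render identically.
name_a = "caf\u00e9.txt"
name_b = "cafe\u0301.txt"

bytes_a = name_a.encode("utf-8")
bytes_b = name_b.encode("utf-8")

# Problem (1): the octet sequences differ, so both entries could
# coexist in one directory even though they look the same.
same_octets = bytes_a == bytes_b

# Problem (2): a dict is a hash table keyed on the raw octets, used
# here as a stand-in for the directory (assumption, not the AFS hash).
raw_directory = {bytes_a: "entry created by client A"}
found_without_canonicalization = bytes_b in raw_directory


def canonical(name: str) -> bytes:
    """Normalize to NFC (an assumed choice) before encoding."""
    return unicodedata.normalize("NFC", name).encode("utf-8")


# With canonicalization on both create and lookup, the keys agree.
canonical_directory = {canonical(name_a): "entry created by client A"}
found_with_canonicalization = canonical(name_b) in canonical_directory

print(same_octets, found_without_canonicalization,
      found_with_canonicalization)
```

The lookup from client B misses until both sides normalize the name
before it is hashed.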

Storing file names as opaque octet sequences is broken in other ways. 
Depending on the character set used on the client, the file name might or 
might not be representable, since the octet sequence contains no 
indication of whether it is CP437, CP850, CP1252, ISO Latin-1,
ISO Latin-9, UTF-7, UTF-8, etc.
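
As a rough illustration of that ambiguity, the sketch below decodes one
made-up octet sequence under several of the character sets listed
above.  The byte string and the helper name decode_or_none are
hypothetical, and the codec names are the ones Python's standard
library uses.

```python
# One stored name, as opaque octets.  0xE9 is "e with acute" in
# CP1252 and ISO Latin-1, but something else (or invalid) elsewhere.
raw_name = b"caf\xe9.txt"


def decode_or_none(data: bytes, codec: str):
    """Return the decoded text, or None if the octets are not valid
    in the given character set."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


codecs_to_try = ["cp437", "cp850", "cp1252", "latin-1", "utf-8"]
interpretations = {c: decode_or_none(raw_name, c) for c in codecs_to_try}

for codec, text in interpretations.items():
    # repr() keeps the output readable on terminals that cannot
    # display every character.
    print(f"{codec:8s} -> {text!r}")
```

The same octets come out as different names under the DOS code pages
than under CP1252 or ISO Latin-1, and they are not valid UTF-8 at all,
so a client has no way to tell which reading was intended.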