[OpenAFS-devel] AFS vs UNICODE

Wed, 7 May 2008 00:30:55 +0200

Hello Jeffrey,

On Tue, May 06, 2008 at 05:00:58PM -0400, Jeffrey Altman wrote:
> Roland Kuhn wrote:
> 
> >Well, certainly. But I find it very irritating that a filesystem should 
> >somehow interpret and _change_ a filename based on the assumption of 
> >UTF-8 encoding, even if the filename's byte sequence happens to conform 
> >to the UTF-8 rules. Why bother? It's much easier and much more portable 
> >to regard filenames as opaque byte sequences.
> 
> If you do not canonicalize the form that is written to the file server
> directory entry there are two problems:
> 
> (1) The same text in Unicode can be represented by different sequences 
> of characters.  As a result you could have client A and client B both 
> create a file with the same name that can not be visually distinguished 
> by the end user.   Now which one do you open?

This problem is not Unicode-specific: users can easily create file names,
even in plain ASCII, that are visually indistinguishable
(easiest with certain fonts :).

As soon as application software can list files and let the user pick one,
it is no longer a significant problem in practice.
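
To make this concrete, here is a minimal Python sketch (the variable names
are only illustrative, nothing here is taken from AFS code). It shows two
Unicode spellings of the same visible name that differ as bytes, and a pair
of plain ASCII names that many fonts render almost identically.

```python
import unicodedata

# Two Unicode spellings of the visible name "é.txt".
composed = unicodedata.normalize("NFC", "\u00e9.txt")    # single code point U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9.txt")  # "e" followed by U+0301

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# Same text after normalization, but different byte sequences.
same_text = unicodedata.normalize("NFC", decomposed) == composed
same_bytes = composed_bytes == decomposed_bytes

# Plain ASCII can be confusing too: capital I versus lowercase l.
ascii_a = b"Il.txt"
ascii_b = b"lI.txt"
ascii_distinct = ascii_a != ascii_b
```

In both cases the names are distinct to the file system and only look alike
to a human reading them.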

> (2) Since the directory lookups are performed using a hash table, a file 
> with the name being searched for might exist but it cannot be found 
> because the input to the hash function on client B is different than the 
> input used to create the entry on client A.

If the name is a byte sequence, this cannot happen; your argument
presupposes that the file name _is_ a character string.
(Of course, applications do read user input as text to create new files,
but most often not to open existing ones.)
Compatibility in file naming (a name saved on one occasion should be readable
on another, possibly on another computer and by another program)
belongs at the application level. File naming compatibility does not differ
essentially from compatibility of file contents.
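
A rough sketch of the lookup argument, in Python. The dict below merely
stands in for a directory hash table; it is an assumption for illustration
and says nothing about the real AFS directory format. The helper names
`create` and `lookup` are hypothetical.

```python
import unicodedata

# Hypothetical directory: a hash table keyed by the raw name bytes.
directory = {}

def create(name_bytes, fid):
    """Store an entry under exactly the bytes the client supplied."""
    directory[name_bytes] = fid

def lookup(name_bytes):
    """Return the file id for these exact bytes, or None if absent."""
    return directory.get(name_bytes)

# Client A creates the entry using the decomposed spelling.
name_a = unicodedata.normalize("NFD", "\u00e9.txt").encode("utf-8")
create(name_a, 1)

# A client that takes the bytes from the directory listing finds the file.
listed = next(iter(directory))
found_from_listing = lookup(listed)

# A client that builds the name from typed text (composed spelling) does not.
name_b = unicodedata.normalize("NFC", "\u00e9.txt").encode("utf-8")
found_from_typing = lookup(name_b)
```

The lookup only misses when a client constructs the name from text instead
of reusing the bytes it was given, which is the case I argue belongs to the
application.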

Any file name works if you are not typing the name but reading it
from the directory as bytes. On the other hand, _any_ byte sequence,
even one "interpreted as text and normalized", will fail to display properly
in programs under some locales. Nevertheless, all the files stay
accessible, since each one can be opened by its unique name as read from
the directory.
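
A small Python sketch of this, assuming a POSIX system whose file system
stores names as arbitrary bytes (typical on Linux; other platforms may
reject or rewrite such names). The function name `roundtrip_names` is made
up for the example.

```python
import os
import tempfile

def roundtrip_names(names):
    """Create files with the given byte names, then list and open them as bytes."""
    opened = []
    with tempfile.TemporaryDirectory() as tmp:
        base = os.fsencode(tmp)
        for name in names:
            with open(os.path.join(base, name), "wb") as f:
                f.write(b"x")
        # Passing a bytes path makes os.listdir return the names as bytes.
        for name in os.listdir(base):
            with open(os.path.join(base, name), "rb") as f:
                f.read()
            opened.append(name)
    return sorted(opened)

# b"caf\xe9.txt" is valid ISO Latin-1 but not valid UTF-8.
names = [b"caf\xe9.txt", b"cafe\xcc\x81.txt", b"plain.txt"]
result = roundtrip_names(names)
```

Every file is opened again from the bytes returned by the listing, without
the program ever deciding what encoding the names are in.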

> Storing file names as opaque octet sequences is broken in other ways. 
> Depending on the character set used on the client the file name might or 
> might not be representable since the octet sequence contains no 
> indication whether the sequence is CP437, CP850, CP1252, ISO Latin-1,
> ISO-Latin-9, UTF-7, UTF-8, etc.

This is just the result of broken practices: using limited and thus
incompatible encodings ultimately leads to breakage, and no effort
can eliminate the pain afterwards.

The most important point, I think:

Applying encodings to file names (treating them as text as opposed to
byte sequences) is fundamentally broken: it can _not_ be done properly.

The same file can be opened by two processes running with different locales,
on the same computer and even at the same time.
An open() system call carries no information about the encoding of the
file name. How does the file system know which encoding is used by
a particular process for a particular open()?
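
To illustrate, a short Python sketch of what two processes with different
locale encodings would make of the same stored bytes. The helper `decode_as`
and the sample name are hypothetical.

```python
raw_name = b"caf\xe9.txt"  # bytes as stored in the directory entry

def decode_as(raw, encoding):
    """Return the text a process using `encoding` would see, or None if undecodable."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

as_latin1 = decode_as(raw_name, "latin-1")  # reads as "café.txt"
as_cp437 = decode_as(raw_name, "cp437")     # decodes, but to different text
as_utf8 = decode_as(raw_name, "utf-8")      # not valid UTF-8 at all
```

All three processes pass the same bytes to open(); only their idea of the
text differs, and the file system has no way to see that.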

Regards,
Rune