[OpenAFS-devel] AFS vs UNICODE

Tue, 06 May 2008 18:49:38 -0400

u+openafsdev-t07O@chalmers.se wrote:
> This problem is nothing unicode-specific, the users can easily create
> file names even in plain ascii which are visually indistinguishable.
> (easiest with certain fonts :)
> 
> As soon as application software can list files and let the user pick one,
> it is no longer a remarkable problem in practice.

This is not true.  The user interfaces on each of the operating systems
will all present the differing strings to the user as the same name, so
picking from a list does not help.  This is not a font issue.

>> (2) Since the directory lookups are performed using a hash table, a file 
>> with the name being searched for might exist but it cannot be found 
>> because the input to the hash function on client B is different than the 
>> input used to create the entry on client A.
> 
> If the name is a byte sequence, this can not happen; you imply that
> the file name _is_ a character string.

A file name from the perspective of the user is a character string.
The user types in a name via the user interface and the user interface
determines how to represent that name, not the user.  If the user enters
the name on a MacOS X system she will get a UNICODE sequence that is in
decomposed form.  If the user enters the same name on Windows she will
get a UNICODE sequence that is in composed form.

If the user tries to access her files from both machines she will have
interop problems.
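
As a minimal sketch of the difference (Python is used here purely for
illustration, and the file name is a made-up example), the same
user-visible name yields two different octet sequences depending on
which normalization form the user interface hands down:

```python
import unicodedata

# Hypothetical file name containing two accented characters.
name = "r\u00e9sum\u00e9.txt"

# Composed form (NFC), roughly what a Windows UI would produce.
composed = unicodedata.normalize("NFC", name)
# Decomposed form (NFD), roughly what a MacOS X UI would produce.
decomposed = unicodedata.normalize("NFD", name)

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# Both strings render identically, yet they differ as code points
# and as octets on the wire.
print(composed == decomposed)
print(composed_bytes.hex())
print(decomposed_bytes.hex())
```

Both strings display as the same name, but they compare unequal and
encode to octet sequences of different lengths.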

> (Of course, applications do read user input as text - to create new files,
> but most often not for opening existing files.)
> Compatibility in file naming (saved at one occasion should be readable
> at another, possibly on another computer and by another program)
> belongs at the application level. File naming compatibility does not differ
> essentially from compatibility of file contents.

We already have evidence to the contrary.

>> Storing file names as opaque octet sequences is broken in other ways. 
>> Depending on the character set used on the client the file name might or 
>> might not be representable since the octet sequence contains no 
>> indication whether the sequence is CP437, CP850, CP1252, ISO Latin-1,
>> ISO-Latin-9, UTF-7, UTF-8, etc.
> 
> This is just the result of broken practices - using limited and thus
> incompatible encodings ultimately leads to breakage and no efforts
> can eliminate the pain afterwards.

Correct.  But with Unicode we do have the ability to eliminate the
problems associated with (a) no normalization; (b) decomposed 
normalization; and (c) composed normalization.
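
A rough sketch of the idea, assuming a toy directory modelled as a
Python dict keyed by octets (this is not the AFS directory format or
its hash function, and the helper names are hypothetical): if both
clients map names to one agreed normalization form before hashing, the
lookup succeeds no matter which form the user interface produced.

```python
import unicodedata

def raw_key(name):
    # Key is whatever octets the client happened to produce.
    return name.encode("utf-8")

def normalized_key(name, form="NFC"):
    # Key is computed only after mapping to one agreed form.
    # The choice of NFC here is an assumption for illustration.
    return unicodedata.normalize(form, name).encode("utf-8")

# Client A (decomposed input) creates the entry.
name_from_client_a = "re\u0301sume\u0301.txt"
# Client B (composed input) looks up the "same" name.
name_from_client_b = "r\u00e9sum\u00e9.txt"

directory_raw = {raw_key(name_from_client_a): "entry"}
directory_norm = {normalized_key(name_from_client_a): "entry"}

found_raw = raw_key(name_from_client_b) in directory_raw
found_norm = normalized_key(name_from_client_b) in directory_norm

print(found_raw)   # lookup on raw octets misses the entry
print(found_norm)  # lookup on normalized octets finds it
```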

> The most important, I think:
> 
> Applying encodings to file names (treating them as text as opposed to
> byte sequences) is broken fundamentally - this can _not_ be done properly.

I disagree.

> The same file can be opened by two processes running with different locales,
> on the same computer and even at the same time.
> There is hardly any information about file name encoding in an open()
> system call. How does the file system know which encoding is used by
> a particular process for a particular open()?

There is no knowledge at the open() or CreateFile() level.  There is
extensive knowledge at the user interface level.

Jeffrey Altman