[OpenAFS-devel] AFS vs UNICODE

Wed, 7 May 2008 11:56:03 +0200

Hi Jeffrey,

On Tue, May 06, 2008 at 06:49:38PM -0400, Jeffrey Altman wrote:
> u+openafsdev-t07O@chalmers.se wrote:
> >This problem is nothing unicode-specific, the users can easily create
> >file names even in plain ascii which are visually indistinguishable.
> >(easiest with certain fonts :)
> >
> >As soon as application software can list files and let the user pick one,
> >it is no longer a remarkable problem in practice.
> 
> This is not true since the user interfaces on each of the operating
> systems will all represent the strings to the user as the same name.
> This is not a font issue.

I never said it is a font issue.
I said this problem exists no matter what we do with Unicode.

A file named "abc" and another one named "a <backspace>bc"
are visually indistinguishable in ls output on most ttys.
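
To make the backspace example concrete, here is a minimal sketch in Python. The function render_on_tty is a hypothetical, heavily simplified terminal model (backspace moves the cursor one cell left, every other character overwrites the cell under the cursor); real terminals do much more.

```python
def render_on_tty(name: str) -> str:
    """Simplified terminal model (an assumption, not a real tty emulator)."""
    cells = []   # characters currently visible on the line
    cursor = 0   # index of the cell the next character is written to
    for ch in name:
        if ch == "\b":
            # Backspace only moves the cursor; it does not erase anything.
            cursor = max(0, cursor - 1)
        else:
            if cursor < len(cells):
                cells[cursor] = ch      # overwrite an existing cell
            else:
                cells.append(ch)        # extend the line
            cursor += 1
    return "".join(cells)


plain = "abc"
tricky = "a \bbc"   # 'a', space, backspace, 'b', 'c'

same_on_screen = render_on_tty(plain) == render_on_tty(tricky)
same_bytes = plain.encode("ascii") == tricky.encode("ascii")
```

Both names use nothing but ASCII, yet same_on_screen is true while same_bytes is false.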

> A file name from the perspective of the user is a character string.

From the perspective of the user, the file name's _visual_representation_ is a character
string, but not necessarily the file name itself.
The user normally never interacts with the file system at a low level; there is
always an application in between, and applications tend to do all kinds of
transformations, e.g. stripping/adding .xxx suffixes.

> The user types in a name via the user interface and the user interface
> determines how to represent that name not the user.  If the user enters
> the name on a MacOS X system she will get a UNICODE sequence that is in
> decomposed form.  If the user enters the same name on Windows she will
> get a UNICODE sequence that is in composed form.
> 
> If the user tries to access her files from both machines she will have
> interop problems.

Not really. Given that file names are treated as byte sequences, she will be
able to open the file without any problems: she just chooses it from the list.
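
A small sketch of what I mean, in Python. The dict named directory is a hypothetical stand-in for a directory in a file system that stores names as raw bytes; open_by_listing and open_by_typing are illustrative helpers, not any real API.

```python
import unicodedata

name = "\u00e9.txt"  # "e with acute accent" followed by ".txt"

# The same visible name as two different byte sequences.
composed = unicodedata.normalize("NFC", name).encode("utf-8")
decomposed = unicodedata.normalize("NFD", name).encode("utf-8")

# Hypothetical byte-oriented directory: raw name -> file contents.
# The file was created from a client that produced the decomposed form.
directory = {decomposed: b"written on the Mac"}


def open_by_listing(directory):
    """Pick the first listed entry and open it with exactly those bytes."""
    listed_name = next(iter(directory))
    return directory[listed_name]


def open_by_typing(directory, typed_name: bytes):
    """Open using bytes produced independently by some input method."""
    return directory.get(typed_name)  # None if the bytes do not match
```

Opening via the listing always succeeds because the stored bytes are passed back unchanged. Only a name retyped on a client that produces the composed form misses the entry, and that is the case where the application has the knowledge to normalize before calling open().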

I guess the user may have a harder time trying to use the contents of a file created by
some Windows application on a Mac and vice versa :) It is at the application level that
compatibility must be addressed, and file naming is easy to address at that level.

Different applications do not even have to agree on the exact encoding unless they
interchange the same data format, in which case they do have to share certain common
knowledge. The file system does not and cannot have that knowledge.

> >(Of course, applications do read user input as text - to create new files,
> >but most often not for opening existing files.)
> >Compatibility in file naming (saved at one occasion should be readable
> >at another, possibly on another computer and by another program)
> >belongs at the application level. File naming compatibility does not differ
> >essentially from compatibility of file contents.
> 
> We already have evidence to the contrary.

I may have missed the evidence? It might seem we interpret the same facts differently.

> >>Depending on the character set used on the client the file name might or 
> >>might not be representable since the octet sequence contains no 
> >>indication whether the sequence is CP437, CP850, CP1252, ISO Latin-1,
> >>ISO-Latin-9, UTF-7, UTF-8, etc.
> >
> >This is just the result of broken practices - using limited and thus
> >incompatible encodings ultimately leads to breakage and no efforts
> >can eliminate the pain afterwards.
> 
> Correct.  But with Unicode we do have the ability to eliminate the
> problems associated with (a) no normalization; (b) decomposed 
> normalization; and (c) composed normalization.

It is a problem stemming from the requirement to treat file names as text.
So you make an assumption that this requirement is very important
and try to implement it. My point is:

1. the requirement is not really necessary
2. it is unfortunately fundamentally impossible to implement the "file names are text"
   concept consistently, unless the encoding/decoding is done by the applications;
   the file system layer lacks the necessary knowledge.

> >The same file can be opened by two processes running with different 
> >locales,
> >on the same computer and even at the same time.
> >There is hardly any information about file name encoding in an open()
> >system call. How does the file system know which encoding is used by
> >a particular process for a particular open()?
> 
> There is no knowledge at the open() or CreateFile() level.   There is 
> extensive knowledge at the user interface level.

Here we both agree! So the preparation of the file name _is_ to be done by
the application which has the knowledge. If the applications on different platforms
disagree in how to apply their knowledge, it is not a problem a file system
can fix. It can try to guess and band-aid, but this is going to break other things.

You still have not answered: what happens if one application accepts user input
as several non-ASCII characters in Latin-1 and passes it on to open(),
and another does the same in a different encoding?
How can the file system guess what the two users (or even the same one working
with two different applications and locales) mean and "fix" what the applications supply?
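
To illustrate the question, a minimal Python sketch. The name "blåbär" and the choice of Latin-1 versus UTF-8 are just example assumptions; any pair of encodings that differ outside ASCII behaves similarly.

```python
typed = "bl\u00e5b\u00e4r"  # what both users believe they typed: "blåbär"

# Two applications running under different locales hand open() different bytes.
as_latin1 = typed.encode("latin-1")
as_utf8 = typed.encode("utf-8")

# The file system sees only the bytes. If it assumes UTF-8, the Latin-1
# name is not even decodable:
try:
    as_latin1.decode("utf-8")
    latin1_valid_as_utf8 = True
except UnicodeDecodeError:
    latin1_valid_as_utf8 = False

# If it assumes Latin-1, the UTF-8 name decodes without error, but into
# a different string than the user meant:
utf8_read_as_latin1 = as_utf8.decode("latin-1")
```

Nothing in either byte sequence says which encoding produced it, so any "fix" applied below the application has to be a guess.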

Maybe the band-aid you are discussing is worth the effort, but in my eyes the confusion
brought by such an approach does more harm than good.

Best regards,
Rune