[OpenAFS-devel] AFS vs UNICODE

u+openafsdev-t07O@chalmers.se
Wed, 7 May 2008 14:49:03 +0200


Jeffrey,

On Wed, May 07, 2008 at 07:24:31AM -0400, Jeffrey Altman wrote:
> u+openafsdev-t07O@chalmers.se wrote:
> >You still did not answer: what happens if one application accepts user
> >input as several non-ASCII characters in Latin-1 and passes it on to
> >open(), and the other does the same in another encoding?
> >How can the file system guess what the two users (or even the same one
> >working with two different applications and locales) mean, and "fix"
> >what the applications supply?
> 
> What happens is exactly what happens today.

What happens today? I guess it does not work at all. It cannot work,
and it never will. The only solution is "all applications use the same
universal text encoding"; otherwise the file system cannot know
the semantics of the file name data passed in via a system call.

Given that solution, there is no need to do anything in the file system;
we can (continue to?) treat this data as raw bytes.
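
To make the case concrete, here is a minimal sketch (Python, purely
illustrative, with a made-up file name - not any actual AFS code) of
what two applications with different locales actually hand to open():

    # Both users typed the same visible name "smörgås",
    # but each application encodes it per its own locale.
    name = "sm\u00f6rg\u00e5s"

    bytes_latin1 = name.encode("latin-1")  # b'sm\xf6rg\xe5s'
    bytes_utf8   = name.encode("utf-8")    # b'sm\xc3\xb6rg\xc3\xa5s'

    # The file system sees only the bytes - two different names,
    # hence two different files, for one and the same "text".
    print(bytes_latin1 == bytes_utf8)      # False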

> UTF-8 is ISO 2022 compatible.  An ISO Latin-1 sequence is already a 
> normalized string.

(You are technically correct about Latin-1, but) the actual question was
about two encodings which are not compatible with each other for the
purpose of presenting the same "character string".
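
In byte terms the incompatibility means one stored name displays
differently for every reader; a small sketch (Python, hypothetical
bytes, nothing AFS-specific):

    raw = b"sm\xf6rg\xe5s"         # bytes stored by a Latin-1 application

    print(raw.decode("latin-1"))   # 'smörgås' - what the writer meant
    print(raw.decode("iso-8859-7"))# 'smφrgεs' - what a Greek-locale user sees
    raw.decode("utf-8")            # UnicodeDecodeError - not UTF-8 at all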

> Interoperability between heterogeneous operating systems requires common 
> interfaces.  AFS in particular requires commonality if we are ever going
> to support internationalized cell names and volume names.

Treating file names as text

1. does not by itself lead to a common interface, and
2. is inherently impossible to accomplish except in special cases (even if
special cases like "all processes on this computer use the same
encoding" are quite common).

The problem with treating file names as text exists both inside an isolated
computer and in a distributed scenario; there is no difference in that
respect.

> We will do so by adopting UNICODE as the character set and we will agree
> on a standard UNICODE encoding and normalization.

File systems cannot "agree on encoding and normalization";
they do not have the information necessary to encode and normalize properly.
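
Normalization alone already shows the problem. Here is a minimal sketch
(Python, a purely illustrative character): two equally valid UTF-8
spellings of the same character reach the file system, and nothing in
the byte stream says which one the user meant:

    import unicodedata

    nfc = unicodedata.normalize("NFC", "e\u0301")  # precomposed U+00E9
    nfd = unicodedata.normalize("NFD", "\u00e9")   # 'e' + combining acute

    print(nfc == nfd)            # False - different code point sequences
    print(nfc.encode("utf-8"))   # b'\xc3\xa9'
    print(nfd.encode("utf-8"))   # b'e\xcc\x81'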

> Operating systems that do not support a standard locale will continue to 
> treat file names as octet sequences but will provide a degraded user 
> experience.

There is no "standard locale" :) - locales exists to give each user
the best and inherently different for different users experience,
but I understand that this is not your point.
I am all for a "standard encoding" but I can not agree that the filesystem
is a place to implement band-aids for certain two operating systems'
lack of cooperation. Why not treat them like the rest?

If file names are to be treated as text by AFS, it will help
MacOS users feel relaxed, but it will also inevitably break things
for users of some other platforms, won't it?

This is not compatible with globality, not in my eyes.

> will be true for later versions of Linux and Solaris.  Just about 
> everyone is moving towards UTF-8 based locales and modern day file 
> systems assume UNICODE for file names.

I see a combination of unrelated statements:

"everyone is moving towards UTF-8 based locales" - right
"modern day file systems assume UNICODE for file names" - wrong,
impossible on POSIX-compliant OSs - or shall we ignore their existence? :)
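
POSIX allows any bytes except NUL and '/' in a name; a quick sketch
(Python, assuming a POSIX system and using /tmp as a scratch directory)
of a perfectly legal name that is not valid UTF-8 at all:

    import os

    # b'\xff\xfe' can never occur in well-formed UTF-8:
    fd = os.open(b"/tmp/\xff\xfe-not-utf8", os.O_CREAT | os.O_WRONLY, 0o600)
    os.close(fd)

    # The raw bytes come back from the file system unchanged:
    print(b"\xff\xfe-not-utf8" in os.listdir(b"/tmp"))  # True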

"UNICODE for file names" does not guarantee interoperability,
as was the origin of this discussion. A workaround for MacOS and Windows
will not solve it for other systems. Even if some systems decide to use
Unicode for textual file names, they may have their own rationales
for choosing a different Unicode encoding.
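
Even "Unicode" alone settles nothing on the wire; the same string still
has several byte representations (another minimal Python sketch):

    name = "r\u00e4ksm\u00f6rg\u00e5s"  # 'räksmörgås'

    print(name.encode("utf-8"))      # b'r\xc3\xa4ksm\xc3\xb6rg\xc3\xa5s'
    print(name.encode("utf-16-le"))  # 20 bytes, every other one NUL
    print(name.encode("utf-32-le"))  # 40 bytes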

> Assuming that the file name is just a sequence of bytes works well
> on a single standalone machine.

Alas, this is not true :) There are different processes
and users on a single machine...

> It does not provide a reasonable
> experience for end users for network based protocols whether it be
> a file system protocol, FTP, SSH, etc.

Interoperability is needed in both the local and the global scenarios,
but only applications can fix it - file systems just can't.

Interoperability problems are less visible on standalone machines,
as they often run homogeneous software and have homogeneous users.
Unfortunately, they are not immune to the same encoding headaches
as networked ones.

I believe I understand your point of view, and it makes certain practical sense.

Nevertheless, the concept of "textual" file names is a problematic old
design decision, originally by Microsoft (?), which now has a chance to
influence all computers using AFS. This is what I would wish to avoid.

Sincere and best regards,
Rune