[OpenAFS-devel] AFS vs UNICODE

Wed, 23 Jul 2008 00:54:50 -0400

--On Sunday, July 20, 2008 12:09:18 AM -0400 Jeffrey Altman 
<jaltman@secure-endpoints.com> wrote:

> Mattias Pantzare wrote:
>> 2008/7/19 Jeffrey Altman <jaltman@secure-endpoints.com>:
>>> The Windows client code is correct.  The question is how we are going to
>>> deal with this stuff for platforms where the process locale is not
>>> guaranteed to be UTF-8.  We need to figure out how ZFS, which does
>>> Unicode normalization, is handling this.
>>
>> If you tell ZFS to do Unicode normalization on a filesystem you have
>> to use UTF-8.
>>
>> Search for normalization on this page:
>> http://docs.sun.com/app/docs/doc/819-2240/zfs-1m?a=view
>
> After speaking with one of the relevant developers from Sun, the NFS and
> CIFS file servers will enforce the use of UTF-8 as well if the data set
> has been tagged to be Unicode.
>
> We might be able to do something similar with volumes or directories
> that are tagged to be Unicode only.

Actually, I think there is a fairly simple behavior we can use that will do 
something useful, based on a previous discussion (possibly with the same 
Sun developer) about ZFS....

As you're doing now with Windows, when creating a file, use exactly what 
was passed in from the upper layer.  This might be UTF-8, in some arbitrary 
normalization, or it might be something else.

When looking up existing names, prefer an exact octet-wise match, as you're 
doing now with Windows.  This will allow disambiguation of multiple 
differently-normalized UTF-8 names, and will also allow lookup of filenames 
in other 8-bit charsets, provided the application and OS give you the name 
exactly as it appears (this is not as unlikely as it sounds; often the name 
given you will be one that was selected from a GUI display of names you 
provided, and so will be exactly correct even if nothing knows what charset 
was really in use).

If the exact lookup fails, but the requested name is valid UTF-8, try a 
normalized lookup.  Of course, you can only compare the name against 
directory entries that are also valid UTF-8; entries that are not valid 
UTF-8 will simply never match in this step.
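
The lookup order described above can be sketched roughly as follows.  This 
is only an illustration in Python, not OpenAFS code: the function names are 
hypothetical, a directory is modeled as a plain list of byte strings, and 
NFC is an arbitrary choice of normalization form (any single form would do 
for comparison purposes).

```python
import unicodedata


def _normalize_utf8(name: bytes):
    """Return the NFC text of `name` if it is valid UTF-8, else None.

    Hypothetical helper; NFC is an assumed choice of normalization form.
    """
    try:
        return unicodedata.normalize("NFC", name.decode("utf-8"))
    except UnicodeDecodeError:
        return None


def lookup(entries, requested: bytes):
    """Find `requested` among the raw byte names in `entries`.

    Returns the stored name exactly as it appears in the directory,
    or None if nothing matches.
    """
    entries = list(entries)

    # Step 1: prefer an exact octet-wise match.  This works for any
    # charset and disambiguates differently-normalized UTF-8 names.
    if requested in entries:
        return requested

    # Step 2: only if the requested name is valid UTF-8, fall back to
    # a normalized comparison.
    wanted = _normalize_utf8(requested)
    if wanted is None:
        return None

    for entry in entries:
        # Entries that are not valid UTF-8 normalize to None and so
        # can never equal `wanted`.
        if _normalize_utf8(entry) == wanted:
            return entry
    return None
```

With a directory holding only the decomposed name b"cafe\xcc\x81", a lookup 
for the precomposed b"caf\xc3\xa9" falls through to step 2 and returns the 
stored decomposed name.  If both forms are present, each is found by the 
exact match in step 1.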

If you do this, you get these properties...

- When ASCII is used, everything will Just Work(tm)
- When UTF-8 is used, everything will Just Work(tm)
- When legacy 8-bit charsets are used, things will always work if everyone
  agrees on the charset in use, and will often work well enough even if
  not. This is no worse than the situation today.

What you do _not_ get is the ability to pass in a UTF-8 filename and have a 
lookup succeed when the filename is actually represented in a legacy 
charset, or vice versa.  This essentially means that transition from a 
legacy 8-bit character set to UTF-8 will be painful.

In practice, I think we can ease this pain by providing mechanisms to allow 
server admins, client admins, client users, and/or content owners to 
advertise a legacy charset that is in use, probably at the volume, server, 
or cell level.  This information can be used by clients to convert between 
UTF-8 and the advertised legacy charset for the purpose of doing lookups. 
Of course, even in this case, new names should always be stored exactly as 
given, without conversion.
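
As a rough illustration of how a client might use an advertised legacy 
charset during lookups, here is a small Python sketch.  Everything in it is 
an assumption made for the example: the function name is hypothetical, the 
charset is passed in as a Python codec name, and how the charset would 
actually be advertised (volume, server, or cell) is left open.  It only 
builds the list of byte strings to try; names are never rewritten on 
creation.

```python
def candidate_names(requested: bytes, legacy_charset: str):
    """Return the byte strings to try for a lookup, in preference order.

    The name as given always comes first.  If it is valid UTF-8, its
    encoding in the advertised legacy charset is added; otherwise it is
    assumed to be in the legacy charset and its UTF-8 form is added.
    """
    candidates = [requested]

    try:
        text = requested.decode("utf-8")
    except UnicodeDecodeError:
        text = None

    converted = None
    if text is not None:
        # Requested name is UTF-8: also try the legacy representation,
        # if the legacy charset can express it at all.
        try:
            converted = text.encode(legacy_charset)
        except UnicodeEncodeError:
            converted = None
    else:
        # Not UTF-8: treat it as the legacy charset and try UTF-8.
        try:
            converted = requested.decode(legacy_charset).encode("utf-8")
        except UnicodeDecodeError:
            converted = None

    # Pure ASCII names convert to themselves; avoid a duplicate try.
    if converted is not None and converted != requested:
        candidates.append(converted)
    return candidates
```

For example, with "latin-1" advertised, a UTF-8 request for 
b"na\xc3\xafve" would also try the stored form b"na\xefve", and the 
reverse holds for a Latin-1 request.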

-- Jeff