[OpenAFS-devel] AFS vs UNICODE
Jeffrey Hutzelman
jhutz+@cmu.edu
Wed, 23 Jul 2008 00:54:50 -0400
--On Sunday, July 20, 2008 12:09:18 AM -0400 Jeffrey Altman
<jaltman@secure-endpoints.com> wrote:
> Mattias Pantzare wrote:
>> 2008/7/19 Jeffrey Altman <jaltman@secure-endpoints.com>:
>>> The Windows client code is correct. The question is how we are going
>>> to deal with this stuff for platforms where the process locale is not
>>> guaranteed to be UTF-8. We need to figure out how ZFS, which does
>>> Unicode normalization, is handling this.
>>
>> If you tell ZFS to do Unicode normalization on a filesystem, you have
>> to use UTF-8.
>>
>> Search for normalization on this page:
>> http://docs.sun.com/app/docs/doc/819-2240/zfs-1m?a=view
>
> After speaking with one of the relevant developers from Sun, the NFS and
> CIFS file servers will enforce the use of UTF-8 as well if the data set
> has been tagged to be Unicode.
>
> We might be able to do something similar with volumes or directories
> that are tagged to be Unicode only.
Actually, I think there is a fairly simple behavior we can use that will do
something useful, based on a previous discussion (possibly with the same
Sun developer) about ZFS....
As you're doing now with Windows, when creating a file, use exactly what
was passed in from the upper layer. This might be UTF-8, in some arbitrary
normalization, or it might be something else.
When looking up existing names, prefer an exact octet-wise match, as you're
doing now with Windows. This will allow disambiguation of multiple
differently-normalized UTF-8 names, and will also allow lookup of filenames
in other 8-bit charsets, provided the application and OS give you the name
exactly as it appears. This is not as unlikely as it sounds: often the name
you are given was selected from a GUI display of names you provided, and so
will be byte-for-byte correct even if nothing knows what charset was really
in use.
If the exact lookup fails, but the requested name is valid UTF-8, try a
normalized lookup. Of course, you can only compare the name against
directory entries that are also valid UTF-8; other entries simply cannot
match.
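The lookup order above can be sketched roughly as follows. This is only an
illustration in Python (the real code would live in the C client/server);
the function and variable names are mine, and I've picked NFC for the
normalized comparison purely for the example, since no particular
normalization form is specified here:

```python
import unicodedata

def lookup(requested, entries):
    """Sketch of the proposed lookup: requested is the name from the
    upper layer as raw octets; entries are directory entries as raw
    octets. Names are never converted when stored, only compared."""
    # Step 1: prefer an exact octet-wise match. This disambiguates
    # multiple differently-normalized UTF-8 names and still finds
    # names in unknown legacy 8-bit charsets.
    for entry in entries:
        if entry == requested:
            return entry

    # Step 2: only if the requested name is valid UTF-8, fall back to
    # a normalized comparison (NFC chosen here for illustration).
    try:
        wanted = unicodedata.normalize("NFC", requested.decode("utf-8"))
    except UnicodeDecodeError:
        return None  # not valid UTF-8; nothing more to try

    for entry in entries:
        try:
            name = entry.decode("utf-8")
        except UnicodeDecodeError:
            continue  # non-UTF-8 entry: cannot match in this step
        if unicodedata.normalize("NFC", name) == wanted:
            return entry
    return None
```

For example, a request for NFC "é" (0xC3 0xA9) would still find a directory
entry stored in NFD form (0x65 0xCC 0x81), but a bare Latin-1 0xE9 entry is
only found by an exact octet match.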
If you do this, you get these properties...
- When ASCII is used, everything will Just Work(tm)
- When UTF-8 is used, everything will Just Work(tm)
- When legacy 8-bit charsets are used, things will always work if everyone
agrees on the charset in use, and will often work well enough even if
not. This is no worse than the situation today.
What you do _not_ get is the ability to pass in a UTF-8 filename and have a
lookup succeed when the filename is actually represented in a legacy
charset, or vice versa. This essentially means that transition from a
legacy 8-bit character set to UTF-8 will be painful.
In practice, I think we can ease this pain by providing mechanisms to allow
server admins, client admins, client users, and/or content owners to
advertise a legacy charset that is in use, probably at the volume, server,
or cell level. This information can be used by clients to convert between
UTF-8 and the advertised legacy charset for the purpose of doing lookups.
Of course, even in this case, new names should always be stored exactly as
given, without conversion.
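To make the advertised-charset idea concrete, here is one possible shape of
the client-side conversion, again just a Python sketch with names of my own
invention, using ISO-8859-1 as a stand-in for whatever legacy charset a
volume, server, or cell might advertise:

```python
def fallback_candidates(utf8_name, advertised_charset):
    """Generate extra lookup candidates by converting between UTF-8 and
    an advertised legacy charset, in both directions. Used only for
    lookups; new names are always stored exactly as given."""
    candidates = []
    # UTF-8 request -> legacy-charset directory entry
    try:
        text = utf8_name.decode("utf-8")
        candidates.append(text.encode(advertised_charset))
    except (UnicodeDecodeError, UnicodeEncodeError):
        pass  # request isn't UTF-8, or doesn't fit the legacy charset
    # Legacy-charset request -> UTF-8 directory entry
    try:
        candidates.append(utf8_name.decode(advertised_charset).encode("utf-8"))
    except UnicodeDecodeError:
        pass
    return candidates
```

So a client asked to look up UTF-8 "é" (0xC3 0xA9) on a volume advertising
ISO-8859-1 could also try the Latin-1 octet 0xE9, and vice versa, without
either form ever being rewritten on disk.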
-- Jeff