[OpenAFS-devel] AFS vs UNICODE

Roland Kuhn rkuhn@e18.physik.tu-muenchen.de
Wed, 7 May 2008 09:00:52 +0200



Hi Jeffrey,

On 6 May 2008, at 23:00, Jeffrey Altman wrote:

> Roland Kuhn wrote:
>
>> Well, certainly. But I find it very irritating that a filesystem
>> should somehow interpret and _change_ a filename based on the
>> assumption of UTF-8 encoding, even if the filename's byte sequence
>> happens to conform to the UTF-8 rules. Why bother? It's much easier
>> and much more portable to regard filenames as opaque byte sequences.
>
> If you do not canonicalize the form that is written to the file server
> directory entry there are two problems:
>
> (1) The same text in Unicode can be represented by different
> sequences of characters.  As a result you could have client A and
> client B both create a file with the same name that can not be
> visually distinguished by the end user.   Now which one do you open?
>
Well, "same" does not really mean same here. You mean that different
Unicode character sequences may have the same visual representation,
but that does not change the fact that the character sequences as well
as the byte sequences differ. The common case when opening a file is
that the user selects it from a list in a graphical user interface. If
user A creates some file and user B looks at the listing of the
directory and finds a file with the right name (meaning: the right
glyphs being displayed), then he will open it, no matter which choice
of code points has been used. The application or filesystem does
something wrong when it mangles this perfectly valid name only because
it would have used a different representation itself.
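
To make the distinction concrete, here is a small Python sketch (my own
illustration, not anything taken from the AFS code) of two names that
render with the same glyphs but differ in code points and in bytes. The
example name "für" is just an assumption for the demonstration.

```python
import unicodedata

# Two spellings of "für": precomposed U+00FC versus
# "u" followed by the combining diaeresis U+0308.
composed = "f\u00fcr"
decomposed = "fu\u0308r"

# Same glyphs on screen, but different code points and different bytes.
same_string = (composed == decomposed)
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# Only after normalizing to a common form do the two compare equal.
equal_after_nfc = (unicodedata.normalize("NFC", decomposed) == composed)

print(same_string, composed_bytes, decomposed_bytes, equal_after_nfc)
```

Whether a filesystem should apply that normalization step itself is
exactly the point under dispute here; the sketch only shows that the two
names are distinct until somebody does.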

> (2) Since the directory lookups are performed using a hash table, a
> file with the name being searched for might exist but it cannot be
> found because the input to the hash function on client B is
> different than the input used to create the entry on client A.
>
I hope the hashing at the fileserver, where it actually matters, is
done with the unmangled byte sequence used when creating the file.
That way, each and every client can find the file by the name (meaning
byte sequence) it was created with. And that can easily be found by
listing the directory contents, as is the common case for user-defined
file names nowadays, see above. Any additional hashing/caching on the
client should simply be transparent with respect to this.
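
A toy model of what I mean, in Python. This is only a sketch under the
assumption that the directory is a hash table keyed by the raw name
bytes; the names create, lookup and listing are hypothetical and do not
describe the real fileserver interface.

```python
import unicodedata

# Hypothetical in-memory "directory": raw name bytes -> file id.
directory = {}

def create(name_bytes, file_id):
    """Store the entry under exactly the bytes the client sent."""
    directory[bytes(name_bytes)] = file_id

def lookup(name_bytes):
    """Hash the bytes as given; no interpretation, no normalization."""
    return directory.get(bytes(name_bytes))

def listing():
    """Return the stored names unaltered."""
    return list(directory)

# Client A creates the file using the decomposed (NFD) spelling.
created = unicodedata.normalize("NFD", "f\u00fcr.txt").encode("utf-8")
create(created, 42)

# Client B picks the name from the listing and passes it back unchanged.
found_via_listing = lookup(listing()[0])

# Client B types the name itself and happens to produce NFC instead.
typed = unicodedata.normalize("NFC", "f\u00fcr.txt").encode("utf-8")
found_via_typing = lookup(typed)

print(found_via_listing, found_via_typing)
```

The lookup through the listing succeeds, and the typed lookup misses.
That miss is the case you describe in (2); my claim is only that the
listing path is the common one.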

> Storing file names as opaque octet sequences is broken in other
> ways. Depending on the character set used on the client the file
> name might or might not be representable since the octet sequence
> contains no indication whether the sequence is CP437, CP850, CP1252,
> ISO Latin-1, ISO-Latin-9, UTF-7, UTF-8, etc.


Since Unix semantics do not attach something like a character encoding
to file names, it is impossible for the OS or network filesystem to
determine this information reliably. So the only sane way is to store
whatever you get, return it unaltered, and leave the representation
problem to the GUI application. If you start interpreting file names,
this will inevitably lead to problems such as the very annoying fact
that I cannot store a file under a latin1-encoded name containing an
"ü" under MacOS. Chances are that the user who created the file used
an application that can deal with its own file name encoding...
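
The "ü" case can be reproduced in a few lines of Python. Again this is
just a sketch: the file name is made up, and the surrogateescape error
handler is Python's own way of carrying undecodable bytes through a
string, not something AFS or MacOS does.

```python
# A latin1-encoded name containing "ü" is the single byte 0xFC.
name = "f\u00fcr.txt".encode("latin-1")

# That byte sequence is not valid UTF-8, so a layer that insists on
# UTF-8 has to reject or rewrite it.
try:
    name.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# The bytes carry no encoding label: two plausible guesses give two
# different strings.
guesses_differ = (name.decode("latin-1") != name.decode("cp437"))

# Treating the name as opaque bytes round-trips it unaltered.
as_text = name.decode("utf-8", errors="surrogateescape")
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

print(valid_utf8, guesses_differ, round_tripped == name)
```

The middle step is your point about unlabeled octets, and I do not deny
it; the last step is mine: returning the bytes unaltered at least loses
nothing.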

Ciao,
                     Roland

--
Any society that would give up a little liberty to gain a little
security will deserve neither and lose both.  - Benjamin Franklin
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GS/CS/M/MU d-(++) s:+ a-> C+++ UL++++ P+++ L+++ E(+) W+ !N K- w--- M+ !V Y+
PGP++ t+(++) 5 R+ tv-- b+ DI++ e++++ h---- y+++
------END GEEK CODE BLOCK------



