[OpenAFS-devel] AFS vs UNICODE

Roland Kuhn rkuhn@e18.physik.tu-muenchen.de
Wed, 7 May 2008 09:27:00 +0200


Hi Jeffrey!

On 7 May 2008, at 00:49, Jeffrey Altman wrote:

> u+openafsdev-t07O@chalmers.se wrote:
>> This problem is not Unicode-specific: users can easily create file
>> names, even in plain ASCII, that are visually indistinguishable
>> (easiest with certain fonts :)
>> As soon as application software can list files and let the user pick one,
>> it is no longer a notable problem in practice.
>
> This is not true since the user interfaces on each of the operating
> systems will all represent the strings to the user as the same name.
> This is not a font issue.
>
And it is also not a filesystem issue. I agree that there is a  
problem, but I think we differ concerning the level on which it should  
be solved.

>>> (2) Since the directory lookups are performed using a hash table,  
>>> a file with the name being searched for might exist but it cannot  
>>> be found because the input to the hash function on client B is  
>>> different than the input used to create the entry on client A.
>> If the name is a byte sequence, this cannot happen; you imply that
>> the file name _is_ a character string.
>
> A file name from the perspective of the user is a character string.
> The user types in a name via the user interface and the user interface
> determines how to represent that name not the user.  If the user enters
> the name on a MacOS X system she will get a UNICODE sequence that is in
> decomposed form.  If the user enters the same name on Windows she will
> get a UNICODE sequence that is in composed form.
>
> If the user tries to access her files from both machines she will have
> interop problems.
>
I beg to differ: the representation of the file name will differ  
according to where the file was created, but accesses afterwards  
_must_ work nevertheless. Each system can read the correct  
representation from the directory to be able to open the file.
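
To make that concrete, here is a minimal Python sketch of what I mean. It is not AFS code; `find_entry` is a hypothetical client-side helper, and I assume the names involved are UTF-8. The stored bytes stay exactly as the creating system wrote them, and the client that wants to open the file compares in a normalized form and then uses whatever bytes the directory actually holds:

```python
import unicodedata

# The same visible name ("Mueller.txt" with u-umlaut) in both canonical forms.
typed = "M\u00fcller.txt"
composed = unicodedata.normalize("NFC", typed).encode("utf-8")    # composed form
decomposed = unicodedata.normalize("NFD", typed).encode("utf-8")  # decomposed form

def find_entry(entries, typed_name):
    """Hypothetical lookup: return the stored bytes of the directory entry
    whose NFC form equals the NFC form of typed_name, or None."""
    key = unicodedata.normalize("NFC", typed_name)
    for raw in entries:
        try:
            candidate = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # not UTF-8; such a name can only be matched byte for byte
        if unicodedata.normalize("NFC", candidate) == key:
            return raw
    return None

# A directory whose entry was created in decomposed form.
directory = [b"README", decomposed]

print(composed == decomposed)                       # the byte sequences differ
print(find_entry(directory, typed) == decomposed)   # but the entry is still found
```

The point of the sketch is only that the comparison happens on the client, above the filesystem; the directory itself never has to rewrite a name.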

>> (Of course, applications do read user input as text - to create new files,
>> but most often not for opening existing files.)
>> Compatibility in file naming (a name saved on one occasion should be
>> readable on another, possibly on another computer and by another program)
>> belongs at the application level. File naming compatibility does not differ
>> essentially from compatibility of file contents.
>
> We already have evidence to the contrary.
>
Well, there are broken operating systems as well as broken  
applications. Let's not add broken filesystems to that list.

>>> Storing file names as opaque octet sequences is broken in other
>>> ways. Depending on the character set used on the client the file
>>> name might or might not be representable since the octet sequence
>>> contains no indication whether the sequence is CP437, CP850,
>>> CP1252, ISO Latin-1, ISO Latin-9, UTF-7, UTF-8, etc.
>> This is just the result of broken practices - using limited and thus
>> incompatible encodings ultimately leads to breakage and no efforts
>> can eliminate the pain afterwards.
>
> Correct.  But with Unicode we do have the ability to eliminate the
> problems associated with (a) no normalization; (b) decomposed  
> normalization; and (c) composed normalization.
>
How do you know you're dealing with Unicode in the first place?  
Imagine a Latin-1 file name which incidentally does not violate the  
UTF-8 rules, but happens not to be normalized. Normalizing it will  
simply destroy it.
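
A small Python sketch of that failure, under the assumption that some layer blindly treats every valid UTF-8 sequence as Unicode and normalizes it to NFC (`normalize_name` is a hypothetical stand-in for such a layer, not an existing AFS function). The two Latin-1 bytes 0xCD 0xBE (capital I with acute, then the three-quarters sign) happen to be the valid UTF-8 encoding of U+037E GREEK QUESTION MARK, whose canonical form is a plain semicolon:

```python
import unicodedata

def normalize_name(raw: bytes, form: str = "NFC") -> bytes:
    """Hypothetical normalizing layer: if the bytes decode as UTF-8,
    normalize them; otherwise pass them through unchanged."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw
    return unicodedata.normalize(form, text).encode("utf-8")

# A two-character Latin-1 name that is, by accident, valid UTF-8.
latin1_name = b"\xcd\xbe"
stored = normalize_name(latin1_name)

print(latin1_name.decode("latin-1"))  # what the Latin-1 user created
print(stored.decode("latin-1"))       # what the same user sees afterwards
print(len(latin1_name), len(stored))  # the name even changed length
```

The Latin-1 client created a two-character name and gets back a single semicolon, and nothing in the byte sequence could have told the normalizing layer that it was not looking at UTF-8.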

>> The same file can be opened by two processes running with different
>> locales, on the same computer and even at the same time.
>> There is hardly any information about file name encoding in an open()
>> system call. How does the file system know which encoding is used by
>> a particular process for a particular open()?
>
> There is no knowledge at the open() or CreateFile() level.  There is
> extensive knowledge at the user interface level.
>
Exactly. So that is the place where this problem is to be solved.

Ciao,
                     Roland

--
Any society that would give up a little liberty to gain a little
security will deserve neither and lose both.  - Benjamin Franklin
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GS/CS/M/MU d-(++) s:+ a-> C+++ UL++++ P+++ L+++ E(+) W+ !N K- w--- M+ ! 
V Y+
PGP++ t+(++) 5 R+ tv-- b+ DI++ e++++ h---- y+++
------END GEEK CODE BLOCK------