[OpenAFS-devel] Re: Help needed in developing disconnected operation for OpenAFS on Windows

Simon Wilkinson OpenAFS Devel <openafs-devel@openafs.org>
Tue, 4 Mar 2008 12:03:44 +0000


I'm in the process of updating the Michigan disconnected operation
code for the Unix tree, so here are my thoughts on what I'm doing
there. Bear in mind that none of this has been accepted into the tree
yet! Sorry for polluting the Windows list with Unix comments.

(I've set followups to openafs-devel)

Jeffrey Altman wrote:

> Disconnected operations should not be a global setting.  That is
> acceptable for a research project that demonstrates the capability,
> but it is not acceptable for real-world environments in which some
> servers or cells may not be accessible while others remain
> accessible.

I guess this depends on what you're trying to achieve through  
providing disconnected operation, and the quality of the user  
experience you can provide when performing re-integration. Looking at  
other disconnected systems, one of the usability challenges of Coda  
is that clients can go disconnected without the user's knowledge, and  
so the user can end up having to resolve integration conflicts which  
aren't of their making, and which they were completely unaware of.  
This tends to score badly for usability, as it violates some of the  
user's fundamental assumptions. Providing a system which requires an  
explicit 'go disconnected' step has the advantage that the user is  
aware both of when they disconnected from the network, and when they  
reconnected. This allows them to rationalise any conflict resolution  
steps that they have to perform.

That's not to say that 'opportunistic' disconnection (as I'm  
christening the solution you outline - where the cache manager  
continues to serve files for which it had a valid callback when the  
file server disappeared, without any user interaction) doesn't have  
real uses - I just think that the usability challenges are far higher.
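
To make that concrete, here is a minimal sketch, in C since that's
the language the cache manager is written in, of the decision an
opportunistically disconnected client would have to make on every
access. The structure and function names are invented for the
example; the real client keeps this state in much richer structures.

    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    /* Purely illustrative stand-in for a cache entry; the real client
     * keeps far more state (struct vcache on Unix, cm_scache_t on
     * Windows). */
    struct cache_entry {
        time_t cb_expires;   /* when the callback promise lapses        */
        bool   have_data;    /* requested data is resident in the cache */
    };

    /* Opportunistic disconnection: with the fileserver unreachable,
     * keep serving any file for which we still hold an unexpired
     * callback and for which the data is actually cached. */
    static bool can_serve_offline(const struct cache_entry *e)
    {
        return e->have_data && time(NULL) < e->cb_expires;
    }

    int main(void)
    {
        struct cache_entry e = { time(NULL) + 3600, true };
        printf("serve from cache while disconnected? %s\n",
               can_serve_offline(&e) ? "yes" : "no");
        return 0;
    }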

> (1) how do you ensure that you have all of the data for all of the
> files and directories that the user wishes to access in the cache?
> AFS caches arbitrary blocks, not whole files or directories.

I'll add to this:

1a) How do you ensure that the data you have in the cache is
sufficiently recent to be of use to the client?

The naive mechanism, as implemented by the Michigan code, just serves
whatever happens to be in the cache back to the user. The problem is
that, depending on the size of your cache against your normal working
set, it's possible that you might get files that are months out of
date. The normal AFS way of resolving this is to hold callbacks for
these files - you could extend this to disconnected operation by
adding a 'pinning' functionality, where a user indicates to the cache
manager that they want a particular file to be available offline, and
the cache manager should ensure that it's always up to date on the
client. However, if you attempt to hold callbacks for every file in a
user's offline set, then you're likely to cause severe callback
storms with the fileserver (multiple clients hold more than the
fileserver's maximum number of callbacks - the fileserver starts
breaking older callbacks, clients see the callback breaks and attempt
to update their pinned files, the fileserver creates new callbacks
for these, and round and round we go).

The question of how we ensure acceptable recency, without making  
fileserver changes, is a tricky one.
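
For what it's worth, here is a rough sketch of the naive 'pin and
refresh' idea and of why it feeds the storm. Every name in it is
invented for the example, and the refetch RPC is stubbed out:

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical per-file pin state; nothing here matches the real
     * cache manager structures. */
    struct pinned_file {
        bool cb_valid;     /* do we currently hold a callback?        */
        bool data_cached;  /* is the file resident in the local cache? */
    };

    /* Stand-in for a FetchData/FetchStatus-style RPC that refetches
     * the file and leaves a fresh callback registered. */
    static void fetch_and_register_callback(struct pinned_file *f)
    {
        f->cb_valid = true;
        f->data_cached = true;
    }

    /* Naive refresh pass, run whenever a callback break arrives:
     * refetch every pinned file so it stays current for offline use.
     * With many clients each pinning many files, this is exactly the
     * feedback loop described above: the fileserver sheds old
     * callbacks, clients refetch and acquire new ones, and the cycle
     * repeats. */
    static void refresh_pinned(struct pinned_file *pins, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (!pins[i].cb_valid || !pins[i].data_cached)
                fetch_and_register_callback(&pins[i]);
        }
    }

    int main(void)
    {
        struct pinned_file pins[2] = { { false, false }, { true, true } };
        refresh_pinned(pins, 2);
        return 0;
    }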

> (2) how do you synchronize read and write locks when the file  
> server is
> not accessible?

It's relatively easy to maintain a list of the locks granted by the
cache manager whilst in disconnected mode, and you can ensure that
the locking protects processes running on the same machine from each
other. The issue is what you do when reconnecting. The cache manager
plays the list of locally granted locks to the fileserver, and all is
well if it grants them. But what happens if the fileserver refuses a
lock? You can't recall locks which have already been issued, so you
can have a situation where there's a process happily writing to a
file, under what it believes is a write lock, whilst it actually has
no lock at all on the server. As I see it, there are three options:
1) ignore the problem; 2) fail reads and writes to that file
descriptor as soon as the lock fails; 3) 'defer' reintegration of
that file until it is closed, and deal with the problem then.

This is a much bigger issue on Windows than Unix, though.
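
Purely as a sketch, option (2) might look something like the
following; the structures and the SetLock-style RPC are invented for
illustration rather than taken from the tree:

    #include <stdbool.h>
    #include <stddef.h>

    enum lock_kind { LOCK_READ, LOCK_WRITE };

    /* Hypothetical record of a lock granted locally while disconnected. */
    struct local_lock {
        int            fid_index;  /* stand-in for the AFS file ID       */
        enum lock_kind kind;
        bool           invalid;    /* set if the fileserver refuses it   */
    };

    /* Stub for the SetLock-style RPC issued during reintegration. */
    static bool server_grants_lock(int fid_index, enum lock_kind kind)
    {
        (void)fid_index;
        (void)kind;
        return true;   /* pretend the fileserver accepts everything */
    }

    /* Replay locally granted locks at reconnection.  When the
     * fileserver refuses one, take option (2): mark the lock so that
     * later reads and writes through that descriptor can be failed,
     * rather than letting the process carry on writing under a lock
     * it no longer holds. */
    static void replay_locks(struct local_lock *locks, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (!server_grants_lock(locks[i].fid_index, locks[i].kind))
                locks[i].invalid = true;
        }
    }

    int main(void)
    {
        struct local_lock locks[1] = { { 0, LOCK_WRITE, false } };
        replay_locks(locks, 1);
        return 0;
    }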

> (3) how do you interact with the end user to notify them of collisions
> and what do you do when there are collisions?

I'm currently implementing a collision resolution policy of "last
closer wins". Whilst this does have the potential to cause
significant data loss, it has the big advantage over more complex
resolution policies that it's explainable to, and understandable by,
the user. At the moment, collisions get logged in the system log. It
would be possible to take advantage of some of the new desktop
technologies appearing for Unix to get those messages closer to the
user (although, on multi-user machines, desktop-based notifications
break down).
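
Very roughly, the reintegration-time check could be as simple as the
sketch below; the data version is the only real AFS concept in it,
everything else (names, path) is invented for the example:

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical replay record: the data version the file had when
     * we cached it, and the version the fileserver reports at
     * reintegration time. */
    struct replay_item {
        const char *path;
        uint64_t    cached_dv;
        uint64_t    server_dv;
    };

    /* "Last closer wins": always store the locally modified data,
     * even if the file changed on the server while we were
     * disconnected; the collision is only logged. */
    static void reintegrate_store(const struct replay_item *it)
    {
        if (it->server_dv != it->cached_dv)
            fprintf(stderr,
                    "collision on %s: server dv %llu vs cached dv %llu; "
                    "local copy wins\n",
                    it->path,
                    (unsigned long long)it->server_dv,
                    (unsigned long long)it->cached_dv);

        /* ... issue the StoreData-style RPC with the local contents ... */
    }

    int main(void)
    {
        struct replay_item it = { "/afs/example.org/user/report.txt", 4, 7 };
        reintegrate_store(&it);
        return 0;
    }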

> (5) how do you address access control issues for files that are  
> offline?

The Michigan code simply disables access control when a machine goes  
offline. With the Unix model, this is more acceptable - machines only  
go offline with an explicit command, which can only be issued by the  
super user. The super user has access to the cache contents, anyway.  
However, this doesn't help with people who have implemented access  
controls to protect themselves from silly mistakes.

I've got a provisional implementation of 'local' tokens which can be
used to convey CPS information from userland to the cache manager,
but which won't be usable in a connected environment. My eventual
plan is that it's possible to 'stash' access data for a particular
userid to a file, from where it can be reloaded while the cache
manager is offline.
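Purely to illustrate the idea, a stash file could be as simple as the
user's vice ID followed by their group IDs, written out while still
connected and read back for local permission checks once offline. The
format and the function names below are invented for this sketch and
bear no relation to the provisional implementation:

    #include <stdio.h>
    #include <stdint.h>

    /* Write a user's access data (vice ID plus group IDs) to a stash
     * file, one number per line.  Format invented for this example. */
    int stash_cps(const char *path, int32_t viceid,
                  const int32_t *groups, int ngroups)
    {
        FILE *fp = fopen(path, "w");
        if (!fp)
            return -1;
        fprintf(fp, "%d\n", (int)viceid);
        for (int i = 0; i < ngroups; i++)
            fprintf(fp, "%d\n", (int)groups[i]);
        return fclose(fp);
    }

    /* Read the stash back while the cache manager is offline; returns
     * the number of groups recovered, for use in local ACL checks. */
    int load_cps(const char *path, int32_t *viceid,
                 int32_t *groups, int maxgroups)
    {
        FILE *fp = fopen(path, "r");
        int n = 0, v;
        if (!fp)
            return -1;
        if (fscanf(fp, "%d", &v) != 1) {
            fclose(fp);
            return -1;
        }
        *viceid = (int32_t)v;
        while (n < maxgroups && fscanf(fp, "%d", &v) == 1)
            groups[n++] = (int32_t)v;
        fclose(fp);
        return n;
    }

However, as soon as you start using these stashed identities you run
into ...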

> (6) how do you ensure that the files are synchronized back to the
> file server with the same user credentials that were intended to be
> used when the files were modified?

This is tricky. I don't (yet) have a good answer to this one. At the
moment, all replays have to come from a single identity (and their
token had better be valid when reintegration starts).

Cheers,

Simon.