[OpenAFS-win32-devel] Re: Help needed in developing disconnected operation for OpenAFS on windows
Simon Wilkinson
OpenAFS Devel <openafs-devel@openafs.org>
Tue, 4 Mar 2008 12:03:44 +0000
I'm in the process of updating the Michigan disconnected operation
code for the Unix tree, so here are my thoughts on what I'm doing
there. Bear in mind that none of this has been accepted into the tree
yet! Sorry for polluting the Windows list with Unix comments
(I've set followups to openafs-devel).
Jeffrey Altman wrote:
> Disconnected operations should not be a globally setting. That is
> acceptable for a research project that demonstrates the capability but
> it is not acceptable for real world environments in which some servers
> or cells may not be accessible while others remain accessible.
I guess this depends on what you're trying to achieve through
providing disconnected operation, and the quality of the user
experience you can provide when performing re-integration. Looking at
other disconnected systems, one of the usability challenges of Coda
is that clients can go disconnected without the user's knowledge, and
so the user can end up having to resolve integration conflicts which
aren't of their making, and which they were completely unaware of.
This tends to score badly for usability, as it violates some of the
user's fundamental assumptions. Providing a system which requires an
explicit 'go disconnected' step has the advantage that the user is
aware both of when they disconnected from the network, and when they
reconnected. This allows them to rationalise any conflict resolution
steps that they have to perform.
That's not to say that 'opportunistic' disconnection (as I'm
christening the solution you outline - where the cache manager
continues to serve files for which it had a valid callback when the
file server disappeared, without any user interaction) doesn't have
real uses - I just think that the usability challenges are far higher.
> (1) how do you ensure that you have all of the data for all of the
> files
> and directories that the user wishes to access in the cache? AFS
> caches arbitrary blocks not whole files or directories.
I'll add to this:
1a) How do you ensure that the data you have in the cache is
sufficiently recent to be of use to the client
The naive mechanism, as implemented by the Michigan code, just serves
whatever happens to be in the cache back to the user. The problem is
that, depending on the size of your cache against your normal working
set, it's possible that you might get files that are months out of
date. The normal AFS way of resolving this is to hold callbacks for
these files - you could extend this to disconnected operation by
adding a 'pinning' functionality, where a user indicates to the cache
manager that they want a particular file to be available offline, and
the cache manager should ensure that it's always up to date on the
client. However, if you attempt to hold callbacks for every file in a
user's offline set, then you're likely to cause severe callback storms
with the fileserver (multiple clients hold more than the fileserver's
maximum number of callbacks - fileserver starts breaking older
callbacks, clients see callback breaks and attempt to update pinned
files, fileserver creates new callbacks for these, and round and
round we go).
The question of how we ensure acceptable recency, without making
fileserver changes, is a tricky one.
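To make the pinning idea concrete, here's a rough Python sketch of one way a cache manager could refresh a pinned offline set by polling data versions rather than holding a callback per file, which sidesteps the storm described above. The names (PinnedEntry, fetch_status, fetch_data) are purely illustrative, not real OpenAFS interfaces:

```python
# Illustrative sketch only: refresh a pinned offline set by comparing
# data versions, instead of asking the fileserver to hold a callback
# for every pinned file. Not real OpenAFS code.

from dataclasses import dataclass

@dataclass
class PinnedEntry:
    fid: str              # file identifier
    cached_version: int   # data version of the copy in the local cache

def refresh_pinned(pinned, fetch_status, fetch_data):
    """Bring pinned files up to date with a bounded amount of work.

    fetch_status(fid) -> current data version on the fileserver
    fetch_data(fid)   -> pull the file's contents into the cache
    """
    refreshed = []
    for entry in pinned:
        server_version = fetch_status(entry.fid)
        if server_version != entry.cached_version:
            fetch_data(entry.fid)             # only stale entries transfer data
            entry.cached_version = server_version
            refreshed.append(entry.fid)
    return refreshed
```

The point of the sketch is that status checks are cheap and stateless on the server side, so the fileserver never has to track the client's whole offline set - but you lose the promptness guarantee that callbacks give you, which is exactly the recency trade-off above.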
> (2) how do you synchronize read and write locks when the file
> server is
> not accessible?
It's relatively easy to maintain a list of the locks granted by the
cache manager whilst in disconnected mode, and you can ensure that
the locking protects processes running on the same machine from each
other. The issue is what you do when reconnecting. The cache manager
plays the list of locally granted locks to the fileserver, and all is
well if it grants them. But what happens if the fileserver refuses a
lock? You can't recall locks which have already been issued, so you
can have a situation where there's a process happily writing to a
file, under what it believes is a write lock, whilst it actually has
no lock at all on the server. As I see it, there are three options: 1)
Ignore the problem; 2) Fail reads and writes to that file descriptor
as soon as the lock fails; 3) 'Defer' reintegration of that file
until it is closed, and deal with the problem then.
This is a much bigger issue on Windows than Unix, though.
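For what it's worth, option 2 is easy to sketch. The following is a toy Python illustration (grant_lock stands in for the fileserver lock RPC; none of these names are real OpenAFS calls) of replaying locally granted locks and recording which ones the server refused, so the cache manager can fail subsequent I/O on those files:

```python
# Toy sketch of lock replay at reconnection time (option 2 above:
# fail further reads/writes on any file whose lock the fileserver
# refuses). grant_lock is a stand-in for the fileserver RPC.

def replay_locks(local_locks, grant_lock):
    """local_locks: list of (fid, mode) pairs granted while disconnected.

    Returns the set of fids whose replay failed; the cache manager
    would mark these so that subsequent reads and writes on the
    corresponding descriptors return an error.
    """
    failed = set()
    for fid, mode in local_locks:
        if not grant_lock(fid, mode):
            failed.add(fid)
    return failed
```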
> (3) how do you interact with the end user to notify them of collisions
> and what do you do when there are collisions?
I'm currently implementing a collision resolution policy of "last
closer wins". Whilst this does have the potential to cause
significant data loss, it has the big advantage over more complex
resolution policies that it's explainable to, and understandable by,
the user. At the moment collisions get logged in the system log. It
would be possible to take advantage of some of the new desktop
technologies appearing for Unix to get those messages closer to the
user (although, on multi-user machines, desktop based notifications
break down).
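The policy itself is trivial, which is rather the point. A rough sketch, with store_data and log as placeholders for the store RPC and the system log:

```python
# Rough sketch of a "last closer wins" reintegration step. If the
# server's data version has moved since we disconnected, someone else
# wrote the file; our close is later, so our copy overwrites theirs
# and the collision gets logged. store_data/log are placeholders.

def reintegrate(fid, local_data, base_version, server_version,
                store_data, log):
    if server_version != base_version:
        # Collision: the file changed on the server while we were away.
        log("collision on %s: local close overwrites server version %d "
            "(was %d at disconnect)" % (fid, server_version, base_version))
    store_data(fid, local_data)   # last closer wins, unconditionally
```

Note that the overwrite happens whether or not there was a collision - the logging is the only concession to the losing writer, which is exactly why the policy is explainable but potentially lossy.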
> (5) how do you address access control issues for files that are
> offline?
The Michigan code simply disables access control when a machine goes
offline. With the Unix model, this is more acceptable - machines only
go offline with an explicit command, which can only be issued by the
super user. The super user has access to the cache contents, anyway.
However, this doesn't help with people who have implemented access
controls to protect themselves from silly mistakes.
I've got a provisional implementation of 'local' tokens which can be
used to convey CPS information from the userland to the cache
manager, but won't be usable in a connected environment. My eventual
plan is that it's possible to 'stash' access data for a particular
userid to a file, from where it can be reloaded while the cache
manager is offline. However, as soon as you start using these you run
in to ...
> (6) how do you ensure that the files are synchronized back to the
> file server
> with the same user credentials that were intended to be used when the
> files were modified?
This is tricky. I don't (yet) have a good answer to this one. At the
moment, all replays have to come from a single identity (and their
token had better be valid when reintegration starts)
Cheers,
Simon.