[OpenAFS] Need details of callback mechanism -- questions ...
Jeffrey Hutzelman
jhutz@cmu.edu
Thu, 01 Sep 2005 20:39:32 -0400
On Thursday, September 01, 2005 10:27:46 AM -0600 Dexter 'Kim' Kimball
<dhk@ccre.com> wrote:
> If the version numbers are the same then the
> client updates the callback state and no data transfer occurs. OTOH if
> the file has been changed and the version numbers are different, the
> client receives data and then updates the callback state.
That's approximately correct. For every vnode it knows about, the client
maintains information indicating whether it has current metadata for that
vnode and, if so, for how long. A callback granted by the fileserver (for
example, as part of a FetchStatus operation) extends the "for how long"
timer; a callback broken causes the metadata to be invalidated immediately.
Whenever the client wants to use a vnode, it checks to see whether it has
current, valid metadata. If not, it does a new FetchStatus call to update
its copy of the metadata; this also gets it a new callback.
If the operation involves reading or writing _data_, then the client must
ensure it has an up-to-date copy of the data chunk it is working with. To
facilitate this, part of the metadata is a "data version" number, and each
cache chunk is tagged with the DV of the file whose contents it contains.
If the DV on a chunk does not match that in the file's metadata (which we
previously validated), then the client needs to fetch a new version of the
part it cares about.
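
As a rough illustration of the two-step check just described, here is a
minimal Python sketch. It is not the cache manager's actual code (which is
C); the class and method names (FakeServer, Client, fetch_status,
fetch_data) are hypothetical, the file is modelled as a single chunk, and
status and data fetches are kept as separate calls for clarity.

```python
from dataclasses import dataclass


@dataclass
class VnodeStatus:
    data_version: int        # the "DV" carried in the file's metadata
    callback_expires: float  # time until which the metadata may be trusted


@dataclass
class Chunk:
    data_version: int        # DV of the file contents this chunk holds
    data: bytes


class FakeServer:
    """Hypothetical stand-in for a fileserver holding one file."""

    def __init__(self, data, dv=1, callback_lifetime=3600.0):
        self.data = data
        self.dv = dv
        self.callback_lifetime = callback_lifetime
        self.fetch_status_calls = 0
        self.fetch_data_calls = 0

    def fetch_status(self, now):
        # Returning status also grants a fresh callback.
        self.fetch_status_calls += 1
        return VnodeStatus(self.dv, now + self.callback_lifetime)

    def fetch_data(self):
        self.fetch_data_calls += 1
        return Chunk(self.dv, self.data)


class Client:
    def __init__(self, server):
        self.server = server
        self.status = None   # None means "no current metadata"
        self.chunk = None

    def callback_broken(self):
        # A broken callback invalidates the metadata immediately.
        self.status = None

    def read(self, now):
        # Step 1: make sure the metadata is current and still covered
        # by a callback; if not, refetch it.
        if self.status is None or now >= self.status.callback_expires:
            self.status = self.server.fetch_status(now)
        # Step 2: transfer data only if the cached chunk's DV is stale.
        if (self.chunk is None
                or self.chunk.data_version != self.status.data_version):
            self.chunk = self.server.fetch_data()
        return self.chunk.data
```

In this model an expired callback with an unchanged DV costs one status
fetch and no data transfer, which is the case the question describes.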
> I'm looking for definitive answers to the following. Assume RW volumes
> throughout.
>
> 1. When a fileserver sends a BCB to a given client, does it wait for a
> response or does it send the BCB and handle responses asynchronously? I
> believe it used to wait for a response and that it no longer does so.
On a normal operation, the fileserver will attempt to break callbacks
synchronously; that means it waits for each client which has a callback on
the file being updated. The calls are sent out simultaneously (using
rx_multi), so you only have to wait for a single call timeout no matter how
many clients are holding callbacks. And, the fileserver will not waste
time trying to break a callback on a client it already knows is "down";
instead, it will add it to the delay queue.
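
To make the timing argument concrete, here is a small Python model of one
round of parallel callback breaks. It does not use Rx at all; the function
name, its arguments, and the idea of passing each client's response time as
a number are assumptions made purely for illustration.

```python
def break_callbacks(holders, down, delay_queue, fid, timeout=1.0):
    """Model one parallel round of callback breaks for file `fid`.

    holders:     dict mapping client name -> seconds until it answers,
                 or None if it never answers
    down:        set of clients already believed to be "down"
    delay_queue: dict mapping client name -> list of delayed fids
    Returns the simulated time the fileserver spends waiting.
    """
    waited = 0.0
    for client, response_time in holders.items():
        if client in down:
            # Known-down clients cost no waiting; the break is queued.
            delay_queue.setdefault(client, []).append(fid)
            continue
        if response_time is None or response_time > timeout:
            # The call timed out: mark the client down and queue the break.
            down.add(client)
            delay_queue.setdefault(client, []).append(fid)
            waited = max(waited, timeout)
        else:
            waited = max(waited, response_time)
    # The calls run in parallel, so the wait is the slowest call,
    # not the sum of all calls.
    return waited
```

With three unresponsive clients and a 5 second timeout, this model waits 5
seconds rather than 15, and a later break involving the same clients does
not wait on them at all.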
On an operation such as a volume restore, the fileserver must break all
callbacks held by any client on that volume. In 1.2, this is done
synchronously in the fssync thread, which means the volserver has to wait
for it. In 1.4, it will be done in a dedicated "callbacks later" thread,
allowing the fileserver to respond to the fssync request immediately.
> 2. When does the fileserver begin sending the BCBs?
> a. When it begins to modify a given file -- i.e when it receives the
> write RPC and before (or simultaneously with) storing the first few bytes.
> b. When it has written the first bytes to a given file -- i.e. after
> it has stored x bytes but before receiving a "close" from the client.
> c. When it receives the close file RPC.
There is no such thing as a "close file" RPC; the fileserver is essentially
stateless and does not know which files are open on a client. On directory
operations, the callback on the directory is broken once the change has
been made. For file operations, the callback break happens after the vnode
in question has been locked, but before the file is actually updated. This
means that any clients which try to access the file after the write begins
are guaranteed to see the new version, because their cached metadata
will be invalid, and the FetchStatus they do to update it will block until
the store completes and releases the lock on that vnode.
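
The ordering for file stores (lock the vnode, break callbacks, update the
file, release the lock) can be sketched in Python as follows. This is a toy
model using a threading.Lock, not fileserver code; the Vnode class and its
method names are hypothetical.

```python
import threading


class Vnode:
    def __init__(self, data):
        self.lock = threading.Lock()
        self.data = data
        self.data_version = 1
        self.callback_holders = []  # callables invoked to break a callback

    def fetch_status(self):
        # Blocks for as long as a store holds the vnode lock.
        with self.lock:
            return self.data_version

    def store(self, new_data, log):
        with self.lock:
            # Callbacks are broken after the vnode is locked but before
            # the file is actually updated.
            holders, self.callback_holders = self.callback_holders, []
            for break_callback in holders:
                break_callback()
            self.data = new_data
            self.data_version += 1
            log.append("store complete")
        # Leaving the "with" block releases the lock, which unblocks
        # any fetch_status call that was waiting.
```

A client that reacts to the break by immediately calling fetch_status() is
held at the lock until the store finishes, so the status it gets back
already carries the new data version.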
> 3. If the fileserver attempts a BCB to client X and gets no response (BCB
> fails on X), does it:
> a. Retry immediately.
> b. Wait some period of time before attempting the BCB again.
> c. (a) then (b)
The fileserver breaks callbacks by making a normal RPC (except that, as
described above, when multiple clients are involved, the RPCs are made in
parallel). If this operation times out, the client host is marked as
"down", and the callback is added to the delay queue. Further callback
breaks for this client will be shunted directly to the delay queue, until
we hear from it again. Once a callback is on the delay queue, the
fileserver will not attempt to break it again until it believes the client
is "up".
Once a client is marked "down", the fileserver will not waste any time
trying to communicate with that client until it hears from it again. The
next time that client makes an RPC, the fileserver will immediately break
any delayed callbacks it has queued for that host, before it processes the
new RPC. This ensures that the host is now "up to date", and that if the
vnode on which it is making an RPC is one for which a callback was broken,
the client will process the callback break _before_ recording a new
callback as a result of the new RPC.
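
The "down" marking and delay-queue handling above can be modelled in a few
lines of Python. The Host and FileServer classes below are hypothetical and
greatly simplified: a failed RPC is represented by a `reachable` flag
instead of a real timeout, and the sketch does not cover a delayed break
failing again during the flush.

```python
class Host:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable  # does a break RPC get an answer?
        self.up = True              # the fileserver's belief about the host
        self.delayed = []           # delay queue: fids with pending breaks
        self.received = []          # events the client has seen, in order


class FileServer:
    def break_callback(self, host, fid):
        if not host.up:
            # Known-down host: shunt straight to the delay queue.
            host.delayed.append(fid)
        elif host.reachable:
            host.received.append(("break", fid))
        else:
            # The RPC timed out: mark the host down and queue the break.
            host.up = False
            host.delayed.append(fid)

    def handle_rpc(self, host, fid):
        # We have heard from the host again, so it is "up".
        host.up = True
        # Deliver delayed breaks BEFORE processing the new RPC, so the
        # client sees the break before it records a new callback.
        for delayed_fid in host.delayed:
            host.received.append(("break", delayed_fid))
        host.delayed.clear()
        host.received.append(("new callback", fid))
```

If a host misses breaks for f1 and f2 and later makes an RPC on f1, the
client in this model sees both breaks first and only then the new callback
on f1.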
Periodically, the fileserver does a sweep of all "up" clients it knows
about. Any client which is holding active callbacks but has not been heard
from in 15 minutes is probed, to verify that it still exists. If the probe
fails, the fileserver marks the client as "down", just as if a callback
break had failed (except there is nothing to add to the delay queue). This
allows the fileserver to proactively discover "down" clients, instead of
waiting to time them out when it is trying to break a callback.
In addition, "up" clients which have not been heard from in over two hours,
whether or not they have outstanding callbacks, are deleted from the
fileserver's client list. If such a client is up, the fileserver
immediately makes an InitCallBackState RPC, instructing the client to
discard any callbacks it is holding from that fileserver. If the client is
not up, the RPC is skipped; if/when that client ever makes a call again,
the fileserver will note it has no record of the client and will make the
InitCallBackState call at that time.
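
A sketch of the two periodic checks, in Python. The host table layout and
the `probe` and `init_callback_state` callables are hypothetical stand-ins
for the real RPCs. The sketch also assumes that any host idle for more than
two hours is dropped whether it is "up" or "down", and that a successful
probe does not by itself count as hearing from the host.

```python
PROBE_AFTER = 15 * 60        # seconds of silence before probing
EXPIRE_AFTER = 2 * 60 * 60   # seconds of silence before forgetting a host


def sweep(hosts, now, probe, init_callback_state):
    """One pass over the host table.

    hosts: dict mapping name -> {"last_heard": seconds, "up": bool,
                                 "has_callbacks": bool}
    probe(name) returns True if the host answered.
    init_callback_state(name) tells the host to discard its callbacks.
    """
    for name, host in list(hosts.items()):
        idle = now - host["last_heard"]
        if idle > EXPIRE_AFTER:
            # Forget the host; only bother with the RPC if it is "up".
            if host["up"]:
                init_callback_state(name)
            del hosts[name]
        elif host["up"] and host["has_callbacks"] and idle >= PROBE_AFTER:
            # Proactively discover dead clients that hold callbacks.
            if not probe(name):
                host["up"] = False
```

A host that was dropped this way is simply absent from the table, so its
next RPC is what triggers the InitCallBackState call described above.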
> 4. What is the current fileserver BCB retry scheme?
There is no retry scheme. Once a callback break fails, the fileserver
discontinues all attempts to contact that client unless and until the
client makes another RPC.
-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
Sr. Research Systems Programmer
School of Computer Science - Research Computing Facility
Carnegie Mellon University - Pittsburgh, PA