[OpenAFS-devel] Re: sanity check: client vldb cache time

Tue, 28 Jan 2014 10:01:10 -0500

On Fri, 2014-01-24 at 16:02 -0600, Andrew Deason wrote:

> Waiting for unreachable clients is not so bad if we can multi-call them
> in large batches. There was an improvement to rx_multi on master that
> (at least allegedly) makes it feasible to do this in much larger
> batches, possibly on the order of every single client host at once (or
> at least ~10k, something like that). There's still some delay there, but
> I think that's up to the administrator; the delay may be okay for you.

A delay equivalent to how long it takes to time out an Rx connection is
fine.  A delay that lasts a substantial fraction of the 30 minutes the
bosserver will wait before sending SIGKILL is not fine.
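
To put rough numbers on that, here is a back-of-the-envelope sketch.  It
assumes batches are multi-called one after another and that each batch
waits at most one Rx timeout for its unreachable clients.  The function
name and the timeout parameter are hypothetical, not anything in the
OpenAFS tree; only the 30-minute figure comes from the discussion above.

```python
import math

# The 30 minutes the bosserver waits before sending SIGKILL (from this thread).
BOS_SHUTDOWN_LIMIT_S = 30 * 60


def worst_case_notify_delay(n_clients, batch_size, rx_timeout_s):
    """Worst-case seconds spent notifying clients at shutdown.

    Assumption: batches run sequentially and every batch contains at
    least one unreachable client, so each batch costs one full timeout.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if n_clients <= 0:
        return 0
    batches = math.ceil(n_clients / batch_size)
    return batches * rx_timeout_s


def fraction_of_shutdown_budget(delay_s):
    """Share of the bosserver's 30-minute window that the delay consumes."""
    return delay_s / BOS_SHUTDOWN_LIMIT_S
```

With an illustrative 60-second timeout, 10,000 clients in a single batch
cost about 60 seconds in the worst case, while batches of 100 cost 6000
seconds, which is more than three times the 1800-second window.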

> As for client hosts the fileserver hasn't forgotten about, that does
> take 2 hours (excepting rarer cases like where the fileserver runs out
> of callback space). If the client is capped at ~2 hours after last
> contacting the fileserver for caching a VLDB entry, that seems like we'd
> catch almost every client.

Hm.  Yes, if the clients actually do expire entries, then notification
at shutdown is really only needed to inform those clients for whom the
volcache entry would still be valid post-shutdown.  So missing the older
ones seems fine.
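
As a minimal sketch of that selection, assuming the fileserver could
somehow know (or estimate) when each client's volcache entry expires.
The mapping and the function name are hypothetical, purely for
illustration.

```python
def clients_to_notify(volcache_expiry_by_client, shutdown_time):
    """Return the clients whose volcache entry outlives the shutdown.

    volcache_expiry_by_client maps a client identifier to the time
    (seconds, any consistent clock) at which its cached VLDB entry
    expires.  Clients whose entry has already expired by shutdown_time
    will look the volume up again anyway, so they are skipped.
    """
    return sorted(
        client
        for client, expiry in volcache_expiry_by_client.items()
        if expiry > shutdown_time
    )
```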

In fact, the right thing here is probably for the expiration time of the
volcache entry to be bounded by the latest expiration time of any
callback issued on that volume.  So if the volcache entry is not
expired, we have an outstanding callback, and the fileserver _can't_
just forget about that client (at a minimum, it has to break the
callbacks).
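
Roughly, the bound would look something like the sketch below.  The
names are hypothetical and this is not how the cache manager is
currently written.  It also assumes that an entry with no outstanding
callbacks is treated as already expired, which is what keeps the
"not expired implies outstanding callback" property true.

```python
def volcache_expiry(fetched_at, max_lifetime_s, callback_expiries):
    """Expiration time of a volcache entry under the proposed bound.

    fetched_at        -- when the VLDB entry was cached (seconds)
    max_lifetime_s    -- the ordinary cap on entry lifetime (e.g. ~2 hours)
    callback_expiries -- expiration times of callbacks held on the volume

    The entry never outlives the latest callback on the volume.
    Assumption: with no callbacks at all, it counts as expired at once.
    """
    if not callback_expiries:
        return fetched_at
    return min(fetched_at + max_lifetime_s, max(callback_expiries))


def volcache_entry_valid(now, fetched_at, max_lifetime_s, callback_expiries):
    """True while the entry may still be used without a fresh VLDB lookup."""
    return now < volcache_expiry(fetched_at, max_lifetime_s, callback_expiries)
```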

At which point, if you have a fileserver breaking callbacks and
discarding clients due to running out of callback space, you should be
retuning the fileserver to have more callback space or to issue shorter
callback times.

> It's not guaranteed to catch everyone, but I think the intention here is
> just to make it likely, as opposed to now where every client would
> definitely be broken for a significant amount of time (unless you take
> the manual steps to work around it).

Yup; this seems like a good idea.

-- Jeff