[OpenAFS] Re: Need volume state / fileserver / salvage knowledge

Steve Simmons scs@umich.edu
Mon, 31 Jan 2011 12:39:39 -0500


On Jan 31, 2011, at 12:17 PM, Stephen Joyce wrote:

> On Mon, 31 Jan 2011, Steve Simmons wrote:
>
>> We have seen similar issues. It occurs when there is a given vice
>> partition where lots of clients have registered callbacks but those
>> clients are no longer accessible. Not all the clients have responded
>> when the 1800-second timer goes off, and the fileserver goes down
>> uncleanly.
>>
>> We have about 235,000 volumes spread across 40 vice partitions. Our
>> 'fix' is a combination of lengthening that timeout to 3600 seconds and
>> keeping our vice partitions no larger than 2TB. Active volumes are
>> spread roughly equally across those 40 partitions. But that's just a
>> stopgap; the longer a server stays up, the more likely it is to
>> accumulate dead callbacks.
>
> Assuming this is true, isn't this a good argument to keep the weekly
> server process restarts?

Weekly outages, even if only for a few minutes each, are not acceptable
here. Doing them less frequently starts to put us into the range of the
timeout problems above.

At the moment most of our AFS service processes have run happily for 237
days. That alone is a strong argument for not needing weekly restarts.
If there are memory leaks, etc., they largely aren't affecting us.

We mostly do restarts when we need to do software upgrades of one sort
or another. They are typically done in a rolling fashion - upgrade the
hot spare(s), vos move volumes to the hot spare(s), take down the
vacated servers and upgrade, lather, rinse, repeat. At one point we went
two years without a general AFS shutdown. We only moved away from that
because of bugs that required more frequent OS upgrades, or upgrades of
the entire cell at once. Life seems generally better with respect to
those issues, and campus' opinion of the service is better when there
are no perceived outages.
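
For anyone curious what the "vos move volumes to the hot spare(s)" step
looks like, here's a minimal sketch in Python that drives vos from a
script. The server and partition names are invented, the actual upgrade
step is left out, and error handling is minimal; the vos listvol/vos move
invocations are the standard ones, but verify the flags and output format
against your own vos before trusting it. Read-only replicas and backup
clones need their own handling, which this sketch ignores.

#!/usr/bin/env python
# Illustrative only: evacuate one vice partition onto a hot spare so the
# vacated server can be taken down and upgraded.
import subprocess

def list_rw_volumes(server, partition):
    """Return the names of read/write volumes on one vice partition."""
    out = subprocess.check_output(
        ["vos", "listvol", "-server", server, "-partition", partition,
         "-localauth"],          # -localauth: run as root on a db/fileserver
        universal_newlines=True)
    volumes = []
    for line in out.splitlines():
        fields = line.split()
        # Volume lines look like: "<name> <id> RW <size> K On-line"
        if len(fields) >= 3 and fields[2] == "RW":
            volumes.append(fields[0])
    return volumes

def evacuate(src_server, src_part, dst_server, dst_part):
    """vos move every RW volume off the source partition onto the spare."""
    for vol in list_rw_volumes(src_server, src_part):
        print("moving %s from %s:%s to %s:%s"
              % (vol, src_server, src_part, dst_server, dst_part))
        subprocess.check_call(
            ["vos", "move", "-id", vol,
             "-fromserver", src_server, "-frompartition", src_part,
             "-toserver", dst_server, "-topartition", dst_part,
             "-localauth"])

if __name__ == "__main__":
    # Hypothetical names: drain fs3's /vicepa onto the hot spare; fs3 is
    # then free to be upgraded and become the next spare in the rotation.
    evacuate("fs3.example.edu", "vicepa", "spare1.example.edu", "vicepa")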

For the curious, we're running 1.4.12 with a couple of fixes we pulled
forward from the 1.4.13 development stream. Barring new developments,
the next one we'll give serious consideration to is 1.6.X.