[OpenAFS] Re: Need volume state / fileserver / salvage knowledge

Mon, 31 Jan 2011 11:54:24 -0500

On Jan 28, 2011, at 1:58 PM, Jeff Blaine wrote:

> On 1/28/2011 1:52 PM, Derrick Brashear wrote:
>> did shutdown perchance take 30min?
>
> Yes.  I found this in BosLog.old just now:
>
> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown within 1800 seconds
> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9

We have seen similar issues. This happens when many clients have
registered callbacks on a given vice partition but are no longer
reachable. Not all of those clients have responded when the 1800-second
timer expires, so the fileserver goes down uncleanly.

We have about 235,000 volumes spread across 40 vice partitions. Our
'fix' is a combination of lengthening that timeout to 3600 seconds and
keeping our vice partitions no larger than 2TB. Active volumes are
spread roughly equally across those 40 partitions. But that's just a
stopgap; the longer a server stays up, the more dead callbacks it is
likely to accumulate.

Two things I suspect but don't know for certain:

Dynamic attach may help this a bit, simply because there will be fewer
volumes attached and therefore fewer to detach. I plan on trying this
out soon. :-)

I haven't read the code, but from watching the log files during a
shutdown, it appears that the fs shutdown breaks callbacks in a
single-threaded manner per partition. This could probably be
parallelized; a simple thought experiment says X parallel callback
breaks would reduce run time T to T/X.
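
To make the T/X estimate concrete, here is a minimal sketch of that thought experiment. It is not fileserver code: it only models callback breaks as jobs with a fixed cost handed to X workers. The function name `shutdown_makespan`, the 15-second per-client timeout, and the count of 120 unreachable clients are all assumptions picked for illustration, not values taken from OpenAFS or from the logs above.

```python
import heapq


def shutdown_makespan(delays, workers):
    """Wall-clock time to work through `delays` using `workers` parallel slots.

    Each job is handed to whichever worker becomes idle first
    (greedy list scheduling). This is a model, not OpenAFS behavior.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    free_at = [0.0] * workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for delay in delays:
        start = heapq.heappop(free_at)        # earliest idle worker
        heapq.heappush(free_at, start + delay)
    return max(free_at)


# Hypothetical workload: 120 unreachable clients, 15 s timeout each.
delays = [15.0] * 120
serial = shutdown_makespan(delays, 1)    # one callback break at a time
parallel = shutdown_makespan(delays, 4)  # four at a time
```

With these made-up numbers the serial case takes 1800 seconds and the four-way case takes 450, which is the T/X result. The model also shows the limit of the estimate: the total can never drop below the single slowest callback break, so T/X only holds when the delays are roughly even.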