[OpenAFS] Re: Need volume state / fileserver / salvage knowledge

Steve Simmons scs@umich.edu
Mon, 31 Jan 2011 12:45:41 -0500


On Jan 31, 2011, at 12:36 PM, Andrew Deason wrote:

> On Mon, 31 Jan 2011 11:54:24 -0500
> Steve Simmons <scs@umich.edu> wrote:
>
>>> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
>>> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
>>> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
>>> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown within 1800 seconds
>>> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9
>>
>> We have seen similar issues. It occurs when there is a given vice
>> partition where lots of clients have registered callbacks but those
>> clients are no longer accessible. Not all the clients have responded
>> when the 1800 second timer goes off, and the fileserver goes down
>> uncleanly.
>
> Also, in this specific case, it may not be just that shutting down
> volumes took too long. 1.4.11 has known problems that can cause this
> (e.g. the host list gets a loop in it, and something spins forever
> trying to traverse the whole list).

Yeah, we got seriously bit by that bug. But not just on shutdowns;
eventually the list would be so corrupt the processes would actually
crash. Dan Hyde spent a lot of time on that; it's why we're running
1.4.12 with a couple of patches currently. 'Fixing' that bug by regular
server restarts is an argument for those restarts. But we were seeing
the 1800 second timeout on shutdown at least back to 1.4.8. Based on our
experience with earlier versions, the host list corruption issue didn't
surface until post-1.4.8. Or at least, not as badly.
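
For anyone unfamiliar with the failure mode: once a linked list has a
loop in it, a plain "walk until next is null" traversal never ends,
which matches the spinning described above. Below is a minimal sketch
of how such a loop can be detected without extra memory (Floyd's
tortoise-and-hare). The Host class and has_loop function are
hypothetical names made up for illustration; this is not OpenAFS code,
and the real fileserver is written in C.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False, repr=False)
class Host:
    """Hypothetical stand-in for one entry in a singly linked host list."""
    name: str
    next: Optional["Host"] = None


def has_loop(head: Optional[Host]) -> bool:
    """Return True if following .next from head ever revisits a node.

    Floyd's cycle detection: 'slow' advances one node per step and
    'fast' advances two. If the list ends, 'fast' hits None first.
    If the list loops, 'fast' eventually lands on the same node as 'slow'.
    """
    slow = fast = head
    while fast is not None and fast.next is not None:
        slow = slow.next
        fast = fast.next.next
        if slow is fast:
            return True
    return False


# Build a healthy three-entry list: a -> b -> c -> None
c = Host("c")
b = Host("b", c)
a = Host("a", b)
healthy = has_loop(a)

# Corrupt it so the tail points back at the head: a -> b -> c -> a -> ...
c.next = a
corrupt = has_loop(a)
```

A check like this only reports that the list is corrupt; it does not
say how the loop got there or repair it.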

Steve