[OpenAFS] mysterious afs fileserver issue

25 Oct 2001 14:14:53 +0200

"Nicholas Basila" <nbasila@bottlecapnotes.com> writes:

>     Today we had serious problem with our AFS cell. We're running
> OpenAFS 1.1.1a on three sun E220s (all running Solaris 7, 64 bit). We
> have several sparc boxes (all Ultra 10) running the AFS client (1.1.1a).
> For the last couple of days, they've been experiencing high loads ...
> init is taking more processor time than normal. We were in the process
> of tracking down the problem when suddenly, one of our three AFS servers
> (the control server, actually) had a fileserver process using about 97%
> of the cpu. The load jumped up rather high. We don't have many users on
> that server, maybe 20. I tried to restart the bos server and all the
> other servers on it, but it would hang trying to stop it. I ended up
> doing a shutdown -i6, but that also hung (trying to stop AFS, I would
> imagine). I ended up sending it into PROM from the serial console and
> synched and rebooted. The server is fine now (after it ran a salvage
> operation on its AFS partition), and the clients don't seem to be
> experiencing quite the load they were before. I didn't see anything
> noticeable in any of the AFS or system logs. Has anyone experienced
> anything like this?

I saw this on Transarc servers about a year ago. I believed they had
fixed the bug, so maybe this is a new one. In our case a very old,
never-released version of the Arla client triggered the bug.

However, instead of breaking out into the PROM you should do the following:

1) Be sure that the fileserver is really hung and not doing anything
   useful!
2) Kill it (kill -1 <pid of fileserver>) or, more forcefully, kill -9
   <pid of fileserver>

This saves you from rebooting the machine, and it also saves you from
running fsck. However, once you have killed the fileserver, the
"bosserver" will automatically start a "salvage" and after that a new
fileserver. The salvage can take anywhere from a few seconds to hours;
on my (pretty slow) fileserver with over 10,000 volumes it takes about
15-20 minutes.
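
As a rough sketch of step 2, the escalation from kill -1 to kill -9
could be wrapped in a small shell function. The function name, the
grace period, and the polling with kill -0 are assumptions made for
illustration; none of this ships with OpenAFS, and step 1 (checking
that the fileserver is really hung) still has to be done by hand.

```shell
#!/bin/sh
# Sketch only: escalate from "kill -1" to "kill -9" on a hung process.
# stop_hung_fileserver and its grace period are hypothetical helpers,
# not OpenAFS tools.

# stop_hung_fileserver PID [GRACE_SECONDS]
stop_hung_fileserver() {
    pid=${1:-}
    grace=${2:-10}

    if [ -z "$pid" ]; then
        echo "usage: stop_hung_fileserver <pid of fileserver> [grace seconds]" >&2
        return 1
    fi

    # Gentle attempt first (signal 1, as in step 2 above).
    if ! kill -1 "$pid" 2>/dev/null; then
        echo "cannot signal process $pid" >&2
        return 1
    fi

    # Poll once per second; "kill -0" sends no signal, it only tests
    # whether the pid can still be signalled.
    waited=0
    while [ "$waited" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        waited=$((waited + 1))
    done

    # Still there after the grace period: use the hard kill. The process
    # may have exited in the meantime, so a failure here is ignored.
    kill -9 "$pid" 2>/dev/null || true
    return 0
}
```

It would be called as, for example, stop_hung_fileserver <pid of
fileserver> 30 to allow 30 seconds before the kill -9.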

It would of course be great to enable logging (kill -TSTP <pid of
fileserver>, 3 times), dump all its internal data structures (kill
-XCPU <pid of fileserver>), and send the log files (or pointers to
them) to this mailing list so we can find the bug.
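
The signal sequence above could be scripted roughly as follows. This is
only a sketch: the function name, the DRY_RUN variable, and the
one-second pause between signals are assumptions added for
illustration, and the meaning of each signal is taken from the
description above rather than checked against any particular
fileserver version.

```shell
#!/bin/sh
# Sketch only: send the debugging signals described above to a fileserver.
# send_debug_signals and DRY_RUN are hypothetical, not part of OpenAFS.

# send_debug_signals PID
#   Sends SIGTSTP three times (to enable logging, per the post), then
#   SIGXCPU once (to dump internal data structures, per the post).
send_debug_signals() {
    pid=${1:-}

    if [ -z "$pid" ]; then
        echo "usage: send_debug_signals <pid of fileserver>" >&2
        return 1
    fi

    # With DRY_RUN=1 the kill commands are printed instead of executed.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run=echo
    else
        run=
    fi

    i=1
    while [ "$i" -le 3 ]; do
        $run kill -TSTP "$pid" || return 1
        # Pause between real signals (assumed to be helpful, not required).
        [ -n "$run" ] || sleep 1
        i=$((i + 1))
    done

    $run kill -XCPU "$pid"
}
```

Running it first as DRY_RUN=1 send_debug_signals <pid of fileserver>
shows the four kill commands without touching the process.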

/Jimmy