[OpenAFS] [1.2.7] Strange file server meltdown

14 Dec 2002 22:49:08 +0100

Russ Allbery <rra@stanford.edu> writes:

> (3) Once the server goes into this failure mode, it appears to be
>     impossible to restart with bos restart.  The status of the service
>     changes in bos status (it goes to temporarily disabled), but the file
>     server never shuts down.  bos restart works if you catch the server
>     early enough, but by the time that it has a thousand blocked
>     connections, it no longer seems to be listening.
> 
>     This seems like it's a bug in the interface between bosserver and the
>     fileserver, since bos restart is often used to restart a file server
>     that's in trouble.  Is there some sort of a force flag that I'm
>     missing?

Very annoying, yes. We just kill the fileserver in case of a meltdown.

> (4) Once the server goes into this failure mode, I would have expected
>     clients accessing replicated volumes on that server to fall over to
>     other replica sites, but they don't.  From the client perspective, the
>     server connection ends up in waiting_for_process for basically
>     forever.  Some client processes seem to just wait forever for it;
>     others seem to time out, but that timeout doesn't apparently turn into
>     a recognition that the file server is down, and the next time the same
>     volume is accessed, the client goes back to waiting on that file
>     server again.

I think the problem is that the fileserver is not down, just infinitely slow.
Arla clients have the same problem of not failing over to other replicas.
This behavior forces us to kill the melted-down fileserver to keep at least
the rest of the cell working.

However, I have seen the behavior above twice, on OpenAFS 1.2.6 on
Tru64 5.0a.

I was able to send the fileserver an XCPU signal, and I still have the dump
of the internal data structures taken while the fileserver was "hanging". If
somebody is interested I could put them on the web. I also have the logs,
but they are not interesting at all.

The load on the fileservers was almost 0 in both cases.

Next time this happens I will first send a "kill -XCPU" and then a
"kill -SIGSEGV" to the fileserver; maybe we can then look into the core and
find out something useful.
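
The two-step sequence above could be scripted roughly as follows. This is
only a sketch: the function name dump_and_core and the 5-second default
delay are made up for illustration, the PID has to be looked up by hand
(e.g. with ps), and a core file will only appear if core dumps are enabled
for the fileserver process.

```shell
#!/bin/sh
# Hypothetical helper, not part of OpenAFS.
# Usage: dump_and_core PID [DELAY_SECONDS]
dump_and_core() {
    pid=$1
    delay=${2:-5}   # assumed delay; adjust to how long the dump takes

    # Step 1: XCPU, the signal used above to get the dump of the
    # internal data structures from the hanging fileserver.
    kill -XCPU "$pid" || return 1

    # Give the process some time to finish writing the dump.
    sleep "$delay"

    # Step 2: SEGV, so the process dies and (if core dumps are
    # enabled) leaves a core file to look into.
    kill -SEGV "$pid"
}
```

For example, "dump_and_core 12345 10" would signal PID 12345 and wait ten
seconds between the two signals.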

/Jimmy