[OpenAFS] [1.2.7] Strange file server meltdown

Todd_DeSantis@transarc.com
Fri, 13 Dec 2002 08:39:32 -0500 (EST)


Hi Russ:

> (3) Once the server goes into this failure mode, it appears to be
>     impossible to restart with bos restart.  The status of the
>     service changes in bos status (it goes to temporarily disabled),
>     but the file  server never shuts down.  bos restart works if you
>     catch the server early enough, but by the time that it has a
>     thousand blocked connections, it no longer seems to be listening.
>
>     This seems like it's a bug in the interface between bosserver
>     and the fileserver, since bos restart is often used to restart a
>     file server that's in trouble.  Is there some sort of a force
>     flag that I'm missing ?

In your other email, you mention that the host lock is being held,
etc.  During a "bos restart", the fileserver shuts down and then
the bosserver restarts it once it sees that it is no longer
running.

However, when the fileserver is shutting down, it tries to report
statistics, and one of the routines it calls tries to walk the
hostList.  Since the host lock is already being held, this thread
blocks and the fileserver never finishes shutting down.
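
To make the blocking concrete, here is a rough sketch in Python (not the
actual fileserver code, which is C). The names host_lock,
shutdown_report_stats and simulate are made up for illustration, and the
timeout exists only so the demo terminates; as described above, the real
shutdown thread would simply keep waiting.

```python
import threading

# Hypothetical stand-in for the fileserver's host lock.
host_lock = threading.Lock()


def shutdown_report_stats(timeout):
    """Model the shutdown path: it needs the host lock to walk the host list."""
    # The timeout is only here so the demo ends; the real thread waits forever.
    if not host_lock.acquire(timeout=timeout):
        return "blocked"
    try:
        return "stats reported"
    finally:
        host_lock.release()


def simulate(lock_is_stuck):
    """Run the shutdown path in its own thread, optionally with the lock held."""
    if lock_is_stuck:
        # Model another thread that took the host lock and never released it.
        host_lock.acquire()
    try:
        result = []
        worker = threading.Thread(
            target=lambda: result.append(shutdown_report_stats(0.2))
        )
        worker.start()
        worker.join()
        return result[0]
    finally:
        if lock_is_stuck:
            host_lock.release()


if __name__ == "__main__":
    print(simulate(lock_is_stuck=False))
    print(simulate(lock_is_stuck=True))
```

With the lock free the shutdown path completes; with the lock stuck it
never gets past the host list walk, which matches a fileserver that
stays up after "bos restart".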

When you got the rxdebug output from the fileserver, how many
connections were there in total?

If you used the -rxstats flag, what did the following line look like?

   1 server connections, 7 client connections, 7 peer structs,
   4 call structs, 3 free call structs

How many server connections and how many client connections did it
show?
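
If it helps to pull those two numbers out of saved rxdebug output, a
small Python sketch along these lines should do it. It assumes the
summary appears on a single line in the same wording as the sample
above; the function name connection_counts is just for illustration.

```python
import re

# Sample summary line, copied from the example above.
SAMPLE = (
    "1 server connections, 7 client connections, 7 peer structs, "
    "4 call structs, 3 free call structs"
)

# Assumes the wording matches the sample line exactly.
PATTERN = re.compile(r"(\d+) server connections, (\d+) client connections")


def connection_counts(text):
    """Return (server, client) connection counts, or None if no line matches."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(connection_counts(SAMPLE))  # (1, 7)
```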

As you mention, there are chances of race conditions going through the
host lists.

Thanks

Todd