[OpenAFS-devel] ARGH... afs outage... and same stinking client bug that it doesn't ever see it...

Neulinger, Nathan nneul@umr.edu
Tue, 2 Oct 2001 13:04:27 -0500


We had another afs outage here, and once again, the same client bug is
causing the clients to never see the failure.

This was the same fileserver bug that I reported a few weeks ago, where the
server accumulates tens of thousands of host entries and over a million
client entries:

             37653 host_NumHostEntries
                74 host_HostBlocks
              1533 host_NonDeletedHosts
              1508 host_HostsInSameNetOrSubnet
                 3 host_HostsInDiffSubnet
                22 host_HostsInDiffNetwork
           1172202 host_NumClients
             16058 host_ClientBlocks

That host_NumClients value is getting HUGE, and the file server is consuming
larger and larger amounts of memory.
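
To put the counters above in proportion, here is a minimal sketch that
parses lines of the form "<value> <name>" and computes client entries per
host. The line layout is assumed from the excerpt in this message only, and
the helper names (parse_counters, clients_per_host) are made up for
illustration; they are not part of any AFS tool.

```python
# Sketch only: the "<value> <name>" layout is assumed from the excerpt in
# this message, not taken from xstat documentation.
XSTAT_EXCERPT = """
             37653 host_NumHostEntries
                74 host_HostBlocks
              1533 host_NonDeletedHosts
              1508 host_HostsInSameNetOrSubnet
                 3 host_HostsInDiffSubnet
                22 host_HostsInDiffNetwork
           1172202 host_NumClients
             16058 host_ClientBlocks
"""


def parse_counters(text):
    """Return {name: int_value} for every line shaped like '<value> <name>'."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or differently shaped lines
        value, name = parts
        try:
            counters[name] = int(value)
        except ValueError:
            continue  # first field was not an integer
    return counters


def clients_per_host(counters):
    """Client entries per non-deleted host entry (hypothetical health metric)."""
    return counters["host_NumClients"] / counters["host_NonDeletedHosts"]


counters = parse_counters(XSTAT_EXCERPT)
print(f"clients per non-deleted host: {clients_per_host(counters):.1f}")
print(
    "clients per host entry: "
    f"{counters['host_NumClients'] / counters['host_NumHostEntries']:.1f}"
)
```

With the figures in this message that works out to roughly 765 client
entries per non-deleted host. The three subnet/network counters also sum to
host_NonDeletedHosts (1508 + 3 + 22 = 1533), so the host breakdown itself
looks consistent; the client count is the one that is out of line.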

(Feel free to run xstat against afs[1-10].umr.edu; some are db-only,
though.) Interestingly, I also see this:

        -1090519756 rx_bogusHost

(Output from xstat_fs_test afs4 1)
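
The negative number is consistent with a 32-bit quantity being printed as a
signed integer. A minimal sketch, assuming exactly that (the field's
declared type and meaning are not confirmed here, and to_unsigned32 is a
made-up helper name), shows the same bit pattern read as unsigned:

```python
def to_unsigned32(value):
    """Reinterpret a signed 32-bit integer as unsigned (same bit pattern)."""
    return value & 0xFFFFFFFF


# Value of rx_bogusHost as printed above; assumed to be a 32-bit field.
reported = -1090519756
print(to_unsigned32(reported))
```

If the assumption holds, the field has its top bit set, which is why it
shows up as negative; whether that reflects a genuine count or something
else stored in the field cannot be told from the output alone.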

On a server that has been rebooted recently due to this problem, I saw 594
million for rx_bogusHost and 293 thousand for host_NumClients. Somewhere
there is a bad leak. Has anyone else seen anything like this?

We're contemplating re-enabling periodic (maybe monthly) server restarts at
the moment, but would rather have a better fix.

As for the client bug, which appears to occur on several different
platforms: the clients basically just hang. They don't time out and notice
that the file server has gone down, or anything. But the instant I kill,
reboot, or firewall that file server, the clients ALL break loose
immediately. Until then, all of the afs clients are completely hung and
won't respond to much of anything. This means that a single afs server going
down in this way negates all benefit of replicated volumes.

I have never been able to reproduce this symptom by
suspending/firewalling/etc. a file server; when I do that, the clients all
see the failure immediately.

If someone can give me any ideas on how I might reproduce this failure
symptom (e.g. dropped packets, whatever), I have a test cell that I will use
to see about diagnosing the client and server, but at the moment I do not
have any way of reproducing the symptom. Whenever I've tried anything, the
client always sees the failure immediately. What situation would cause the
client to hang and not time out? There must be a particular location in the
cache manager code where that situation can occur. (Or is it a case where
the server is semi-responding, but not enough to cause the client to break
loose? If so, I wonder if there is some way to make the client more
sensitive, with a greater tendency to drop the server.)

-- Nathan

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-4841
Computing Services                       Fax: (573) 341-4216