[OpenAFS-devel] ARGH... afs outage... and same stinking client bug that it doesn't ever see it...

Neulinger, Nathan nneul@umr.edu
Tue, 2 Oct 2001 13:25:26 -0500


Just a thought - does the file server maintain any state information for
authenticated users?

Could this be a case of NUM-HOSTS * NUM-USERS * NUM-SEPARATE-PAGS = LARGE
NUMBER?
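
If so, the numbers multiply out fast. As a back-of-the-envelope check (purely a sketch: the 100-byte per-entry size below is my assumption, not the real sizeof(struct client) from viced/host.h):

    /* Rough client-entry memory estimate. The 100-byte per-entry size
     * is an assumed figure; substitute the real structure size from
     * viced/host.h. The counter values are taken from the xstat
     * output quoted below. */
    #include <stdio.h>

    int main(void)
    {
        long hosts = 1533;          /* host_NonDeletedHosts */
        long clients = 1172202;     /* host_NumClients */
        long entry_bytes = 100;     /* assumed per-entry size */

        printf("avg %ld client entries per host\n", clients / hosts);
        printf("approx %ld MB in client entries\n",
               clients * entry_bytes / (1024 * 1024));
        return 0;
    }

With those assumptions that's roughly 764 client entries per host and on the order of 110 MB just for client entries, which would line up with the growing memory footprint described below.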

I see all the code that increments/decrements CEs (client entries) in viced/host.c.
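
The classic way such counts run away is an error path that takes a reference and never drops it. Purely as an illustration of that failure mode (hypothetical names and structure, not the actual viced/host.c code):

    /* Hypothetical refcount-leak illustration, NOT the real viced
     * code: an early return on the error path skips the release, so
     * the client entry stays pinned forever. */
    #include <stdio.h>

    struct client { int refCount; };

    static void hold(struct client *c)    { c->refCount++; }
    static void release(struct client *c) { c->refCount--; }

    static int handle_rpc(struct client *c, int fail)
    {
        hold(c);
        if (fail)
            return -1;      /* BUG: no release() on this path */
        release(c);
        return 0;
    }

    int main(void)
    {
        struct client ce = { 0 };
        handle_rpc(&ce, 1);                     /* failing call */
        printf("refCount after failed RPC: %d\n", ce.refCount);
        return 0;
    }

If one of the fileserver's error paths misses its decrement like this, client entries would pile up exactly the way host_NumClients does below.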

It would be nice to have a kdump equivalent for the file server process.

-- Nathan

> -----Original Message-----
> From: Neulinger, Nathan [mailto:nneul@umr.edu]
> Sent: Tuesday, October 02, 2001 1:04 PM
> To: 'openafs-devel@openafs.org'
> Subject: [OpenAFS-devel] ARGH... afs outage... and same stinking client
> bug that it doesn't ever see it...
> 
> 
> We had another afs outage here, and once again, the same client bug is
> causing the clients to never see the failure.
> 
> This was the same fileserver bug that I reported a few weeks ago, where
> it is accumulating hundreds of thousands, even millions, of host/client
> entries:
> 
>              37653 host_NumHostEntries
>                 74 host_HostBlocks
>               1533 host_NonDeletedHosts
>               1508 host_HostsInSameNetOrSubnet
>                  3 host_HostsInDiffSubnet
>                 22 host_HostsInDiffNetwork
>            1172202 host_NumClients
>              16058 host_ClientBlocks
> 
> That numClients is getting HUGE, and the file server is sucking up
> larger and larger amounts of memory.
> 
> (Feel free to run xstat against afs[1-10].umr.edu; some are db-only,
> though.) Interestingly, I also see this:
> 
>         -1090519756 rx_bogusHost
> 
> (Output from xstat_fs_test afs4 1)
> 
> On a server that had been rebooted recently because of this problem, I
> saw 594 million on the bogusHost and 293 thousand on the numClients.
> Somewhere there is a bad leak. Has anyone else seen anything like this?
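> 
> A side note on that negative value: my assumption is that rx_bogusHost
> is an unsigned 32-bit counter being printed through a signed format, in
> which case -1090519756 is really 3204447540. A one-liner to check the
> reinterpretation:
> 
>     /* Reinterpret the signed printout as an unsigned 32-bit counter;
>      * assumes the stat is 32 bits wide and was printed with %d. */
>     #include <stdio.h>
> 
>     int main(void)
>     {
>         int printed = -1090519756;
>         printf("%u\n", (unsigned int)printed);    /* 3204447540 */
>         return 0;
>     }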
> 
> We're contemplating re-enabling periodic (maybe monthly) server
> restarts at the moment, but would rather have a better fix.
> 
> As for the client bug - which appears to occur on several different
> platforms - the clients just hang. They never time out and notice that
> the file server is down. Yet the instant I kill, reboot, or firewall
> that file server, the clients ALL break loose immediately. Until then,
> every afs client is completely hung and won't respond to much of
> anything, which means that a single afs server going down in this way
> negates all benefit of replicated volumes.
> 
> I have never been able to reproduce this symptom by suspending,
> firewalling, etc. a file server; in my tests the clients all see the
> failure immediately.
> 
> If someone can give me any ideas on how I might reproduce this failure
> symptom (e.g. dropped packets), I have a test cell that I will use to
> diagnose the client and server; at the moment I have no way of
> reproducing it, since whenever I try anything the client immediately
> sees the failure. What situation would cause the client to hang and not
> time out? There must be a particular location in the cache manager code
> where that can happen. (Or is the server semi-responding - answering
> just enough to keep the clients from breaking loose? If so, I wonder
> whether there is some way to make the client more sensitive, with a
> greater tendency to drop the server.)
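> 
> One crude idea, in case it helps anyone experiment: a lossy UDP relay
> parked on the fileserver port. This is only a sketch under big
> assumptions - it handles a single client, ignores errors, and the test
> cell would have to be arranged so clients address the relay's IP
> instead of the real fileserver's:
> 
>     /* Lossy UDP relay sketch: listen on the AFS fileserver port
>      * (7000/udp), forward datagrams to the real fileserver, and
>      * randomly drop a fraction in each direction. Single-client
>      * toy only; the names and drop rate are mine. */
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>     #include <sys/types.h>
>     #include <sys/socket.h>
>     #include <netinet/in.h>
>     #include <arpa/inet.h>
> 
>     #define RELAY_PORT 7000        /* AFS fileserver port */
>     #define DROP_RATE  0.3         /* fraction of packets discarded */
> 
>     int main(int argc, char **argv)
>     {
>         if (argc != 2) {
>             fprintf(stderr, "usage: %s <fileserver-ip>\n", argv[0]);
>             return 1;
>         }
> 
>         int s = socket(AF_INET, SOCK_DGRAM, 0);
>         struct sockaddr_in local, server, client;
>         memset(&local, 0, sizeof(local));
>         memset(&server, 0, sizeof(server));
>         memset(&client, 0, sizeof(client));
> 
>         local.sin_family = AF_INET;
>         local.sin_addr.s_addr = htonl(INADDR_ANY);
>         local.sin_port = htons(RELAY_PORT);
>         bind(s, (struct sockaddr *)&local, sizeof(local));
> 
>         server.sin_family = AF_INET;
>         server.sin_port = htons(RELAY_PORT);
>         inet_pton(AF_INET, argv[1], &server.sin_addr);
> 
>         char buf[65536];
>         for (;;) {
>             struct sockaddr_in from;
>             socklen_t flen = sizeof(from);
>             ssize_t n = recvfrom(s, buf, sizeof(buf), 0,
>                                  (struct sockaddr *)&from, &flen);
>             if (n < 0)
>                 continue;
>             if ((double)rand() / RAND_MAX < DROP_RATE)
>                 continue;                     /* simulate the loss */
>             if (from.sin_addr.s_addr == server.sin_addr.s_addr) {
>                 sendto(s, buf, n, 0,          /* server -> client */
>                        (struct sockaddr *)&client, sizeof(client));
>             } else {
>                 client = from;                /* client -> server */
>                 sendto(s, buf, n, 0,
>                        (struct sockaddr *)&server, sizeof(server));
>             }
>         }
>     }
> 
> Varying DROP_RATE (or dropping only replies, or only certain packets)
> might hit whatever half-alive state keeps the cache manager from
> declaring the server down.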
> 
> -- Nathan
> 
> ------------------------------------------------------------
> Nathan Neulinger                       EMail:  nneul@umr.edu
> University of Missouri - Rolla         Phone: (573) 341-4841
> Computing Services                       Fax: (573) 341-4216
> _______________________________________________
> OpenAFS-devel mailing list
> OpenAFS-devel@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-devel
>