[OpenAFS-devel] ARGH... afs outage... and same stinking client bug that it doesn't ever see it...

Neulinger, Nathan nneul@umr.edu
Tue, 2 Oct 2001 13:48:20 -0500


Well, I just found the SIGXCPU thing... very interesting... 

Yes, it was the situation below, but that's still bad because it looks like it
doesn't clean up after itself or prune the list in any way:

Host 83974e0d.7001 down = 0, LastCall Tue Oct  2 13:35:00 2001
    user id=32766,  name=anonymous, sl=Not authenticated till No Limit
      CPS-2 is []
Host 83977052.7001 down = 0, LastCall Tue Oct  2 13:21:35 2001
    user=anonymous, no current server connection
      CPS-2 is []
Host 83971b9f.7001 down = 0, LastCall Tue Oct  2 13:23:21 2001
    user=anonymous, no current server connection
      CPS-2 is []
Host 8397050d.7001 down = 0, LastCall Tue Oct  2 13:16:33 2001
    user=anonymous, no current server connection
      CPS-2 is []
    user=anonymous, no current server connection
      CPS-2 is []
Host 839706e6.7001 down = 0, LastCall Tue Oct  2 13:03:36 2001
    user=anonymous, no current server connection
      CPS-2 is []
Host 83a92a33.7001 down = 0, LastCall Tue Oct  2 13:21:46 2001
    user=anonymous, no current server connection
      CPS-2 is []
Host 83970660.7001 down = 0, LastCall Tue Oct  2 12:55:56 2001
    user=anonymous, no current server connection
      CPS-2 is []
Host 8397065f.7001 down = 0, LastCall Tue Oct  2 12:43:38 2001
    user=anonymous, no current server connection
      CPS-2 is []
Host 8397051f.7001 down = 0, LastCall Tue Oct  2 13:22:56 2001
    user=anonymous, no current server connection
      CPS-2 is []


That's just a teensy snippet from one of the servers...
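
For anyone who wants to tally a dump like the one above, here is a rough Python sketch. It assumes the hex field after "Host" is an IPv4 address in network byte order (83974e0d would then be 131.151.78.13) and that every indented line starting with "user" is one client entry. The names hex_to_dotted and clients_per_host are made up for this illustration; they are not part of OpenAFS.

```python
import re
import socket
from collections import Counter

# Matches lines such as:
#   Host 83974e0d.7001 down = 0, LastCall Tue Oct  2 13:35:00 2001
HOST_RE = re.compile(r"^Host ([0-9a-fA-F]{8})\.(\d+) down = (\d+)")


def hex_to_dotted(hex_addr):
    """Convert an 8-digit hex address to dotted-quad form.

    Assumption: the dump prints the address in network byte order.
    """
    return socket.inet_ntoa(bytes.fromhex(hex_addr))


def clients_per_host(dump_text):
    """Count the 'user...' lines listed under each 'Host' line."""
    counts = Counter()
    current = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        match = HOST_RE.match(line)
        if match:
            current = hex_to_dotted(match.group(1))
            counts[current] += 0  # record the host even with no clients
        elif current is not None and line.startswith("user"):
            counts[current] += 1
    return counts
```

Feeding the full dump to clients_per_host(...).most_common(10) would show which hosts hold the most client entries, which might help tell a few misbehaving clients apart from a general leak.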

What causes the file server to clean up that clients table? As far as I can
tell, that list currently grows without bound.
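
To put rough numbers on that, here is a small Python sketch using the xstat counters quoted further down in this message. It rests on two assumptions: that host_NumClients and host_NonDeletedHosts can be compared directly, and that rx_bogusHost is a 32-bit counter printed as a signed value, so the negative number is a wrapped count. The helper name as_unsigned32 is only illustrative.

```python
def as_unsigned32(value):
    """Reinterpret a signed 32-bit integer as unsigned.

    Assumption: the counter is 32 bits wide and was printed as signed.
    """
    return value & 0xFFFFFFFF


# Counters copied from the xstat_fs_test output quoted below.
host_num_clients = 1172202
host_non_deleted_hosts = 1533
rx_bogus_host_signed = -1090519756

# Average number of client entries held per non-deleted host entry.
avg_clients_per_host = host_num_clients / host_non_deleted_hosts

# The same rx_bogusHost value read as an unsigned 32-bit counter.
rx_bogus_host_unsigned = as_unsigned32(rx_bogus_host_signed)

print(f"{avg_clients_per_host:.1f} client entries per non-deleted host")
print(f"rx_bogusHost as unsigned: {rx_bogus_host_unsigned}")
```

If those assumptions hold, the server is carrying several hundred client entries per live host, which fits the HOSTS * USERS * PAGS idea from my earlier message.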

Is there anything that can be done here, or am I chasing a red herring?

-- Nathan

> -----Original Message-----
> From: Neulinger, Nathan [mailto:nneul@umr.edu]
> Sent: Tuesday, October 02, 2001 1:25 PM
> To: 'openafs-devel@openafs.org'
> Subject: RE: [OpenAFS-devel] ARGH... afs outage... and same stinking
> client bug that it doesn't ever see it...
> 
> 
> Just a thought - does the file server maintain any state information for
> authenticated users?
> 
> Could this be a case of NUM-HOSTS * NUM-USERS * NUM-SEPARATE-PAGS = LARGE
> NUMBER?
> 
> I see all the code that increments/decrements CEs in viced/host.c.
> 
> Would be nice to have a kdump equivalent for the file server process. 
> 
> -- Nathan
> 
> > -----Original Message-----
> > From: Neulinger, Nathan [mailto:nneul@umr.edu]
> > Sent: Tuesday, October 02, 2001 1:04 PM
> > To: 'openafs-devel@openafs.org'
> > Subject: [OpenAFS-devel] ARGH... afs outage... and same stinking client
> > bug that it doesn't ever see it...
> > 
> > 
> > We had another afs outage here, and once again, the same client bug is
> > causing the clients to never see the failure.
> > 
> > This was the same fileserver bug that I reported a few weeks ago where it is
> > accumulating hundreds/millions of host entries:
> > 
> >              37653 host_NumHostEntries
> >                 74 host_HostBlocks
> >               1533 host_NonDeletedHosts
> >               1508 host_HostsInSameNetOrSubnet
> >                  3 host_HostsInDiffSubnet
> >                 22 host_HostsInDiffNetwork
> >            1172202 host_NumClients
> >              16058 host_ClientBlocks
> > 
> > That numclients is getting HUGE, and the file server is sucking larger and
> > larger amounts of memory. 
> > 
> > (Feel free to run xstat against afs[1-10].umr.edu. Some are db only
> > though.) Interestingly I also see this:
> > 
> >         -1090519756 rx_bogusHost
> > 
> > (Output from xstat_fs_test afs4 1)
> > 
> > On a server that has been rebooted recently due to this problem, I saw 594
> > million on the bogusHost, and 293 thousand on the numClients. Somewhere
> > there is a bad leak. Anyone else seen anything like this? 
> > 
> > We're contemplating re-enabling periodic (maybe monthly) server restarts at
> > the moment, but would rather have a better fix.
> > 
> > As far as the client bug - which appears to occur on several different
> > platforms - basically they just hang. They don't time out and see the file
> > server go down, or anything. Now - the instant I kill that file
> > server/reboot/firewall it, the clients ALL break loose immediately. The
> > problem is basically that all of the afs clients are completely hung and
> > won't respond to much of anything. This means that a single afs server going
> > down in this way negates all benefit of replicated volumes. 
> > 
> > I have never been able to reproduce this symptom by
> > suspending/firewalling/etc. a file server; the clients all see it
> > immediately. 
> > 
> > If someone can give me any ideas on how I might reproduce this failure
> > symptom (e.g. dropped packets, whatever) I have a test cell that I will use
> > to see about diagnosing the client and server, but at the moment, I do not
> > have any way of reproducing the symptom. Whenever I've tried anything, the
> > client always immediately sees the failure. What situation would cause the
> > client to hang and not time out? There must be a particular location in the
> > cache manager code where that situation can occur. (Or is it a case where
> > it's semi-responding, but not enough to cause the client to break loose? If
> > so, I wonder if there is some way to cause the client to be more sensitive
> > and have a greater tendency to drop the server.)
> > 
> > -- Nathan
> > 
> > ------------------------------------------------------------
> > Nathan Neulinger                       EMail:  nneul@umr.edu
> > University of Missouri - Rolla         Phone: (573) 341-4841
> > Computing Services                       Fax: (573) 341-4216
> > _______________________________________________
> > OpenAFS-devel mailing list
> > OpenAFS-devel@openafs.org
> > https://lists.openafs.org/mailman/listinfo/openafs-devel
> > 