[OpenAFS-devel] ARGH... afs outage... and same stinking client bug that it doesn't ever see it...
Neulinger, Nathan
nneul@umr.edu
Tue, 2 Oct 2001 13:57:49 -0500
Looking at hosts.dump from one of our fileservers that has a large
NumClients - something interesting came up - there are duplicated entries...
troot-afs4(70)> cat hosts.dump | awk '{ print $1 }' | grep ip: | wc -l
37694
troot-afs4(71)> cat hosts.dump | awk '{ print $1 }' | grep ip: | sort | uniq | wc -l
3000
Also, when I checked clients.dump on a machine that had a large NumClients,
the sizes didn't match up: the number of clients listed in clients.dump was
nowhere near the millions being reported. So somehow, something is getting
out of sync, I think.
-- Nathan
> -----Original Message-----
> From: Neulinger, Nathan [mailto:nneul@umr.edu]
> Sent: Tuesday, October 02, 2001 1:48 PM
> To: 'openafs-devel@openafs.org'
> Subject: RE: [OpenAFS-devel] ARGH... afs outage... and same stinking
> client bug that it doesn't ever see it...
>
>
> Well, I just found the SIGXCPU thing... very interesting...
>
> Yes, it was the below situation, but that's still bad because it looks
> like it doesn't clean up after itself or prune the list in any way:
>
> Host 83974e0d.7001 down = 0, LastCall Tue Oct 2 13:35:00 2001
> user id=32766, name=anonymous, sl=Not authenticated till No Limit
> CPS-2 is []
> Host 83977052.7001 down = 0, LastCall Tue Oct 2 13:21:35 2001
> user=anonymous, no current server connection
> CPS-2 is []
> Host 83971b9f.7001 down = 0, LastCall Tue Oct 2 13:23:21 2001
> user=anonymous, no current server connection
> CPS-2 is []
> Host 8397050d.7001 down = 0, LastCall Tue Oct 2 13:16:33 2001
> user=anonymous, no current server connection
> CPS-2 is []
> user=anonymous, no current server connection
> CPS-2 is []
> Host 839706e6.7001 down = 0, LastCall Tue Oct 2 13:03:36 2001
> user=anonymous, no current server connection
> CPS-2 is []
> Host 83a92a33.7001 down = 0, LastCall Tue Oct 2 13:21:46 2001
> user=anonymous, no current server connection
> CPS-2 is []
> Host 83970660.7001 down = 0, LastCall Tue Oct 2 12:55:56 2001
> user=anonymous, no current server connection
> CPS-2 is []
> Host 8397065f.7001 down = 0, LastCall Tue Oct 2 12:43:38 2001
> user=anonymous, no current server connection
> CPS-2 is []
> Host 8397051f.7001 down = 0, LastCall Tue Oct 2 13:22:56 2001
> user=anonymous, no current server connection
> CPS-2 is []
>
>
> That's just a teensy snippet from one of the servers...
>
> What causes the file server to clean up that clients table? As it
> looks to me, that list will grow without bound currently.
>
> Is there anything that can be done here, or am I chasing a
> red herring?
>
> -- Nathan
>
> > -----Original Message-----
> > From: Neulinger, Nathan [mailto:nneul@umr.edu]
> > Sent: Tuesday, October 02, 2001 1:25 PM
> > To: 'openafs-devel@openafs.org'
> > Subject: RE: [OpenAFS-devel] ARGH... afs outage... and same stinking
> > client bug that it doesn't ever see it...
> >
> >
> > Just a thought - does the file server maintain any state
> > information for
> > authenticated users?
> >
> > Could this be a case of NUM-HOSTS * NUM-USERS *
> > NUM-SEPARATE-PAGS = LARGE
> > NUMBER?
> >
> > I see all the code that increments/decrements CEs in viced/host.c.
> >
> > Would be nice to have a kdump equivalent for the file server
> > process.
> >
> > -- Nathan
> >
> > > -----Original Message-----
> > > From: Neulinger, Nathan [mailto:nneul@umr.edu]
> > > Sent: Tuesday, October 02, 2001 1:04 PM
> > > To: 'openafs-devel@openafs.org'
> > > Subject: [OpenAFS-devel] ARGH... afs outage... and same stinking
> > > client bug that it doesn't ever see it...
> > >
> > >
> > > We had another afs outage here, and once again, the same client
> > > bug is causing the clients to never see the failure.
> > >
> > > This was the same fileserver bug that I reported a few weeks ago,
> > > where it is accumulating hundreds of thousands, even millions, of
> > > host entries:
> > >
> > > 37653 host_NumHostEntries
> > > 74 host_HostBlocks
> > > 1533 host_NonDeletedHosts
> > > 1508 host_HostsInSameNetOrSubnet
> > > 3 host_HostsInDiffSubnet
> > > 22 host_HostsInDiffNetwork
> > > 1172202 host_NumClients
> > > 16058 host_ClientBlocks
> > >
> > > That numclients is getting HUGE, and the file server is sucking
> > > larger and larger amounts of memory.
> > >
> > > (Feel free to run xstat against afs[1-10].umr.edu; some are
> > > db-only, though.) Interestingly, I also see this:
> > >
> > > -1090519756 rx_bogusHost
> > >
> > > (Output from xstat_fs_test afs4 1)
> > >
> > > On a server that has been rebooted recently due to this problem,
> > > I saw 594 million on the bogusHost, and 293 thousand on the
> > > numClients. Somewhere there is a bad leak. Anyone else seen
> > > anything like this?
> > >
> > > We're contemplating re-enabling periodic (maybe monthly) server
> > > restarts at the moment, but would rather have a better fix.
> > >
> > > As far as the client bug - which appears to occur on several
> > > different platforms - basically they just hang. They don't time
> > > out and see the file server go down, or anything. Now - the
> > > instant I kill that file server, reboot it, or firewall it, the
> > > clients ALL break loose immediately. The problem is basically
> > > that all of the afs clients are completely hung and won't respond
> > > to much of anything. This means that a single afs server going
> > > down in this way negates all benefit of replicated volumes.
> > >
> > > I have never been able to reproduce this symptom by suspending,
> > > firewalling, etc., a file server; the clients always see it
> > > immediately.
> > >
> > > If someone can give me any ideas on how I might reproduce this
> > > failure symptom (i.e. dropped packets, whatever), I have a test
> > > cell that I will use to see about diagnosing the client and
> > > server, but at the moment, I do not have any way of reproducing
> > > the symptom. Whenever I've tried anything, the client always
> > > immediately sees the failure. What situation would cause the
> > > client to hang and not time out - there must be a particular
> > > location in the cache manager code where that situation can
> > > occur. (Or is it a case where it's semi-responding, but not
> > > enough to cause the client to break loose - if so, I wonder if
> > > there is some way to make the client more sensitive, with a
> > > greater tendency to drop the server?)
> > >
> > > -- Nathan
> > >
> > > ------------------------------------------------------------
> > > Nathan Neulinger EMail: nneul@umr.edu
> > > University of Missouri - Rolla Phone: (573) 341-4841
> > > Computing Services Fax: (573) 341-4216
> > > _______________________________________________
> > > OpenAFS-devel mailing list
> > > OpenAFS-devel@openafs.org
> > > https://lists.openafs.org/mailman/listinfo/openafs-devel
> > >