[OpenAFS-devel] ARGH... afs outage... and same stinking client bug that it doesn't ever see it...
Neulinger, Nathan
nneul@umr.edu
Tue, 2 Oct 2001 13:57:49 -0500
Looking at hosts.dump from one of our fileservers that has a large
NumClients - something interesting came up - there are duplicated entries...
troot-afs4(70)> cat hosts.dump | awk '{ print $1 }' | grep ip: | wc -l
37694
troot-afs4(71)> cat hosts.dump | awk '{ print $1 }' | grep ip: | sort | uniq | wc -l
3000
Also, when I checked clients.dump on a machine that had a large NumClients,
the sizes didn't match up: the number of clients listed in clients.dump was
nowhere near the millions being reported. So somehow, something is getting
out of sync, I think.
-- Nathan
> -----Original Message-----
> From: Neulinger, Nathan [mailto:nneul@umr.edu]
> Sent: Tuesday, October 02, 2001 1:48 PM
> To: 'openafs-devel@openafs.org'
> Subject: RE: [OpenAFS-devel] ARGH... afs outage... and same stinking
> client bug that it doesn't ever see it...
>
>
> Well, I just found the SIGXCPU thing... very interesting...
>
> Yes, it was the below situation, but that's still bad because it looks
> like it doesn't clean up after itself or prune the list in any way:
>
> Host 83974e0d.7001 down = 0, LastCall Tue Oct 2 13:35:00 2001
> user id=32766, name=anonymous, sl=Not authenticated till No Limit
> CPS-2 is []
> Host 83977052.7001 down = 0, LastCall Tue Oct 2 13:21:35 2001
> user=anonymous, no current server connection
> CPS-2 is []
> Host 83971b9f.7001 down = 0, LastCall Tue Oct 2 13:23:21 2001
> user=anonymous, no current server connection
> CPS-2 is []
> Host 8397050d.7001 down = 0, LastCall Tue Oct 2 13:16:33 2001
> user=anonymous, no current server connection
> CPS-2 is []
> user=anonymous, no current server connection
> CPS-2 is []
> Host 839706e6.7001 down = 0, LastCall Tue Oct 2 13:03:36 2001
> user=anonymous, no current server connection
> CPS-2 is []
> Host 83a92a33.7001 down = 0, LastCall Tue Oct 2 13:21:46 2001
> user=anonymous, no current server connection
> CPS-2 is []
> Host 83970660.7001 down = 0, LastCall Tue Oct 2 12:55:56 2001
> user=anonymous, no current server connection
> CPS-2 is []
> Host 8397065f.7001 down = 0, LastCall Tue Oct 2 12:43:38 2001
> user=anonymous, no current server connection
> CPS-2 is []
> Host 8397051f.7001 down = 0, LastCall Tue Oct 2 13:22:56 2001
> user=anonymous, no current server connection
> CPS-2 is []
>
>
> That's just a teensy snippet from one of the servers...
>
> What causes the file server to clean up that clients table? As it
> looks to me, that list will grow without bound currently.
>
> Is there anything that can be done here, or am I chasing a
> red herring?
>
> -- Nathan
>
> > -----Original Message-----
> > From: Neulinger, Nathan [mailto:nneul@umr.edu]
> > Sent: Tuesday, October 02, 2001 1:25 PM
> > To: 'openafs-devel@openafs.org'
> > Subject: RE: [OpenAFS-devel] ARGH... afs outage... and same stinking
> > client bug that it doesn't ever see it...
> >
> >
> > Just a thought - does the file server maintain any state
> > information for
> > authenticated users?
> >
> > Could this be a case of NUM-HOSTS * NUM-USERS *
> > NUM-SEPARATE-PAGS = LARGE
> > NUMBER?
> >
> > I see all the code that increments/decrements CEs in viced/host.c.
> >
> > Would be nice to have a kdump equivalent for the file server
> > process.
> >
> > -- Nathan
> >
> > > -----Original Message-----
> > > From: Neulinger, Nathan [mailto:nneul@umr.edu]
> > > Sent: Tuesday, October 02, 2001 1:04 PM
> > > To: 'openafs-devel@openafs.org'
> > > Subject: [OpenAFS-devel] ARGH... afs outage... and same stinking
> > > client bug that it doesn't ever see it...
> > >
> > >
> > > We had another afs outage here, and once again, the same client
> > > bug is causing the clients to never see the failure.
> > >
> > > This was the same fileserver bug that I reported a few weeks ago,
> > > where it is accumulating hundreds of thousands, even millions, of
> > > host entries:
> > >
> > > 37653 host_NumHostEntries
> > > 74 host_HostBlocks
> > > 1533 host_NonDeletedHosts
> > > 1508 host_HostsInSameNetOrSubnet
> > > 3 host_HostsInDiffSubnet
> > > 22 host_HostsInDiffNetwork
> > > 1172202 host_NumClients
> > > 16058 host_ClientBlocks
> > >
> > > That numclients is getting HUGE, and the file server is sucking
> > > larger and larger amounts of memory.
> > >
> > > (Feel free to run xstat against afs[1-10].umr.edu; some are
> > > db-only, though.) Interestingly, I also see this:
> > >
> > > -1090519756 rx_bogusHost
> > >
> > > (Output from xstat_fs_test afs4 1)
> > >
> > > On a server that has been rebooted recently due to this problem,
> > > I saw 594 million on the bogusHost, and 293 thousand on the
> > > numClients. Somewhere there is a bad leak. Anyone else seen
> > > anything like this?
> > >
> > > We're contemplating re-enabling periodic (maybe monthly) server
> > > restarts at the moment, but would rather have a better fix.
> > >
> > > As far as the client bug - which appears to occur on several
> > > different platforms - basically they just hang. They don't time
> > > out and see the file server go down, or anything. Now - the
> > > instant I kill that file server, reboot it, or firewall it, the
> > > clients ALL break loose immediately. The problem is basically
> > > that all of the afs clients are completely hung and won't respond
> > > to much of anything. This means that a single afs server going
> > > down in this way negates all benefit of replicated volumes.
> > >
> > > I have never been able to reproduce this symptom by suspending,
> > > firewalling, etc., a file server; the clients always see it
> > > immediately.
> > >
> > > If someone can give me any ideas on how I might reproduce this
> > > failure symptom (i.e. dropped packets, whatever), I have a test
> > > cell that I will use to see about diagnosing the client and
> > > server, but at the moment, I do not have any way of reproducing
> > > the symptom. Whenever I've tried anything, the client always
> > > immediately sees the failure. What situation would cause the
> > > client to hang and not time out - there must be a particular
> > > location in the cache manager code where that situation can
> > > occur. (Or is it a case where it's semi-responding, but not
> > > enough to cause the client to break loose - if so, I wonder if
> > > there is some way to make the client more sensitive, with a
> > > greater tendency to drop the server?)
> > >
> > > -- Nathan
> > >
> > > ------------------------------------------------------------
> > > Nathan Neulinger EMail: nneul@umr.edu
> > > University of Missouri - Rolla Phone: (573) 341-4841
> > > Computing Services Fax: (573) 341-4216
> > > _______________________________________________
> > > OpenAFS-devel mailing list
> > > OpenAFS-devel@openafs.org
> > > https://lists.openafs.org/mailman/listinfo/openafs-devel
> > >