[OpenAFS-devel] ARGH... afs outage... and same stinking client bug that it doesn't ever see it...
Neulinger, Nathan
nneul@umr.edu
Tue, 2 Oct 2001 13:25:26 -0500
Just a thought - does the file server maintain any state information for
authenticated users?
Could this be a case of NUM-HOSTS * NUM-USERS * NUM-SEPARATE-PAGS = LARGE
NUMBER?
I see all the code that increments/decrements CEs in viced/host.c.
Would be nice to have a kdump equivalent for the file server process.
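Something as simple as a signal-triggered dump of those counters might
be enough. A minimal sketch (hypothetical, not actual viced code - the
extern names just stand in for whatever host.c really exports):

    #include <signal.h>
    #include <stdio.h>

    extern int host_NumHostEntries, host_NumClients;  /* assumed names */

    static volatile sig_atomic_t dump_requested = 0;

    static void RequestDump(int sig)
    {
        dump_requested = 1;    /* do the real work outside the handler */
    }

    /* called periodically from the fileserver's main loop */
    static void MaybeDumpHostState(void)
    {
        FILE *f;
        if (!dump_requested)
            return;
        dump_requested = 0;
        f = fopen("/usr/afs/logs/hosts.dump", "w");   /* example path */
        if (f) {
            fprintf(f, "hosts=%d clients=%d\n",
                    host_NumHostEntries, host_NumClients);
            fclose(f);
        }
    }

    /* startup would register: signal(SIGUSR1, RequestDump); */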
-- Nathan
> -----Original Message-----
> From: Neulinger, Nathan [mailto:nneul@umr.edu]
> Sent: Tuesday, October 02, 2001 1:04 PM
> To: 'openafs-devel@openafs.org'
> Subject: [OpenAFS-devel] ARGH... afs outage... and same stinking
> client bug that it doesn't ever see it...
>
>
> We had another afs outage here, and once again, the same client bug is
> causing the clients to never see the failure.
>
> This was the same fileserver bug that I reported a few weeks ago,
> where it accumulates hundreds of thousands to millions of host/client
> entries:
>
> 37653 host_NumHostEntries
> 74 host_HostBlocks
> 1533 host_NonDeletedHosts
> 1508 host_HostsInSameNetOrSubnet
> 3 host_HostsInDiffSubnet
> 22 host_HostsInDiffNetwork
> 1172202 host_NumClients
> 16058 host_ClientBlocks
>
> That numClients is getting HUGE, and the file server is consuming
> larger and larger amounts of memory.
>
> (Feel free to run xstat against afs[1-10].umr.edu; some are db-only
> though.) Interestingly, I also see this:
>
> -1090519756 rx_bogusHost
>
> (Output from xstat_fs_test afs4 1)
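> 
> That negative number looks like a signed 32-bit counter that ran past
> 2^31; reinterpreting the printed value as unsigned (my assumption
> about how xstat formats it):
> 
>     #include <stdio.h>
> 
>     int main(void)
>     {
>         int raw = -1090519756;       /* as printed by xstat_fs_test */
>         printf("%u\n", (unsigned int)raw);  /* 3204447540, ~3.2 billion */
>         return 0;
>     }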
>
> On a server that had been rebooted recently because of this problem,
> I saw 594 million on the bogusHost and 293 thousand on the
> numClients. Somewhere there is a bad leak. Has anyone else seen
> anything like this?
>
> We're contemplating re-enabling periodic (maybe monthly)
> server restarts at
> the moment, but would rather have a better fix.
>
> As for the client bug - which appears to occur on several different
> platforms - the clients just hang. They never time out and notice the
> file server going down. Yet the instant I kill/reboot/firewall that
> file server, the clients ALL break loose immediately. In the
> meantime, all of the afs clients are completely hung and won't
> respond to much of anything, so a single afs server going down in
> this way negates all the benefit of replicated volumes.
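> 
> My working model of the hang (pure speculation, not actual cache
> manager code) is that the client only marks a server down after some
> interval with no packets at all, so a server that keeps sending
> *something* - acks, keepalives - without ever answering the RPC never
> trips the timeout:
> 
>     #include <stdio.h>
> 
>     #define DEAD_TIME 50  /* assumed dead-server threshold, seconds */
> 
>     /* Toy model: "down" only if nothing arrived for DEAD_TIME. */
>     static int server_down(int secs_since_last_packet)
>     {
>         return secs_since_last_packet > DEAD_TIME;
>     }
> 
>     int main(void)
>     {
>         /* half-dead server still acking every 10s: never down */
>         printf("%d\n", server_down(10));  /* 0 -> clients stay hung */
>         /* killed/firewalled server, total silence: detected fast */
>         printf("%d\n", server_down(60));  /* 1 -> clients break loose */
>         return 0;
>     }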
>
> I have never been able to reproduce this symptom by
> suspending/firewalling/etc. a file server; in those tests the clients
> all see the failure immediately.
>
> If someone can give me any ideas on how I might reproduce this
> failure symptom (i.e. dropped packets, whatever), I have a test cell
> that I will use to diagnose the client and the server, but at the
> moment I have no way of reproducing it; whenever I've tried anything,
> the client always sees the failure immediately. What situation would
> cause the client to hang and not time out? There must be a particular
> place in the cache manager code where that can occur. (Or is it a
> case where the server is semi-responding - responding just enough to
> keep the client from breaking loose? If so, I wonder whether there is
> a way to make the client more sensitive, with a greater tendency to
> drop the server.)
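> 
> One thing I plan to try in the test cell: put a lossy UDP relay in
> front of the fileserver so it looks semi-responsive rather than dead.
> A rough, untested sketch (the port and loss rate are guesses, and it
> only handles one client at a time):
> 
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>     #include <time.h>
>     #include <unistd.h>
>     #include <sys/types.h>
>     #include <sys/socket.h>
>     #include <netinet/in.h>
>     #include <arpa/inet.h>
> 
>     #define AFS_PORT 7000     /* fileserver's Rx/UDP port */
>     #define LOSS_PCT 40       /* drop 40% of packets, a guess */
> 
>     int main(int argc, char **argv)
>     {
>         int s = socket(AF_INET, SOCK_DGRAM, 0);
>         struct sockaddr_in local, server, client, from;
>         socklen_t fromlen;
>         char buf[65536];
>         ssize_t n;
> 
>         if (argc != 2) {
>             fprintf(stderr, "usage: %s <fileserver-ip>\n", argv[0]);
>             return 1;
>         }
> 
>         memset(&local, 0, sizeof(local));
>         local.sin_family = AF_INET;
>         local.sin_addr.s_addr = INADDR_ANY;
>         local.sin_port = htons(AFS_PORT); /* impersonate a fileserver */
>         bind(s, (struct sockaddr *)&local, sizeof(local));
> 
>         memset(&server, 0, sizeof(server));
>         server.sin_family = AF_INET;
>         server.sin_addr.s_addr = inet_addr(argv[1]);
>         server.sin_port = htons(AFS_PORT);
> 
>         memset(&client, 0, sizeof(client));
>         srand(time(NULL));
> 
>         for (;;) {
>             fromlen = sizeof(from);
>             n = recvfrom(s, buf, sizeof(buf), 0,
>                          (struct sockaddr *)&from, &fromlen);
>             if (n < 0)
>                 continue;
>             if (rand() % 100 < LOSS_PCT)
>                 continue;                 /* simulated packet loss */
>             if (from.sin_addr.s_addr == server.sin_addr.s_addr) {
>                 if (client.sin_port)      /* server -> client */
>                     sendto(s, buf, n, 0,
>                            (struct sockaddr *)&client, sizeof(client));
>             } else {
>                 client = from;            /* client -> server */
>                 sendto(s, buf, n, 0,
>                        (struct sockaddr *)&server, sizeof(server));
>             }
>         }
>     }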
>
> -- Nathan
>
> ------------------------------------------------------------
> Nathan Neulinger                       EMail: nneul@umr.edu
> University of Missouri - Rolla         Phone: (573) 341-4841
> Computing Services                     Fax:   (573) 341-4216
> _______________________________________________
> OpenAFS-devel mailing list
> OpenAFS-devel@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-devel
>