[OpenAFS] Re: Investigating 'calls waiting' from rxdebug

drosih@rpi.edu drosih@rpi.edu
Fri, 16 Aug 2013 15:36:17 -0400


On Fri, 16 Aug 2013 12:15:25 EDT Andrew Deason wrote:

> If those show very little, though, you probably have a thread actually
> hanging on something else, so you won't see a lot of activity. (If
> you show very little disk, net, and cpu usage at the same time, that
> seems pretty likely.) In that case you'd need to look at a stack trace
> for the fileserver process, or ideally capture a core to be examined
> later.

Dan's message looks very useful, and also makes me feel good because
it implies that I was making some good guesses as I tried to pin down
this problem.  I did try to turn up logging at one point, and here
are all the log entries which came up in FileLog:

    Thu Aug 15 02:34:58 2013 Set Debug On level = 1
    Thu Aug 15 02:35:08 2013 [0] Set Debug On level = 5
    Thu Aug 15 02:35:18 2013 [0] Reset Debug levels to 0

That's it.  Three entries.

Now by the time I tried that it was very late (2am, obviously), so
it's vaguely *possible* that all the workstations which were doing
I/O were already in a call-waiting state.  But my guess is that we
had some thread which really was hanging on something else.

I did also try doing some tcpdumps and summarizing that traffic,
but nothing remarkable showed up.  However, earlier today I learned
that the way I did that might have generated misleading results
(for reasons I won't bore you with right now).  But based on those
tcpdumps I doubt we were getting hammered with AFS traffic,
especially not for such a long stretch of time in the middle of
the summer.

I'll also say that at one point I thought the problem might have
been that we had too many AFS volumes on one of the partitions
on the "calls-waiting" server, so I started doing 'vos move's to
move AFS volumes to a different server.  None of those vos moves
ran into any lags at all, even while the calls-waiting counter
was very high.
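
For next time, here is a rough sketch of how the calls-waiting
counter could be sampled from a script instead of read by eye.  It
assumes rxdebug prints lines of the form "N calls waiting for a
thread" and "N threads are idle", and that it accepts the server and
port as positional arguments.  The function names and the hostname in
the usage note are made up for illustration.

```python
import re
import subprocess

# Assumed line formats in `rxdebug <server> <port>` output:
#   "12 calls waiting for a thread"
#   "3 threads are idle"
WAITING_RE = re.compile(r"^\s*(\d+) calls waiting for a thread", re.MULTILINE)
IDLE_RE = re.compile(r"^\s*(\d+) threads are idle", re.MULTILINE)


def parse_rxdebug(text):
    """Return (calls_waiting, idle_threads); a field is None if not found."""
    waiting = WAITING_RE.search(text)
    idle = IDLE_RE.search(text)
    return (
        int(waiting.group(1)) if waiting else None,
        int(idle.group(1)) if idle else None,
    )


def sample(server, port=7000):
    """Run rxdebug once against the fileserver port and parse its output."""
    result = subprocess.run(
        ["rxdebug", server, str(port)],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_rxdebug(result.stdout)
```

Calling something like sample("fs1.example.edu") from cron every
minute and appending the result to a file would give a timeline to
line up against disk, net, and cpu usage.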

Thanks for the answers.  These will be helpful if the problem
shows up again, and I suspect it will.  And that will probably be
the next time I try to take a vacation day!

[okay, let's see how badly webmail mangles THIS message...]

-- 
Garance Alistair Drosehn                =     drosih@rpi.edu
Senior Systems Programmer               or   gad@FreeBSD.org
Rensselaer Polytechnic Institute;             Troy, NY;  USA