[OpenAFS] Re: Odd behavior during vos release

Andrew Deason adeason@sinenomine.net
Wed, 9 Nov 2011 15:53:12 -0600

On Wed, 9 Nov 2011 14:38:10 -0500 (EST)
Kevin Hildebrand <kevin@umd.edu> wrote:

> What I'm observing is that as soon as the vos release begins, one or
> more of the readonly replicas start accumulating connections in the
> 'error' state.

The connection has an error, or individual calls? I'm assuming you are
seeing this via rxdebug; do you see

    Connection from host x.x.x.x, port x, Cuid x, error x

or do you see

    call x: # x, state active, mode: error

Or better yet, just give specifically what you see :)
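
If the output is too long to read by eye, here is a rough sketch in Python
that tallies the two cases. The line shapes are assumptions based only on
the two examples above, and classify_errors is a name I made up for
illustration; real rxdebug output may need the patterns adjusted.

```python
import re

# Patterns assumed from the two example lines above; adjust to real output.
CONN_ERROR = re.compile(r"^\s*Connection from host .*\berror\s+(-?\d+)")
CALL_ERROR = re.compile(r"^\s*call \d+:.*\bmode: error\b")


def classify_errors(rxdebug_text):
    """Count connection-level and call-level errors in rxdebug output."""
    counts = {"connection": 0, "call": 0}
    for line in rxdebug_text.splitlines():
        if CONN_ERROR.match(line):
            counts["connection"] += 1
        elif CALL_ERROR.match(line):
            counts["call"] += 1
    return counts
```

Feed it the saved rxdebug output as one string, e.g.
classify_errors(open("/tmp/rxdebug.out").read()), where the path is just a
placeholder.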

> FileLog shows incoming FetchStatus RPCs to that replica are not being
> answered.  If this condition occurs long enough, all of these
> connections eventually fill up the thread pool and the fileserver
> stops serving data to everything else.
> At some point, up to five minutes later, as the release proceeds, the
> replica in question gets marked offline by the release process.  At
> this time, all of the stuck RPCs get 'FetchStatus returns 106'
> (VOFFLINE), at which point the connection pool clears, and life on the
> fileserver returns to normal.

There is a known situation in which a client can hold a reference to the
volume for longish periods of time, which prevents the volume from going
offline and causes some responses to hang and build up. There are some
related fixes for it, though; what versions are in play here?

> What I can't figure out is what's going on during the time the RPCs
> are hung, and why the connections show 'error'.  (How does one
> determine what the error condition is, when viewing rxdebug output?)
> Why would an RO replica be hung during a vos release?

You can see where the threads are hanging by getting a backtrace of all
of the threads. You can run 'pstack <fileserver pid>' to get this, or
generate a core and examine with a debugger. If you're on Linux, run
'gcore <fileserver pid>' and run 'gdb <fileserver binary> <core>' then
do something like:

(gdb) set height 0
(gdb) set width 0
(gdb) set logging file /tmp/some/file
(gdb) set logging on
(gdb) thread apply all bt
(gdb) quit

And put that output up somewhere. There might be a little sensitive
information in that (filenames would be the most likely thing), but you
should be able to tell whether or not you care by just looking at it. If
the issue I mention above is relevant, if I recall correctly you'll see
several threads inside VGetVolume_r or similar, one of which is
inside VOffline_r.
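
To check for that pattern without reading every backtrace, something like
the following Python sketch could help. It assumes the usual layout of
'thread apply all bt' output (a "Thread N (...)" header followed by frame
lines starting with "#"); threads_in_function is a hypothetical helper
name, not part of any OpenAFS or gdb tooling.

```python
import re

# Header line that gdb prints before each thread's frames (assumed layout).
THREAD_HEADER = re.compile(r"^Thread (\d+) ")


def threads_in_function(backtrace_text, function_name):
    """Return gdb thread numbers with at least one frame in function_name."""
    # Match the exact function name followed by its argument list.
    frame = re.compile(r"\b" + re.escape(function_name) + r"\s*\(")
    hits = []
    current = None
    for line in backtrace_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = int(header.group(1))
        elif current is not None and line.startswith("#") and frame.search(line):
            if current not in hits:
                hits.append(current)
    return hits
```

Run it over the gdb log once for "VGetVolume_r" and once for "VOffline_r";
several threads in the first and one in the second would be consistent
with the issue described above, though the exact function names may differ
between versions.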

Andrew Deason