[OpenAFS] Re: Odd behavior during vos release

Wed, 9 Nov 2011 17:54:56 -0600

On Wed, 9 Nov 2011 17:49:51 -0500 (EST)
Kevin Hildebrand <kevin@umd.edu> wrote:

> For example:
> 
> Connection from host 129.2.56.137, port 7001, Cuid a59e5fd1/37e0f99c
>    serial 266,  natMTU 1444, security index 0, server conn
>      call 0: # 220, state active, mode: error

Okay. I don't think we expose the error code anywhere over the wire for
rxdebug. You can get the error code either by examining a core of the
fileserver process and looking in the rx call structures, or by
capturing a packet trace around the time this happens (you should see
an rx abort packet go by, which will have the abort code in it).
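
If you go the packet trace route, something like the following rough
Python sketch could pull the abort code out of a captured UDP payload.
It assumes the usual rx wire layout as I understand it: a 28-byte
header with the packet type in the byte at offset 20, type value 4 for
an abort, and the abort code as a signed 32-bit big-endian integer
right after the header. The function name rx_abort_code is made up for
illustration; check the constants against rx.h for your version before
trusting it.

```python
import struct
from typing import Optional

# Assumed values; verify against rx.h in your OpenAFS tree.
RX_HEADER_SIZE = 28        # fixed-size rx header, in bytes
RX_PACKET_TYPE_ABORT = 4   # packet type value for an abort packet


def rx_abort_code(udp_payload: bytes) -> Optional[int]:
    """Return the abort code if udp_payload is an rx abort packet, else None.

    Hypothetical helper: udp_payload is the raw UDP payload of one
    captured packet (for example, extracted from a pcap file).
    """
    if len(udp_payload) < RX_HEADER_SIZE + 4:
        return None  # too short to hold a header plus an abort code

    # Five 32-bit fields (epoch, cid, call number, seq, serial), four
    # one-byte fields (type, flags, userStatus, securityIndex), then
    # two 16-bit fields.
    fields = struct.unpack("!IIIIIBBBBHH", udp_payload[:RX_HEADER_SIZE])
    if fields[5] != RX_PACKET_TYPE_ABORT:
        return None

    # The abort code follows the header in network byte order.
    (code,) = struct.unpack(
        "!i", udp_payload[RX_HEADER_SIZE:RX_HEADER_SIZE + 4]
    )
    return code
```

Feed it each UDP payload exchanged with the fileserver port around the
time the call goes into error; any non-None result is an abort code.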

> See /afs/glue.umd.edu/home/glue/k/e/kevin/pub/afs_debug/stacktrace.  You 
> are correct, most threads are in VGetVolume_r or VOffline_r.

Yes, this appears to be the issue I was talking about. The client
issuing the FetchData64 call in Thread 191/80 is holding a reference to
the volume, and is (presumably) not consuming the data very quickly. The
release will not continue and the other clients will not be able to be
serviced until it finishes that FetchData64 call. Identifying that
client requires examining a fileserver core (or you might be able to
deduce it from network traffic, if you really wanted to).

Anyway, you want this patch:
<http://git.openafs.org/?p=openafs.git;a=commitdiff_plain;h=2ad34a27105e591f40652e1a454ea7dc458686a1>
That will not make the release go any faster, but it will prevent the
other calls that occur at the same time from hanging. Instead, they
should fail over to other available RO sites.

If you want the release to go faster, there is a set of patches that
allows you to specify a timeout for this situation, after which the
problematic client will get kicked off so the release can proceed. Those
changes are a bit more involved, though; I wouldn't bother with them
unless the release delays are a problem for you.

-- 
Andrew Deason
adeason@sinenomine.net