[OpenAFS] Re: Odd behavior during vos release

Kevin Hildebrand kevin@umd.edu
Wed, 9 Nov 2011 19:48:19 -0500 (EST)


Excellent.  I'm glad that there's a known cause and a fix to boot.  This 
is a big help.  I'll track down the recalcitrant client anyway for the 
sake of completeness, and that will likely speed up the vos release. 
You are correct, though: I don't really care how long the release takes, as
long as it's not blocking clients from accessing data.

Thanks a bunch for your help!

Kevin

On Wed, 9 Nov 2011, Andrew Deason wrote:

> On Wed, 9 Nov 2011 17:49:51 -0500 (EST)
> Kevin Hildebrand <kevin@umd.edu> wrote:
>
>> For example:
>>
>> Connection from host 129.2.56.137, port 7001, Cuid a59e5fd1/37e0f99c
>>    serial 266,  natMTU 1444, security index 0, server conn
>>      call 0: # 220, state active, mode: error
>
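
As a rough illustration of how output like the above could be scanned for
stuck calls, here is a minimal Python sketch. It assumes the rxdebug text
has already been captured into a string and that each call line follows the
"call N: # M, state S, mode: X" shape shown in the excerpt. The function
name and the regular expressions are hypothetical, and real rxdebug output
may contain line variants that this does not handle.

```python
import re

# Patterns modeled only on the three-line excerpt quoted in this message.
CONN_RE = re.compile(
    r"Connection from host (?P<host>\S+), port (?P<port>\d+), Cuid (?P<cuid>\S+)"
)
CALL_RE = re.compile(
    r"call (?P<slot>\d+): # (?P<number>\d+), state (?P<state>\w+), mode: (?P<mode>\w+)"
)

SAMPLE = """\
Connection from host 129.2.56.137, port 7001, Cuid a59e5fd1/37e0f99c
   serial 266,  natMTU 1444, security index 0, server conn
     call 0: # 220, state active, mode: error
"""


def calls_in_error(rxdebug_text):
    """Return one dict per call whose mode is 'error' (hypothetical helper)."""
    results = []
    current = None  # (host, port, cuid) of the connection being read
    for line in rxdebug_text.splitlines():
        conn = CONN_RE.search(line)
        if conn:
            current = (conn.group("host"), int(conn.group("port")), conn.group("cuid"))
            continue
        call = CALL_RE.search(line)
        if call and current is not None and call.group("mode") == "error":
            results.append({
                "host": current[0],
                "port": current[1],
                "cuid": current[2],
                "call_slot": int(call.group("slot")),
                "call_number": int(call.group("number")),
                "state": call.group("state"),
            })
    return results


if __name__ == "__main__":
    for entry in calls_in_error(SAMPLE):
        print(entry)
```

This only identifies which connections have calls in error mode; it cannot
recover the error code itself, which rxdebug does not report.
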
> Okay. I don't think we expose the error code anywhere over the wire for
> rxdebug. You can get the error code either by looking at a core of the
> fileserver process and looking in the rx call structures, or by looking
> at a packet trace at around the time this happens (you should see an rx
> abort packet go by, which will have the abort code in it).
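
For the packet-trace route, the following Python sketch shows one way the
abort code might be pulled out of a captured UDP payload. It is not taken
from this thread. It assumes the usual Rx wire layout: a 28-byte header
whose first five fields are 32-bit big-endian integers (epoch, connection
ID, call number, sequence, serial) followed by a one-byte packet type,
with type 4 meaning abort and the abort code carried as the first 32-bit
word after the header. Treat those offsets as assumptions to check against
the Rx sources or a Wireshark dissection. The function name is
hypothetical.

```python
import struct

RX_HEADER_LEN = 28        # assumed size of the Rx packet header
RX_PACKET_TYPE_ABORT = 4  # assumed value of the abort packet type


def parse_rx_abort(payload):
    """Return abort details from a UDP payload, or None if it is not an abort.

    Hypothetical helper; the header layout is an assumption, see above.
    """
    if len(payload) < RX_HEADER_LEN + 4:
        return None
    # First 21 bytes: five 32-bit fields, then the one-byte packet type.
    epoch, cid, call_number, seq, serial, ptype = struct.unpack(
        "!IIIIIB", payload[:21]
    )
    if ptype != RX_PACKET_TYPE_ABORT:
        return None
    # Abort code: signed 32-bit integer immediately after the header.
    (abort_code,) = struct.unpack(
        "!i", payload[RX_HEADER_LEN:RX_HEADER_LEN + 4]
    )
    return {"cid": cid, "call_number": call_number, "abort_code": abort_code}
```

The call number in the result can be compared with the "# 220" style call
number that rxdebug prints, to match an abort to the call seen in error
mode.
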
>
>> See /afs/glue.umd.edu/home/glue/k/e/kevin/pub/afs_debug/stacktrace.  You
>> are correct, most threads are in VGetVolume_r or VOffline_r.
>
> Yes, this appears to be the issue I was talking about. The client
> issuing the FetchData64 call in Thread 191/80 is holding a reference to
> the volume, and is (presumably) not consuming the data very quickly. The
> release will not continue and the other clients will not be able to be
> serviced until it finishes that FetchData64 call. Identifying that
> client requires examining a fileserver core. (Or you might be able to
> deduce it from network traffic, if you really wanted to.)
>
> Anyway, you want this patch:
> <http://git.openafs.org/?p=openafs.git;a=commitdiff_plain;h=2ad34a27105e591f40652e1a454ea7dc458686a1>
> That will not make the release go any faster, but it will prevent the
> other calls that occur at the same time from hanging. Instead, they
> should fail over to other available RO sites.
>
> If you want the release to go faster, there is a set of patches that
> allows you to specify a timeout for this situation, after which the
> problematic client will get kicked off so the release can proceed. Those
> changes are a bit more involved, though; I wouldn't bother with it
> unless the release delays are a problem for you.
>
> -- 
> Andrew Deason
> adeason@sinenomine.net