[OpenAFS] Re: Odd behavior during vos release

Thu, 10 Nov 2011 08:48:30 -0500 (EST)

I was eventually able to find the bad client. For some reason one
particular machine was taking almost 10 minutes to read a 2 MB file out
of the volume in question, even though the network between the fileserver
and that machine seemed to be okay.  I rebooted the client, and it now
seems happy again.

In case anyone else ends up in this situation: I found the bad client
in some rxdebug output I had saved during a hang. The only client still
doing anything looked like this:

Connection from host 129.2.163.45, port 7001, Cuid 9e59be0c/3ae9a2bc
   serial 3466,  natMTU 1444, security index 0, server conn
     call 0: # 78, state dally, mode: eof, flags: receive_done
     call 1: # 233, state active, mode: sending, flags: window_send receive_done, has_output_packets
     call 2: # 124, state dally, mode: eof, flags: receive_done
     call 3: # 0, state not initialized
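
For anyone who wants to pick such a connection out of a large saved rxdebug
dump, here is a rough Python sketch. It assumes the dump has the same layout
as the excerpt above ("Connection from host ..." followed by indented
"call N: ..." lines). The function name busy_connections and the SAMPLE text
are made up for illustration and are not part of OpenAFS.

```python
import re

# Patterns inferred from the rxdebug excerpt above; other rxdebug
# versions may format these lines differently (assumption).
CONN_RE = re.compile(r"^Connection from host (\S+), port (\d+)")
CALL_RE = re.compile(r"^\s*call (\d+): # (\d+), state (\w+)(?:, mode: (\w+))?")


def busy_connections(rxdebug_text, state="active"):
    """Return (host, port, call_slot, call_number, mode) for each call in `state`."""
    results = []
    host = port = None
    for line in rxdebug_text.splitlines():
        m = CONN_RE.match(line)
        if m:
            # Remember which connection the following call lines belong to.
            host, port = m.group(1), int(m.group(2))
            continue
        m = CALL_RE.match(line)
        if m and host is not None and m.group(3) == state:
            results.append(
                (host, port, int(m.group(1)), int(m.group(2)), m.group(4))
            )
    return results


# Sample input copied from the rxdebug output in this message.
SAMPLE = """\
Connection from host 129.2.163.45, port 7001, Cuid 9e59be0c/3ae9a2bc
   serial 3466,  natMTU 1444, security index 0, server conn
     call 0: # 78, state dally, mode: eof, flags: receive_done
     call 1: # 233, state active, mode: sending, flags: window_send receive_done, has_output_packets
     call 2: # 124, state dally, mode: eof, flags: receive_done
     call 3: # 0, state not initialized
"""
```

Calling busy_connections(SAMPLE) reports only call 1 from 129.2.163.45.
Connections whose calls are all in "dally" or "not initialized" are skipped,
so on a hung fileserver the list should be short.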

It was also the last entry in the fileserver log before the volume went offline:

Wed Nov  9 11:32:22 2011 [33] SRXAFS_FetchData, Fid = 1970897351.3208.3134057
Wed Nov  9 11:32:22 2011 [33] SRXAFS_FetchData, Fid = 1970897351.3208.3134057, Host 129.2.163.45:7001, Id 32766
Wed Nov  9 11:32:22 2011 [33] FetchData_RXStyle: Pos 0, Len 1048576
Wed Nov  9 11:32:22 2011 [33] FetchData_RXStyle: file size 3600423
...
Wed Nov  9 11:37:31 2011 [33] VOffline: Volume 1970897351 (s.common.readonly) is now offlineWed Nov  9 11:37:31 2011 [33]  (A volume utility is running.)Wed Nov  9 11:37:31 2011 [33]
Wed Nov  9 11:37:31 2011 [33] SRXAFS_FetchData returns 0

...followed by all of the hung clients getting freed up with VOFFLINE errors:

Wed Nov  9 11:37:31 2011 [6] SAFS_FetchStatus returns 106
Wed Nov  9 11:37:31 2011 [9] SAFS_FetchStatus returns 106
Wed Nov  9 11:37:31 2011 [96] SAFS_FetchStatus returns 106
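
The same check can be done against the fileserver log: pair each
SRXAFS_FetchData entry with its "returns" line on the same thread number and
look for long gaps. Below is a rough Python sketch, assuming the log has the
same line format as the excerpts above and that the timestamps are in an
English locale. The names slow_fetches and LOG are made up for illustration.

```python
import re
from datetime import datetime

# Line layout inferred from the log excerpts above (assumption):
#   "<ctime-style timestamp> [<thread>] <message>"
LINE_RE = re.compile(
    r"^(\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}) \[(\d+)\] (.*)$"
)
START_RE = re.compile(
    r"^SRXAFS_FetchData, Fid = ([\d.]+)(?:, Host ([\d.]+:\d+))?"
)
END_RE = re.compile(r"^SRXAFS_FetchData returns (-?\d+)")


def slow_fetches(log_text, min_seconds=60):
    """Return (thread, host, fid, seconds, code) for FetchData calls
    that took at least `min_seconds` to return."""
    pending = {}  # thread number -> info about the call in progress
    slow = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        # Collapse the double space that pads single-digit days.
        stamp = " ".join(m.group(1).split())
        when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        thread, msg = int(m.group(2)), m.group(3)

        start = START_RE.match(msg)
        if start:
            entry = pending.setdefault(
                thread, {"start": when, "fid": start.group(1), "host": None}
            )
            if start.group(2):
                entry["host"] = start.group(2)
            continue

        end = END_RE.match(msg)
        if end and thread in pending:
            entry = pending.pop(thread)
            seconds = int((when - entry["start"]).total_seconds())
            if seconds >= min_seconds:
                slow.append(
                    (thread, entry["host"], entry["fid"], seconds, int(end.group(1)))
                )
    return slow


# Sample input copied from the log lines in this message.
LOG = """\
Wed Nov  9 11:32:22 2011 [33] SRXAFS_FetchData, Fid = 1970897351.3208.3134057
Wed Nov  9 11:32:22 2011 [33] SRXAFS_FetchData, Fid = 1970897351.3208.3134057, Host 129.2.163.45:7001, Id 32766
Wed Nov  9 11:32:22 2011 [33] FetchData_RXStyle: Pos 0, Len 1048576
Wed Nov  9 11:37:31 2011 [33] SRXAFS_FetchData returns 0
"""
```

On the LOG sample, slow_fetches(LOG) flags thread 33 and host
129.2.163.45:7001 with a gap of 309 seconds (11:32:22 to 11:37:31). This
only works if the fileserver is logging at a debug level that records these
entries, as it was here.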

Kevin

On Wed, 9 Nov 2011, Kevin Hildebrand wrote:

>
> Excellent.  I'm glad that there's a known cause and a fix to boot.  This
> is a big help.  I'll track down the recalcitrant client anyway for the
> sake of completeness, and that will likely speed up the vos release.
> Though you are correct, I don't really care how long the release takes, as
> long as it's not blocking clients from accessing data.
>
> Thanks a bunch for your help!
>
> Kevin
>
> On Wed, 9 Nov 2011, Andrew Deason wrote:
>
>> On Wed, 9 Nov 2011 17:49:51 -0500 (EST)
>> Kevin Hildebrand <kevin@umd.edu> wrote:
>>
>>> For example:
>>>
>>> Connection from host 129.2.56.137, port 7001, Cuid a59e5fd1/37e0f99c
>>>    serial 266,  natMTU 1444, security index 0, server conn
>>>      call 0: # 220, state active, mode: error
>>
>> Okay. I don't think we expose the error code anywhere over the wire for
>> rxdebug. You can get the error code either by looking at a core of the
>> fileserver process and looking in the rx call structures, or by looking
>> at a packet trace at around the time this happens (you should see an rx
>> abort packet go by, which will have the abort code in it).
>>
>>> See /afs/glue.umd.edu/home/glue/k/e/kevin/pub/afs_debug/stacktrace.  You
>>> are correct, most threads are in VGetVolume_r or VOffline_r.
>>
>> Yes, this appears to be the issue I was talking about. The client
>> issuing the FetchData64 call in Thread 191/80 is holding a reference to
>> the volume, and is (presumably) not consuming the data very quickly. The
>> release will not continue and the other clients will not be able to be
>> serviced until it finishes that FetchData64 call. Knowing what the
>> client is requires examining a fileserver core. (Or you might be able to
>> deduce it from looking at network traffic, if you really wanted to)
>>
>> Anyway, you want this patch:
>> <http://git.openafs.org/?p=openafs.git;a=commitdiff_plain;h=2ad34a27105e591f40652e1a454ea7dc458686a1>
>> That will not make the release go any faster, but it will prevent the
>> other calls that occur at the same time from hanging. Instead, they
>> should fail over to other available RO sites.
>>
>> If you want the release to go faster, there is a set of patches that
>> allows you to specify a timeout for this situation, after which the
>> problematic client will get kicked off so the release can proceed. Those
>> changes are a bit more involved, though; I wouldn't bother with it
>> unless the release delays are a problem for you.
>>
>> --
>> Andrew Deason
>> adeason@sinenomine.net
>>