[OpenAFS] Re: Odd behavior during vos release

Kevin Hildebrand kevin@umd.edu
Wed, 9 Nov 2011 17:49:51 -0500 (EST)


On Wed, 9 Nov 2011, Andrew Deason wrote:

> On Wed, 9 Nov 2011 14:38:10 -0500 (EST)
> Kevin Hildebrand <kevin@umd.edu> wrote:
>
>> What I'm observing is that as soon as the vos release begins, one or
>> more of the readonly replicas start accumulating connections in the
>> 'error' state.
>
> The connection has an error, or individual calls? I'm assuming you are
> seeing this via rxdebug; do you see
>
> Connection from host x.x.x.x, port x, Cuid x, error x
>
> or do you see
>
>    call x: # x, state active, mode: error
>
> Or better yet, just give specifically what you see :)
>

For example:

Connection from host 129.2.56.137, port 7001, Cuid a59e5fd1/37e0f99c
   serial 266,  natMTU 1444, security index 0, server conn
     call 0: # 220, state active, mode: error
     call 1: # 21, state dally, mode: eof, flags: receive_done
     call 2: # 0, state not initialized
     call 3: # 0, state not initialized
Connection from host 128.8.163.75, port 7001, Cuid 96a0fa27/38de8e1c
   serial 86,  natMTU 1444, security index 0, server conn
     call 0: # 26, state active, mode: error
     call 1: # 23, state not initialized
     call 2: # 0, state not initialized
     call 3: # 0, state not initialized
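(For anyone reading along: the dumps above look like output from `rxdebug <fileserver> 7000` — `-allconnections`/`-nodally` are the usual flags, but check your rxdebug. As a hedged sketch, the stuck calls can be tallied per connection with a small awk pipeline; the sample function below stands in for a live server and is not real captured data beyond the lines quoted above.)

```shell
# Hedged sketch: count calls stuck in "mode: error" per connection from
# rxdebug-style output. rxdebug_sample stands in for a live
# `rxdebug <fileserver> 7000 -allconnections` run.
rxdebug_sample() {
cat <<'EOF'
Connection from host 129.2.56.137, port 7001, Cuid a59e5fd1/37e0f99c
     call 0: # 220, state active, mode: error
     call 1: # 21, state dally, mode: eof, flags: receive_done
Connection from host 128.8.163.75, port 7001, Cuid 96a0fa27/38de8e1c
     call 0: # 26, state active, mode: error
EOF
}

# For each "Connection from host" stanza, count calls in error mode.
rxdebug_sample | awk '
  /^Connection from host/ { host = $4; sub(/,$/, "", host) }
  /mode: error/           { errs[host]++ }
  END { for (h in errs) printf "%s %d\n", h, errs[h] }
'
```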

>> FileLog shows incoming FetchStatus RPCs to that replica are not being
>> answered.  If this condition occurs long enough, all of these
>> connections eventually fill up the thread pool and the fileserver
>> stops serving data to everything else.
>>
>> At some point, up to five minutes later, as the release proceeds, the
>> replica in question gets marked offline by the release process.  At
>> this time, all of the stuck RPCs get 'FetchStatus returns 106'
>> (VOFFLINE), at which point the connection pool clears, and life on the
>> fileserver returns to normal.
>
> There is a known situation in which a client can hold a reference to the
> volume for longish periods of time, which prevents the volume from going
> offline and causes some responses to hang and build up. But there are
> some related fixes for it; what versions are in play here?
>

1.4.14, clients and servers.
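(As an aside on reading codes like the 106 mentioned above: OpenAFS ships a translate_et utility for this, or you can keep a small lookup handy. The sketch below is illustrative; the numeric values follow the volume-package constants in the OpenAFS headers, so verify them against your source tree.)

```shell
# Hedged sketch: map volume-package error numbers seen in FileLog
# (e.g. "FetchStatus returns 106") to symbolic names. Values assumed
# from OpenAFS volume headers; double-check with `translate_et <num>`.
vol_errname() {
  case "$1" in
    103) echo "VNOVOL (volume not attached on this server)" ;;
    106) echo "VOFFLINE (volume is off line)" ;;
    110) echo "VBUSY (volume temporarily busy)" ;;
    *)   echo "unknown ($1)" ;;
  esac
}

vol_errname 106
```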

>> What I can't figure out is what's going on during the time the RPCs
>> are hung, and why the connections show 'error'.  (How does one
>> determine what the error condition is, when viewing rxdebug output?)
>> Why would an RO replica be hung during a vos release?
>
> You can see where the threads are hanging by getting a backtrace of all
> of the threads. You can run 'pstack <fileserver pid>' to get this, or
> generate a core and examine with a debugger. If you're on Linux, run
> 'gcore <fileserver pid>' and run 'gdb <fileserver binary> <core>' then
> do something like:
>
> (gdb) set height 0
> (gdb) set width 0
> (gdb) set logging file /tmp/some/file
> (gdb) set logging on
> (gdb) thread apply all bt
> (gdb) quit
>
> And put that output up somewhere. There might be a little sensitive
> information in that (filenames would be the most likely thing), but you
> should be able to tell whether or not you care by just looking at it. If
> the issue I mention above is relevant, if I recall correctly you'll see
> several threads inside VGetVolume_r or similar, one of which will be
> inside VOffline_r.
>

See /afs/glue.umd.edu/home/glue/k/e/kevin/pub/afs_debug/stacktrace.  You 
are correct, most threads are in VGetVolume_r or VOffline_r.

And regarding Derrick's request for the timed vos release, I'll tackle
that tomorrow morning if it's still needed.

Thanks,
Kevin


> -- 
> Andrew Deason
> adeason@sinenomine.net
>