[OpenAFS] Odd behavior during vos release

Kevin Hildebrand kevin@umd.edu
Wed, 9 Nov 2011 14:38:10 -0500 (EST)


We've been seeing intermittent slowness and hangs on some of our
fileservers. I think I have a handle on the sequence of events, if not
the cause, and I could use some help filling in the gaps so we can fix
things.

Right now, I have a volume that is heavily used by many clients and is
released frequently (as often as every ten minutes).  This volume has
three read-only replicas and is about 200MB in size.

What I'm observing is that as soon as the vos release begins, one or more
of the read-only replicas start accumulating connections in the 'error'
state.  FileLog shows that incoming FetchStatus RPCs to that replica are
not being answered.  If this condition persists long enough, these
connections eventually fill up the thread pool and the fileserver stops
serving data to everything else.

At some point, up to five minutes later, as the release proceeds, the
replica in question gets marked offline by the release process.  At that
point, all of the stuck RPCs get 'FetchStatus returns 106' (VOFFLINE),
the connection pool clears, and life on the fileserver returns to normal.

What I can't figure out is what's going on during the time the RPCs are 
hung, and why the connections show 'error'.  (How does one determine what 
the error condition is, when viewing rxdebug output?)
Why would an RO replica be hung during a vos release?
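
To make the rxdebug question concrete, here is a minimal sketch of how
one might tally the error values in saved rxdebug output.  The format of
the "Connection from host" line and the trailing ", error N" field are
assumptions on my part and may differ between builds, so the pattern may
need adjusting.  The names tally_errors and describe are hypothetical
helpers, and the lookup table is deliberately partial.

```python
import re
from collections import Counter

# ASSUMPTION: a connection line in rxdebug output looks roughly like
#   "Connection from host 10.0.0.1, port 7001, Cuid abc/def, error -1"
# Adjust the pattern if your rxdebug prints something different.
CONN_ERROR = re.compile(
    r"^Connection from host (\S+), port (\d+),.*\berror (-?\d+)"
)

# Partial lookup table: 106 is the VOFFLINE code seen in FileLog; the
# others are common Rx and volume codes.  Anything missing is reported
# as unknown rather than guessed.
KNOWN_CODES = {
    -1: "RX_CALL_DEAD",
    -3: "RX_CALL_TIMEOUT",
    106: "VOFFLINE",
    110: "VBUSY",
}


def tally_errors(rxdebug_text):
    """Count connections per error code in saved rxdebug output."""
    counts = Counter()
    for line in rxdebug_text.splitlines():
        match = CONN_ERROR.match(line.strip())
        if match:
            counts[int(match.group(3))] += 1
    return counts


def describe(code):
    """Return a symbolic name for a code, or mark it as unknown."""
    return KNOWN_CODES.get(code, "unknown (%d)" % code)
```

Running this against snapshots taken before, during, and after a release
would show whether the stuck connections all carry the same error value.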

Any clues on where to look next would be appreciated.

Thanks,
Kevin

--
Kevin Hildebrand
University of Maryland, College Park
Office of Information Technology