[OpenAFS] Odd behavior during vos release
Kevin Hildebrand
kevin@umd.edu
Wed, 9 Nov 2011 14:38:10 -0500 (EST)
We've been having unusual slowness and hangs at times on some of our
fileservers, and I think I have a handle on the sequence of events, if not
the cause. I could use some assistance in filling in the gaps so I can
see if we can fix things.
Right now, I have a volume that is heavily used by many clients and is
released frequently (as often as every ten minutes). This volume has
three read-only replicas and is about 200MB in size.
What I'm observing is that as soon as the vos release begins, one or more
of the read-only replicas start accumulating connections in the 'error'
state. FileLog shows that incoming FetchStatus RPCs to that replica are
not being answered. If this condition persists long enough, these
connections eventually fill up the thread pool and the fileserver stops
serving data to everything else.
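One way to confirm the accumulation described above would be to capture
rxdebug output at intervals during a release and count, per capture, the
lines that mention 'error'. The following is only a sketch under
assumptions: it expects the captures to be saved as text already, the
function names are hypothetical, and the sample strings are placeholders
rather than real rxdebug output.

```python
import re

# Whole-word match, so words such as "errors" are not counted.
ERROR_WORD = re.compile(r"\berror\b")


def count_error_lines(snapshot):
    """Count the lines in one captured snapshot that mention 'error'."""
    return sum(1 for line in snapshot.splitlines() if ERROR_WORD.search(line))


def error_trend(snapshots):
    """Return the per-snapshot counts, in the order the snapshots were taken."""
    return [count_error_lines(s) for s in snapshots]


def is_accumulating(trend):
    """True if the count never drops and ends higher than it started."""
    if len(trend) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(trend, trend[1:]))
    return never_drops and trend[-1] > trend[0]


# Placeholder captures (assumed format, for illustration only).
captures = [
    "conn A ok\nconn B error\n",
    "conn A error\nconn B error\nconn C ok\n",
]
print(error_trend(captures))  # prints [1, 2]
```

A count that keeps rising until the replica goes offline would match the
pattern described above; a flat count would point elsewhere.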
At some point, up to five minutes later, as the release proceeds, the
replica in question gets marked offline by the release process. At that
point, all of the stuck RPCs get 'FetchStatus returns 106' (VOFFLINE),
the connection pool clears, and life on the fileserver returns to
normal.
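To measure how long a stall lasted, one could pull out the first and last
log lines that carry the 'FetchStatus returns 106' message quoted above.
This is a sketch only: it assumes the message appears verbatim on a single
line, and the helper name and sample lines are hypothetical, not real
FileLog excerpts.

```python
# 106 is VOFFLINE, per the message quoted above.
MARKER = "FetchStatus returns 106"


def voffline_window(log_lines):
    """Return (count, first_line, last_line) for lines containing MARKER.

    first_line and last_line are None when nothing matches.
    """
    matches = [line.rstrip("\n") for line in log_lines if MARKER in line]
    if not matches:
        return 0, None, None
    return len(matches), matches[0], matches[-1]


# Placeholder log lines (assumed format, for illustration only).
sample = [
    "t1 some other message\n",
    "t2 FetchStatus returns 106\n",
    "t3 FetchStatus returns 106\n",
]
count, first, last = voffline_window(sample)
print(count, first, last, sep=" | ")
```

Since an open file yields lines when iterated, the same function can be
passed a file object for a saved copy of the log.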
What I can't figure out is what's going on during the time the RPCs are
hung, and why the connections show 'error'. (How does one determine what
the error condition is, when viewing rxdebug output?)
Why would an RO replica be hung during a vos release?
Any clues on where to look next would be appreciated.
Thanks,
Kevin
--
Kevin Hildebrand
University of Maryland, College Park
Office of Information Technology