[OpenAFS] Re: Afs User volume servers in VM's

Booker Bense bbense@slac.stanford.edu
Wed, 26 Oct 2011 10:32:15 -0700 (PDT)


On Wed, 26 Oct 2011, Andrew Deason wrote:

> On Wed, 26 Oct 2011 18:41:15 +0200
> Stephan Wiesand <stephan.wiesand@desy.de> wrote:
>
>
>> Booker and me would probably be ok with errors being returned upon
>> access to a single volume that's being overwhelmed with I/O requests -
>> if it just wouldn't make the fileserver as a whole grind to a halt and
>> not service any request any more.
>
> Well, see, it depends on _what_ is causing it to do that, as Jeffrey
> said. If the threads are hanging on a lock somewhere in the host package
> or Rx or something, this won't help a whole lot since we still have to
> go through those layers and we'll still hang on those locks (same thing
> for chewing up CPU, or moving memory around, etc). In fact, we'll do so
> even more, since we (eventually) have to go through all that at least
> twice for the VBUSY case.
>

The symptom we see is thread exhaustion due to write callbacks
from many clients for a single volume[1]. The problem is insidious
because the failure isn't gradual: everything works just fine until
you hit a tipping point in the number of batch jobs.
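
To illustrate why this might look like a cliff rather than a slope,
here is a toy model, not a description of how the fileserver is
actually implemented. It assumes every job writes the shared file at
the same rate and that each write costs a fixed amount of thread time
for every other client holding a callback. The function names and
parameters are hypothetical, made up for this sketch.

```python
def thread_demand(jobs, writes_per_sec, callback_cost_s):
    """Thread-seconds of work requested per second (toy model)."""
    # Assumption: each write must notify the other (jobs - 1) clients,
    # and each notification ties up a thread for callback_cost_s.
    return jobs * writes_per_sec * (jobs - 1) * callback_cost_s


def tipping_point(threads, writes_per_sec, callback_cost_s):
    """Smallest job count whose demand meets or exceeds the pool size."""
    if threads <= 0 or writes_per_sec <= 0 or callback_cost_s <= 0:
        raise ValueError("all parameters must be positive")
    jobs = 1
    # Demand grows roughly with the square of the job count, so this
    # loop terminates quickly for any positive inputs.
    while thread_demand(jobs, writes_per_sec, callback_cost_s) < threads:
        jobs += 1
    return jobs
```

Under these assumptions the demand grows with roughly the square of
the number of jobs, so a modest increase in batch jobs can take a
server from comfortable headroom to every thread being busy.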

It's often a file that the user isn't even aware they are
opening: a small file used by some library they depend on.
Tracking down the file can sometimes take significant effort.

- Booker C. Bense

[1]- I'm not the stuckee when this happens, just an interested
bystander, so I may have the details slightly wrong.