[OpenAFS] volume 536871264 is busy or server is down, recheck

Booker Bense bbense@slac.stanford.edu
Thu, 1 Apr 2010 12:44:23 -0700 (PDT)


On Thu, 1 Apr 2010, Jeffrey Altman wrote:

> On 3/31/2010 10:33 PM, ?? wrote:
>> Hi,
>>
>> I want to know how many parallel read requests one volume can handle
>> at the same time, or how many parallel read requests one replicated
>> volume can handle at the same time.
>>
>> In our AFS system, about one hundred people read a volume in
>> parallel, and each person issues about 500 read requests. I found
>> that the AFS client's /var/log/message file often shows error
>> messages such as "volume 536871264 is busy or server is down, recheck".
>>

Our experience is that AFS plus a large batch farm is a denial of
service waiting to happen for rw volumes. What happens
is that each batch process registers a callback for the volume it is
writing to, and eventually the server gets starved for available
threads and all the volumes served by that server suffer
performance hits. Essentially, the read requests are limited by
the number of threads available on the server for the volume.
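
As a rough back-of-the-envelope illustration of that limit, here is a
minimal sketch of how long a burst of requests takes to drain when only
a fixed number can be in service at once. The function name, the thread
count, and the per-request service time are all made-up assumptions for
illustration, not measured fileserver values; only the 100 readers and
500 requests each come from the question above.

```python
import math

def drain_time_seconds(clients, requests_per_client, server_threads, service_time_s):
    """Lower bound on the time to serve every request when at most
    `server_threads` requests can be in service at the same time."""
    total_requests = clients * requests_per_client
    # Requests are served in waves of at most `server_threads` each.
    waves = math.ceil(total_requests / server_threads)
    return waves * service_time_s

# 100 readers x 500 requests each, as in the question above.
# The thread count (128) and the 10 ms service time are hypothetical.
print(drain_time_seconds(100, 500, 128, 0.010))
```

The point of the sketch is only that the drain time grows with the total
request count divided by the thread count, so adding more batch jobs
against the same server stretches the queue for everyone on it.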

We have a constant user education problem with this, especially 
since the tipping point doesn't get triggered until the user is
sure everything is working and "scales up" their runs to several
hundred simultaneous batch jobs.

In theory, a read-only replica volume should not be nearly as
resource intensive. In practice, however, we have found that this
is rarely the case.

I suspect your real problem is that the jobs are opening dot 
files or configuration/logging files in some volume that is also 
on the same server as the volume you are reading from. Most 
applications have some library that assumes reading/writing to
small files in the home directory will never be a problem.
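
One possible way to test that suspicion is to run a job with its home
and temp directories pointed at local disk and see whether the errors
go away. The sketch below is only that, a sketch: the helper name
job_environment is made up, it assumes the default temp directory is on
local disk rather than in AFS, and it assumes the application honours
HOME and TMPDIR, which not every library does.

```python
import os
import tempfile

def job_environment(base_env=None, scratch_root=None):
    """Return a copy of the environment with HOME and TMPDIR pointing at a
    fresh scratch directory, so dot files land there instead of in AFS."""
    env = dict(os.environ if base_env is None else base_env)
    # mkdtemp creates a new, private directory; with dir=None it uses the
    # system default temp location (assumed here to be local disk).
    scratch = tempfile.mkdtemp(prefix="job-", dir=scratch_root)
    env["HOME"] = scratch
    env["TMPDIR"] = scratch
    return env
```

A batch wrapper could then launch the real program with something like
subprocess.run(["./my_job"], env=job_environment()), where ./my_job is
a placeholder for the actual job.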

AFS scales really well under the assumption of many machines each
accessing different volumes, but it crashes and burns when the
scenario switches to many machines accessing the same volume.

_ Booker C. Bense