[OpenAFS-devel] Retry transaction creates on transient problems

Rainer Toebbicke rtb@pclella.cern.ch
Fri, 24 Apr 2009 18:05:17 +0200


Well, I agree with the criticism that there are better ways to handle 
concurrency problems than macroscopic poll loops. No better yet still cheap 
solution occurred to me, which does not mean that there isn't one.

The effect, however, is rare, even in "large production environments", from 
which I believe we aren't that far away. The effect of thread tie-up on the 
volserver is also negligible from what I see.

The "definitive", well-engineered solution to the problem is very likely 
beyond just a few lines, will produce a substantial debugging tail, and will 
consume development resources that might be useful elsewhere. The patch solves 
a concrete problem, namely daily backups failing on a several-thousand-volume 
server, hence we need it or something equivalent. I decided to share it; 
perhaps there are suggestions for improving it that are worth the additional 
effort.
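
For readers without the attachment, the retry amounts to roughly the shape
below. This is only a minimal sketch and not the patch itself:
create_with_retry(), create_fn, ERR_TRANSIENT, MAX_RETRIES and the
fake_create() stand-in are hypothetical names chosen for illustration, not
volserver identifiers, and counting "three times" as three retries after the
first attempt is an assumption.

```c
#include <unistd.h>   /* sleep() */

#define ERR_TRANSIENT 1   /* hypothetical "busy, try again" code */
#define MAX_RETRIES   3   /* assumed: retries after the first attempt */

/* Hypothetical callback that tries to create the transaction once. */
typedef int (*create_fn)(void *arg);

/* Retry a transient failure a bounded number of times, sleeping between
 * attempts. Any other result (success or a hard error) is returned at once. */
int create_with_retry(create_fn create, void *arg, unsigned delay_sec)
{
    int code = create(arg);
    int retry;

    for (retry = 0; code == ERR_TRANSIENT && retry < MAX_RETRIES; retry++) {
        sleep(delay_sec);          /* the server thread is held while waiting */
        code = create(arg);
    }
    return code;
}

/* Stand-in for the real operation: fails transiently a set number of times. */
struct fake_state { int failures_left; int calls; };

int fake_create(void *arg)
{
    struct fake_state *s = arg;

    s->calls++;
    if (s->failures_left > 0) {
        s->failures_left--;
        return ERR_TRANSIENT;
    }
    return 0;
}
```

With a delay of 1 second, a call that never succeeds holds its thread for
about MAX_RETRIES seconds in this sketch before the error is returned, which
is the thread tie-up being discussed.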


Tom Keiser wrote:
> On Thu, Apr 16, 2009 at 11:11 AM, Rainer Toebbicke <rtb@pclella.cern.ch> wrote:
>> The attached patch causes a transient failure to create a volume transaction
>> to be retried, brutally, three times at 1-second intervals.
>>
>> The problem usually only affects servers with tens of thousands of volumes,
>> where a simple "vos listvol" could easily disturb a simultaneous "vos
>> backupsys", or one out of two simultaneous "vos listvols" could print
>> thousands of error messages depending on how they race.
>>
>> Note: The patch "undoes" another patch (and its correction) in that area -
>> that's not elegant but ok as it predates that patch and fixes both problems
>> in one go, for the first one in a slightly different manner.
>>
> 
> I don't think this patch is the right approach to the problem.  Server
> threads should be treated as a precious resource.  Holding calls
> active for several seconds at a time to circumvent the fact that
> vos/libadmin lack sufficient retry logic is a suboptimal solution
> which will reduce volop parallelism in large production environments.
> Granted, having vos clients poll every few seconds is also suboptimal
> from a fair queuing perspective, but there are other, better solutions
> to that problem which don't unnecessarily tie up server threads.
> 
> Cheers,
> 
> -Tom
> 
> 
> 


-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985       Fax: +41 22 767 7155