[OpenAFS-devel] Retry transaction creates on transient problems
Rainer Toebbicke
rtb@pclella.cern.ch
Fri, 24 Apr 2009 18:05:17 +0200
Well, I agree with the criticism that there are better ways to handle
concurrency problems than macroscopic poll loops. No better yet still cheap
solution occurred to me, which does not mean that there isn't one.
The effect, however, is rare, even in "large production environments", from
which I believe we aren't that far away. From what I see, the effect of thread
tie-up on the volserver is also negligible.
The "definitive", well-engineered solution to the problem very likely takes
more than just a few lines, will produce a substantial debugging tail and will
consume development resources that might be useful elsewhere. The patch solves
a concrete problem, namely daily backups failing on a several-thousand-volume
server, hence we need it or something equivalent. I decided to share it in
case there are suggestions to improve it which are worth the additional
effort.
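
For readers without the attachment, the idea can be sketched roughly as
below. This is not the patch itself: create_with_retry, fake_create and the
TRANS_* codes are hypothetical names made up for illustration and do not
come from the volserver sources, and counting "three times" as three
attempts in total is an assumption.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical status codes; the real volserver has its own error codes. */
enum { TRANS_OK = 0, TRANS_BUSY = 1, TRANS_FATAL = 2 };

/* Hypothetical callback: makes one attempt to create a transaction. */
typedef int (*create_fn)(void *arg);

/*
 * Call fn up to max_tries times, sleeping delay_sec seconds between
 * attempts. Only TRANS_BUSY is treated as transient; any other result
 * (success or a permanent error) is returned immediately.
 */
static int
create_with_retry(create_fn fn, void *arg, int max_tries, unsigned delay_sec)
{
    int code = TRANS_BUSY;
    int attempt;

    assert(max_tries > 0);
    for (attempt = 1; attempt <= max_tries; attempt++) {
        code = fn(arg);
        if (code != TRANS_BUSY)
            break;              /* success or a non-transient failure */
        if (attempt < max_tries)
            sleep(delay_sec);   /* the calling thread is held while waiting */
    }
    return code;
}

/* Stand-in for the real call: reports busy for the first *arg attempts. */
static int
fake_create(void *arg)
{
    int *busy_left = arg;

    if (*busy_left > 0) {
        (*busy_left)--;
        return TRANS_BUSY;
    }
    return TRANS_OK;
}
```

With max_tries = 3 and delay_sec = 1 this corresponds to the "three times in
1 sec intervals" behaviour. The sleep() inside the loop is exactly the thread
tie-up being criticised: the thread serving the call does nothing else while
it waits.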
Tom Keiser wrote:
> On Thu, Apr 16, 2009 at 11:11 AM, Rainer Toebbicke <rtb@pclella.cern.ch> wrote:
>> The attached patch causes a transient failure to create a volume transaction
>> to be retried, brutally three times in 1 sec intervals.
>>
>> The problem usually only affects servers with tens of thousands of volumes,
>> where a simple "vos listvol" could easily disturb a simultaneous "vos
>> backupsys", or one out of two simultaneous "vos listvols" could print
>> thousands of error messages depending on how they race.
>>
>> Note: The patch "undoes" another patch (and its correction) in that area -
>> that's not elegant but ok as it predates that patch and fixes both problems
>> in one go, for the first one in a slightly different manner.
>>
>
> I don't think this patch is the right approach to the problem. Server
> threads should be treated as a precious resource. Holding calls
> active for several seconds at a time to circumvent the fact that
> vos/libadmin lack sufficient retry logic is a suboptimal solution
> which will reduce volop parallelism in large production environments.
> Granted, having vos clients poll every few seconds is also suboptimal
> from a fair queuing perspective, but there are other, better solutions
> to that problem which don't unnecessarily tie up server threads.
>
> Cheers,
>
> -Tom
>
>
>
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985 Fax: +41 22 767 7155