[OpenAFS-devel] Retry transaction creates on transient problems

Tom Keiser tkeiser@sinenomine.net
Fri, 24 Apr 2009 02:59:01 -0400


On Thu, Apr 16, 2009 at 11:11 AM, Rainer Toebbicke <rtb@pclella.cern.ch> wr=
ote:
> The attached patch causes a transient failure to create a volume transact=
ion
> to be retried, brutally three times in 1 sec intervals.
>
> The problem usually only affects servers with ten-thousands of volumes,
> where a simple "vos listvol" could easily disturb a simultaneous "vos
> backupsys", or one out of two simultaneous "vos listvols" could print
> thousands of error messages depending on how they race.
>
> Note: The patch "undoes" another patch (and its correction) in that area =
-
> that's not elegant but ok as it predates that patch and fixes both proble=
ms
> in one go, for the first one in a slightly different manner.
>

I don't think this patch is the right approach to the problem.  Server
threads should be treated as a precious resource.  Holding calls
active for several seconds at a time to circumvent the fact that
vos/libadmin lack sufficient retry logic is a suboptimal solution
which will reduce volop parallelism in large production environments.
Granted, having vos clients poll every few seconds is also suboptimal
from a fair queuing perspective, but there are other, better solutions
to that problem which don't unnecessarily tie up server threads.

Cheers,

-Tom



> bcc'ed to openafs-bugs
>
> --
> =3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=
=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D=
-=3D-=3D
> Rainer Toebbicke
> European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
> Phone: +41 22 767 8985 =A0 =A0 =A0 Fax: +41 22 767 7155
>
> Index: openafs/src/volser/voltrans.c
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> RCS file: /cvs/openafs/src/volser/voltrans.c,v
> retrieving revision 1.10.2.5
> diff -u -r1.10.2.5 voltrans.c
> --- openafs/src/volser/voltrans.c =A0 =A0 =A0 18 Oct 2008 19:26:17 -0000
> =A01.10.2.5
> +++ openafs/src/volser/voltrans.c =A0 =A0 =A0 16 Apr 2009 14:21:37 -0000
> @@ -75,21 +75,31 @@
> =A0NewTrans(afs_int32 avol, afs_int32 apart)
> =A0{
> =A0 =A0 /* set volid, next, partition */
> - =A0 =A0struct volser_trans *tt, *newtt;
> + =A0 =A0struct volser_trans *tt;
> =A0 =A0 struct timeval tp;
> =A0 =A0 struct timezone tzp;
> + =A0 =A0int retries =3D 3;
>
> - =A0 =A0newtt =3D (struct volser_trans *)malloc(sizeof(struct volser_tra=
ns));
> =A0 =A0 VTRANS_LOCK;
> =A0 =A0 /* don't allow the same volume to be attached twice */
> - =A0 =A0for (tt =3D allTrans; tt; tt =3D tt->next) {
> - =A0 =A0 =A0 if ((tt->volid =3D=3D avol) && (tt->partition =3D=3D apart)=
) {
> - =A0 =A0 =A0 =A0 =A0 VTRANS_UNLOCK;
> - =A0 =A0 =A0 =A0 =A0 free(newtt);
> - =A0 =A0 =A0 =A0 =A0 return (struct volser_trans *)0; =A0 =A0/* volume b=
usy */
> + =A0 =A0while (1) {
> + =A0 =A0 =A0 for (tt =3D allTrans; tt; tt =3D tt->next) {
> + =A0 =A0 =A0 =A0 =A0 if ((tt->volid =3D=3D avol) && (tt->partition =3D=
=3D apart)) break;
> =A0 =A0 =A0 =A0}
> + =A0 =A0 =A0 if (tt =3D=3D NULL) break;
> +
> + =A0 =A0 =A0 VTRANS_UNLOCK;
> + =A0 =A0 =A0 if (retries-- > 0)
> +#ifdef AFS_PTHREAD_ENV
> + =A0 =A0 =A0 =A0 =A0 sleep(1); =A0 =A0 =A0 =A0 =A0 /* Allow for short lo=
ck-ups */
> +#else
> + =A0 =A0 =A0 =A0 =A0 IOMGR_Sleep(1); =A0 =A0 /* Allow for short lock-ups=
 */
> +#endif /*AFS_PTHREAD_ENV*/
> + =A0 =A0 =A0 else
> + =A0 =A0 =A0 =A0 =A0 return (struct volser_trans *)0; =A0 =A0/* volume b=
usy */
> + =A0 =A0 =A0 VTRANS_LOCK;
> =A0 =A0 }
> - =A0 =A0tt =3D newtt;
> + =A0 =A0tt =3D (struct volser_trans *)malloc(sizeof(struct volser_trans)=
);
> =A0 =A0 memset(tt, 0, sizeof(struct volser_trans));
> =A0 =A0 tt->volid =3D avol;
> =A0 =A0 tt->partition =3D apart;
>
>



--=20
Tom Keiser
tkeiser@gmail.com