[OpenAFS] "stillborn client" in src/viced/host.c

Jose Calhariz jose.calhariz@tagus.ist.utl.pt
Sun, 12 Nov 2006 19:05:49 +0000


On Tue, Oct 17, 2006 at 05:47:40PM -0400, Derrick J Brashear wrote:
> On Tue, 17 Oct 2006, Bill Stivers wrote:
>
> >Hey all:
> >
> >Thanks much for the tremendous help you've provided to me and my cohorts.
> >
> >I have another lame question.  I've been speaking to Joe about error
> >information he's seeing on our AFS servers, and he noted one particularly
> >odd one:
> >
> >Tue Oct 17 13:15:59 2006 FindClient: stillborn client b3f528(b17753cc);
> >conn b5b348 (host 128.114.104.230:7001) had client b4c030(b17753cc)
> >Mon Oct 16 19:27:28 2006 FindClient: stillborn client b3f7c8(2b1cc290);
> >conn b450d0 (host 128.114.30.230:7001) had client 6d52f0(2b1cc290)
> >
> >I looked at the code, and found the lines that are generating the message
> >in src/viced/host.c, which are as follows:
> >
> >Can someone who knows the codebase well shed some light as to what's going
> >on?  Is this another one of those: "You have OpenAFS in part of your
> >infrastructure and TransARC in part of it" issues?  is this, perhaps, part
> >of the locking code?
>
> We can remove that. It was a potential race we have cleaned up, something
> in the logs so we'd know it happened if we needed to debug something.
>
> >I'm trying to do due diligence to make sure my clients aren't partially to
> >blame for some of the things that our server administrators are fixing
> >now, and this is part of that effort.
>
> Well, it's more likely to happen with old windows clients, but, it can
> happen with any client, depending on circumstance.

I can confirm that it can happen with Linux and a recent client.

I am testing the stability of OpenAFS 1.4.2 on Linux, on both the
client and the server.  A Linux client endlessly compiles the Linux
kernel using make -j 35 on an AFS volume served by a Linux
fileserver.  In the past I have run this test for days without
problems.  This time the compilation aborts with errors, which is
normally an indication of hardware or kernel/software problems.
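
A minimal sketch of a driver for this kind of repeated-build test,
for anyone who wants to reproduce it.  It is an illustration only,
not the script used for the test above: the function name
run_until_failure is hypothetical, and it assumes the build command
exits non-zero when the compilation aborts.

```python
import subprocess
import time


def run_until_failure(cmd, max_runs=None, cwd=None):
    """Run cmd repeatedly until it fails or max_runs is reached.

    Returns (runs, failure_time).  failure_time is a local timestamp
    string for the first failing run, or None if every run succeeded.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        result = subprocess.run(cmd, cwd=cwd)
        runs += 1
        if result.returncode != 0:
            # Record when the build broke so it can be compared
            # with the timestamps in the fileserver log.
            return runs, time.strftime("%a %b %d %H:%M:%S %Y")
    return runs, None
```

It could be called as run_until_failure(["make", "-j", "35"],
cwd=...) with cwd pointing at a kernel tree on the AFS volume.  In
practice cmd would need to be a wrapper that cleans the tree before
each build, otherwise the second run has nothing to compile.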

The last error I have seen was:

ar: sound/drivers/opl4/built-in.o: File format not recognized

On the client there is no indication of a problem with AFS, but on the
fileserver I saw the same error some time before the compilation
aborted:

Sun Nov 12 14:05:05 2006 FindClient: stillborn client 8185ef0(785cb670); conn 8149220 (host 172.20.15.79:7001) had client 8185f98(785cb670)
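
To check whether these entries line up with the build failures, the
log lines can be picked apart mechanically.  A minimal sketch in
Python, assuming each entry is on a single line in the format shown
in this thread.  The function name parse_stillborn and the field
names are hypothetical; in particular, the hex value in parentheses
is only labelled "id" here, without claiming what it means inside
the fileserver.

```python
import re
from datetime import datetime

# Pattern derived only from the log lines quoted in this thread.
STILLBORN_RE = re.compile(
    r"^(?P<when>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"FindClient: stillborn client "
    r"(?P<client>[0-9a-f]+)\((?P<client_id>[0-9a-f]+)\);\s+"
    r"conn (?P<conn>[0-9a-f]+) "
    r"\(host (?P<host>[\d.]+):(?P<port>\d+)\)\s+"
    r"had client (?P<old_client>[0-9a-f]+)\((?P<old_client_id>[0-9a-f]+)\)$"
)


def parse_stillborn(line):
    """Return a dict of fields for a 'stillborn client' line, else None."""
    match = STILLBORN_RE.search(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Convert the timestamp so entries can be compared with build times.
    record["when"] = datetime.strptime(record["when"], "%a %b %d %H:%M:%S %Y")
    record["port"] = int(record["port"])
    return record
```

Running this over the fileserver log and keeping the records whose
host matches the build client gives a list of timestamps to compare
against the time each compilation aborted.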

So, from my point of view, this kind of error can result in data corruption.


    José Calhariz

-- 
	Old age is to be feared, because it never comes alone.  Canes
	are proof of age, not of prudence.
		-- Plato
