[OpenAFS] fileserver crashes

Marcus Watts mdw@umich.edu
Wed, 13 Oct 2004 14:05:51 -0400


Jeffrey Hutzelman <jhutz@cmu.edu> writes:
> From: Jeffrey Hutzelman <jhutz@cmu.edu>
> To: Marcus Watts <mdw@umich.edu>, openafs-info@openafs.org
> Subject: Re: [OpenAFS] fileserver crashes
> Message-ID: <478470000.1097679874@minbar.fac.cs.cmu.edu>
> In-Reply-To: <200410130743.DAA15853@quince.ifs.umich.edu>
> References:  <200410130743.DAA15853@quince.ifs.umich.edu>
> Date: Wed, 13 Oct 2004 11:04:34 -0400
> 
> > I don't know if our fileserver crashes are related to what Matthew Cocker
> > and others are seeing, but we are indeed seeing problems here at umich.
> >
> > Background information:
> > The machines in question are dual Pentium 4 machines with
> > hyperthreading enabled, running Linux 2.4.26 (SMP) and glibc 2.3.2.  The
> > actual file storage is on "cheap" raid devices that use multiple IDE
> > drives but talk SCSI to the rest of the world.  These raids have their
> > own set of problems, so I would not count them as super-reliable file
> > storage.  We're running the "pthreads" version of the fileserver.
> >
> > I think we're seeing at least 3 distinct problems with openafs 1.2.11.
> >
> > The first may actually be networking.  We get these with varying
> > frequency in VolserLog:
> > Sun Oct 10 22:05:09 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
> > Tue Oct 12 11:38:07 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
> > Tue Oct 12 13:39:23 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
> > Tue Oct 12 15:06:46 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
> > Helpful message, eh?  Can't tell what volume was being dumped,
> > or where it was going.
> 
> Well, -1 basically means the rx connection timed out.  There should be a
> corresponding error on whatever client was doing the dump, unless the
> issue was that that client decided to abort the call.  We see that all the
> time, because there are cases where our backup system will parse the start
> of a volume dump, decide it doesn't want it after all, and abort.
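
(Aside: rx's call/connection error codes are defined in src/rx/rx.h.
A minimal decoder, assuming the 1.2.x values -- -1 is RX_CALL_DEAD,
which is what a server sees when the peer simply stops talking; a
graceful abort instead delivers whatever code the aborting side passed
to rx_EndCall():)

    #include <stdio.h>

    /* Error values as defined in src/rx/rx.h (OpenAFS 1.2.x). */
    #define RX_CALL_DEAD            (-1)   /* peer went quiet; call declared dead */
    #define RX_INVALID_OPERATION    (-2)
    #define RX_CALL_TIMEOUT         (-3)   /* call exceeded its own timeout */
    #define RX_EOF                  (-4)
    #define RX_PROTOCOL_ERROR       (-5)
    #define RX_USER_ABORT           (-6)   /* peer aborted the call deliberately */

    static const char *
    rx_errstr(int code)
    {
        switch (code) {
        case RX_CALL_DEAD:          return "call dead (connection timed out)";
        case RX_INVALID_OPERATION:  return "invalid operation";
        case RX_CALL_TIMEOUT:       return "call timed out";
        case RX_EOF:                return "unexpected end of data";
        case RX_PROTOCOL_ERROR:     return "protocol error";
        case RX_USER_ABORT:         return "aborted by peer";
        default:                    return "not a core rx error";
        }
    }

    int
    main(void)
    {
        printf("rx error -1: %s\n", rx_errstr(-1));
        return 0;
    }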

That's nice, but right off the top of my head I can think of 3
possibilities for "the client" -- hdserver, vos run by an
administrator, or the afs backup software, and each of those poses
problems in terms of collecting error messages.  hdserver keeps a log,
but there aren't any obviously related failures there (there aren't any
messages at all for some of these time periods.)  Even if there were a
failure, I don't know how much sense we could make of it; hdserver is
capable of running multiple vos releases more or less in parallel.
Our backup system apparently was running out of TSM client
licenses for a while and aborting -- so that could have been the cause
of many of these, but as you already observed, there's no way to tell
which of those failures is related to which of these messages.  And,
presumably contrary to the experience at most sites, our administrators
have been notoriously reluctant to undergo brain surgery to install the
necessary hardware so that we can screen-scrape their heads to collect
error messages on demand.  Is there any reason that error message
couldn't be a little more helpful?
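
(To make "more helpful" concrete -- a hypothetical one-liner, assuming
the message comes from the dump path in src/volser and that the volume
pointer "vp" and the V_id()/V_name() accessors from src/vol/volume.h
are in scope at the point of the Log() call; I haven't verified the
exact context:)

    /* Instead of the current:
     *
     *     Log("1 Volser: DumpVolume: Rx call failed during dump, "
     *         "error %d\n", code);
     *
     * name the volume being dumped, so a VolserLog entry can be matched
     * up with a particular vos release, backup run, etc.:
     */
    Log("1 Volser: DumpVolume: Rx call failed during dump of volume "
        "%u (%s), error %d\n", V_id(vp), V_name(vp), code);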

> 
> 
> > We have at various times gotten problems with read-only replicas that
> > are oddly truncated.  This might or might not be the consequence of the
> > previous problem.
> 
> Hm.  That sounds familiar, but I thought that bug was fixed some time ago.
> In fact, Derrick confirms that the fix is in 1.2.11.
> 
> > Another probably completely different problem we have concerns volumes
> > with really small volume IDs.  Modern AFS software creates large
> > 10-digit volume IDs.  But we have volumes that were created long
> > before AFS 3.1, with small 3-digit volume IDs.  Those volumes are rapidly
> > disappearing as one by one, during various restarts, the fileserver and
> > salvager proceed to discard all the data, then the volume header.
> 
> That's... bizarre.  I've never heard of such a thing, but then, we don't
> have any Linux fileservers in our cell.  I understand the Andrew cell was
> seeing this for a while, but it went away without anyone successfully
> debugging it.

Well, it may be going away in our cell too -- clearly, natural selection
is busy removing problem volumes one by one.  I can't say that it's leaving
me with a comfortable feeling though.
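
(In the meantime it might be worth inventorying whatever small-ID
volumes are left, so they can be dumped somewhere safe before the
salvager gets to them.  A rough sketch: it assumes the "RWrite: <id>"
line format that 1.2.x "vos listvldb" prints under each entry's volume
name, and the 1,000,000 cutoff is just an arbitrary guess at
"suspiciously old":)

    /* small-ids.c -- read "vos listvldb" output on stdin and print
     * the name and RW id of every volume whose id looks pre-AFS-3.1
     * small.  Build with: cc -o small-ids small-ids.c
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SMALL_ID 1000000UL      /* modern volume ids are ~10 digits */

    int
    main(void)
    {
        char line[512], name[256] = "?";
        char *p;

        while (fgets(line, sizeof line, stdin) != NULL) {
            if (line[0] != '\n' && line[0] != ' ' && line[0] != '\t') {
                /* unindented line: each VLDB entry starts with its name */
                sscanf(line, "%255s", name);
            } else if ((p = strstr(line, "RWrite:")) != NULL) {
                unsigned long id = strtoul(p + 7, NULL, 10);
                if (id > 0 && id < SMALL_ID)
                    printf("%s %lu\n", name, id);
            }
        }
        return 0;
    }

Run it as "vos listvldb -cell umich.edu | ./small-ids", then vos dump
each volume it reports.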

> 
> 
> The last problem you describe sounds suspiciously like something Derrick
> has been trying to track down for the last 2 or 3 weeks.  I'll leave that
> to him, since he has a better idea than I of the current status of that.

I'll respond to that separately.  Thanks!

				-Marcus Watts
				UM ITCS Umich Systems Group