[OpenAFS] fileserver crashes

Jeffrey Hutzelman jhutz@cmu.edu
Wed, 13 Oct 2004 11:04:34 -0400


> I don't know if our fileserver crashes are related to what Matthew Cocker
> and others are seeing, but we are indeed seeing problems here at umich.
>
> Background information:
> The machines in question are dual pentium 4 machines with
> hyperthreading enabled running linux 2.4.26 (SMP) and glibc 2.3.2.  The
> actual file storage is on "cheap" raid devices that use multiple IDE
> drives but talk SCSI to the rest of the world.  These raids have their
> own set of problems, so I would not count them as super-reliable file
> storage.  We're running the "pthreads" version of the fileserver.
>
> I think we're seeing at least 3 distinct problems with openafs 1.2.11.
>
> The first may actually be networking.  We get these with varying
> frequency in VolserLog:
> Sun Oct 10 22:05:09 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
> Tue Oct 12 11:38:07 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
> Tue Oct 12 13:39:23 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
> Tue Oct 12 15:06:46 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
> Helpful message, eh?  Can't tell what volume was being dumped,
> or where it was going.

Well, -1 basically means the rx connection timed out.  There should be a
corresponding error on whatever client was doing the dump, unless the
issue was that the client itself decided to abort the call.  We see that
all the time, because there are cases where our backup system will parse
the start of a volume dump, decide it doesn't want it after all, and abort.
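
Since the log message doesn't name the volume or the destination, about
the only handle you have is the timestamp.  A rough sketch in Python of
pulling the failure times out of VolserLog so they can be lined up against
errors on the client side; it assumes the line format quoted above, and the
names DUMP_FAIL and dump_failures are purely illustrative, not part of
OpenAFS:

```python
import re
from datetime import datetime

# Matches lines of the form quoted above, e.g.
#   Sun Oct 10 22:05:09 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
# Assumption: the pattern is derived only from those quoted lines.
DUMP_FAIL = re.compile(
    r"^(?P<ts>\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}) .*"
    r"DumpVolume: Rx call failed during dump, error (?P<code>-?\d+)"
)


def dump_failures(lines):
    """Return a list of (timestamp, error_code) for failed DumpVolume calls."""
    hits = []
    for line in lines:
        m = DUMP_FAIL.match(line)
        if m:
            # ctime-style timestamp; assumes the default C locale for names
            ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
            hits.append((ts, int(m.group("code"))))
    return hits
```

Feed it an open VolserLog (any iterable of lines will do) and compare the
returned times with whatever the backup or vos client logged around then.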


> We have at various times gotten problems with read-only replicas that
> are oddly truncated.  This might or might not be the consequence of the
> previous problem.

Hm.  That sounds familiar, but I thought that bug was fixed some time ago.
In fact, Derrick confirms that the fix is in 1.2.11.

> Another probably completely different problem we have concerns volumes
> with really small volume IDs.  Modern AFS software creates large 10
> digit volume IDs.  But we have volumes that were created long before
> AFS 3.1, with small 3 digit volume IDs.  Those volumes are rapidly
> disappearing as one by one, during various restarts, the fileserver and
> salvager proceed to discard all the data, then the volume header.

That's... bizarre.  I've never heard of such a thing, but then, we don't
have any Linux fileservers in our cell.  I understand the Andrew cell was
seeing this for a while, but it went away without anyone successfully
debugging it.


The last problem you describe sounds suspiciously like something Derrick
has been trying to track down for the last 2 or 3 weeks.  I'll leave that
to him, since he has a better idea than I of the current status of that.