[OpenAFS] volume corruption: directory references disappear!?!

J Maynard Gelinas gelinas@lns.mit.edu
Sat, 27 Jul 2002 11:08:46 -0400 (EDT)


   Hello Christopher,

   We have one system with the same 3Ware controller. This host also has 
an IBM LT tape drive connected, and the thinking was to merge the backup 
server with an empty file server for quick recoveries in case of a RAID 
system failure out in the wild. I should note that our lab spans four 
buildings, so we're implementing local RAID file servers for users in each 
building. 

   All that said, I haven't seen volume corruption like that since
upgrading the servers to OpenAFS-1.2.5. I have seen one serious crash with
a kernel oops on one of our fileservers when it hit very high load. The load
spiked to over 8.0 due to excessive client LDAP queries coming from a
cronjob running on the desktops. I fixed that. :) Client side we're
running RH73, Kernel 2.4.18-31, OAFS-1.2.5. Server side we're running
Debian, Kernel-2.4.17, OAFS-1.2.5. The biggest gamble I'm taking is
running the /vicepx partitions on my fileservers with ext3fs. I've read
some posts here which suggest this may not be prudent on my part, though I
note that the volume corruption I experienced under OAFS-1.2.3 happened on
/vicepx partitions which were standard ext2fs.
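
   (For the curious, switching a /vicep partition from ext2 to ext3 is just
a matter of adding a journal and updating fstab. A minimal sketch, with a
made-up device and partition name:

    # add a journal to the existing ext2 filesystem
    tune2fs -j /dev/sdb1

    # /etc/fstab, before and after:
    #   /dev/sdb1   /vicepa   ext2   defaults   1 2
    /dev/sdb1   /vicepa   ext3   defaults   1 2

The on-disk format doesn't change, so if the journal ever looks like the
culprit the partition can simply be mounted as plain ext2 again.)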

   I also note that the volume corruption I experienced was total, though
clone volumes could be mounted and read. So I was able to create new
temporary volumes, tar the contents of the original over to a new volume,
and mount that. Performing a vos dump or backup dump simply resulted in a
failed operation due to volume errors. We also had standard AFS backups,
so no users lost any significant data. But the experience does still
concern me.  I'm deploying what will be about 200 OAFS clients in our lab,
so server stability is of serious concern right now.
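
   Roughly, the recovery went like this (a sketch from memory; the server,
partition, and volume names below are stand-ins, not our real ones):

    # dumping the corrupted volume failed outright with volume errors
    vos dump -id user.jdoe -file /tmp/user.jdoe.dump
    vos backup user.jdoe

    # but the clone was still mountable and readable, so the contents
    # could be copied by hand into a fresh volume
    vos create fs1.example.com /vicepa user.jdoe.new
    fs mkmount /afs/.ourcell/recovery/old user.jdoe.backup
    fs mkmount /afs/.ourcell/recovery/new user.jdoe.new -rw
    (cd /afs/.ourcell/recovery/old && tar cf - .) | \
        (cd /afs/.ourcell/recovery/new && tar xpf -)

    # then swap the user's mount point over to the new volume
    fs rmmount /afs/.ourcell/usr/jdoe
    fs mkmount /afs/.ourcell/usr/jdoe user.jdoe.new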

   If I can be of any help please don't hesitate to ask. 

--Maynard

On Fri, 26 Jul 2002, Christopher Arnold wrote:

> 
> We're currently running 1.2.5 servers and clients on 7.1 and 7.3 Red Hat
> Linux machines. Volumes on one of our servers are exhibiting very similar
> behavior.  It is a RAID machine using IDE drives presented to Linux as
> SCSI via 3ware Escalade hardware. The machine has several /vicepx
> partitions ranging in size from 100GB to 400GB.  The following shows up
> in our FileLog:
> 
> Fri Jul 26 13:19:24 2002 ReallyRead(): read failed device 1A inode 1777162977807295 errno 5
> Fri Jul 26 13:19:24 2002 ReallyRead(): read failed device 1A inode 1777162977807295 errno 5
> Fri Jul 26 13:19:27 2002 ReallyRead(): read failed device 1A inode 2503532141879063 errno 5
> Fri Jul 26 13:19:27 2002 ReallyRead(): read failed device 1A inode 2503532141879063 errno 5
> 
> Application servers accessing AFS report a "File too large" type message
> when attempting to write to various volumes.  An ls reports no files
> (not even . and ..) but sometimes the files are visible.  In both cases
> I have manually attempted to cd into these volumes and touch a tempfile
> and get a message that says "file too large".  So far the only solution
> I've found is to shut down the server and run a salvage.  Several times I
> have had to reboot the machine entirely.  Is there anything else I
> should look for in order to track this down?  I've also noticed callback
> failures on a frequent basis on many fileservers but I'm not sure if
> this is related.
> 
>
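
   (A footnote on the salvage step mentioned above: the bosserver can drive
the salvager and restart the fileserver processes for you, rather than
shutting the machine down by hand. A rough sketch, with server, partition,
and volume names as placeholders:

    # salvage every volume on one /vicep partition
    bos salvage -server fs1.example.com -partition /vicepa -localauth

    # or take a single suspect volume offline and salvage just that
    bos salvage -server fs1.example.com -partition /vicepa \
        -volume user.jdoe -localauth

Run as root on the fileserver itself, -localauth avoids needing admin
tokens.)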