[OpenAFS] help! volume corruption caused by lots of files

steve rader rader@ginseng.hep.wisc.edu
Sat, 07 Dec 2002 11:23:22 -0600


I have an end user who is running a script under Linux (2.2 or 2.4) that
creates directories with perhaps as many as 100,000 files in them.
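
For anyone wanting to reproduce the load, a rough stand-in for that
workload might look like the sketch below. This is not the user's actual
script (which is not shown here); the function name, target directory,
and file naming are placeholders.

```shell
# Hypothetical stand-in for the user's script (the real one is not shown):
# create COUNT empty files in a single directory, then report how many
# entries actually exist there.
make_many_files() {
  dir=$1
  count=$2
  mkdir -p "$dir" || return 1
  i=0
  while [ "$i" -lt "$count" ]; do
    # Stop at the first failed create so the failure point is visible.
    : > "$dir/file_$i" || { echo "create failed at file $i" >&2; return 1; }
    i=$((i + 1))
  done
  ls -1 "$dir" | wc -l
}
```

Calling it as `make_many_files /some/test/dir 100000` against a scratch
volume would approximate the case described above.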

The result is that the underlying AFS volume gets corrupted.  The first
few runs caused the client (OpenAFS 1.2.6) to lock up.  The third
(last) run caused the volume to go offline--vos examine said "****
Could not attach volume N ****".

bos salvage has brought the volume back online and I notice that...

 1) 40 vnodes have version issues...

  bash# egrep 'version < inode version' SalvageLog | wc -l
         40
  bash# egrep 'version < inode version' SalvageLog | tail -1
  12/07/2002 10:39:04 Vnode 1585980: version < inode version; fixed (old status)

 2) ~30,000 vnodes have missing inodes...

  bash# egrep 'inode [0-9]+ is missing' SalvageLog | wc -l
     30473
  bash# egrep 'is missing' SalvageLog | tail -1
  12/07/2002 10:39:49 Vnode 2164190 (unique 4221376): \
    corresponding inode 1697592 is missing; vnode deleted, \
    vnode mod time=Fri Dec  6 16:07:34 2002
 
 3) the same number of vnodes have invalid entries...
 
  bash# egrep 'invalid entry' SalvageLog | wc -l
     30473
  bash# egrep 'invalid entry' SalvageLog | tail -1
  12/07/2002 11:03:59 dir vnode 220075: invalid entry: \
    ./some/path/somefile (vnode 2164176, unique 4221369)
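
The three counts above can also be gathered in one pass over the log.
A minimal sketch, assuming each salvager message sits on a single line
in SalvageLog (the backslash wraps above are only for display); the
function name is illustrative:

```shell
# Tally the three salvager message types quoted above in a single pass.
# $1 is the path to the SalvageLog file.
tally_salvage_log() {
  awk '
    /version < inode version/ { version++ }
    /inode [0-9]+ is missing/ { missing++ }
    /invalid entry/           { invalid++ }
    END {
      printf "version=%d missing=%d invalid=%d\n", version, missing, invalid
    }
  ' "$1"
}
```

Run as `tally_salvage_log SalvageLog`; matching "missing" and "invalid"
totals, as in the counts above, suggest the two messages describe the
same set of files.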

At first, I thought the source of the problem was Linux 2.2's
inode-max, so I set it to 32K.  But then the script failed again so I
had the end-user run the script under Linux 2.4 (which automagically
handles inode-max, right?)  No cigar.

Is it possible that the first run (under 2.2) caused the volume
to become corrupt (but not go offline) and then the second run killed
the volume?  If so, perhaps I should have the user run the script
under 2.4 after a salvage?  (Notice that the suspicious ~30,500 figure
is pretty close to 32768 minus overhead.)
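
To put a number on that gap, using the exact count from the salvage log
(30473) rather than the rounded figure; the helper name is made up, and
whether a 32768 limit is really involved is only a guess:

```shell
# Difference between a suspected limit and the observed count.
headroom() {
  echo $(( $1 - $2 ))
}

headroom 32768 30473   # prints 2295
```

So "overhead" would have to account for roughly 2,300 entries.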

Other ideas?  I'd really like to avoid telling the end user to just
write to local disk! =:)  AFS being perceived as not-super-stable
is a political issue here.  I'd like to figure out a cure for and a
cogent explanation of this problem ASAP.

TIA, anyone!

steve 
- - - 
systems & network guy
high energy physics
university of wisconsin