[OpenAFS] corruption problems

Fri, 26 Nov 2004 13:06:57 -0500

Date: Thu, 25 Nov 2004 17:41:55 -0500 (EST) 
From: Derrick J Brashear <shadow@dementia.org> 
To: "'openafs-info@openafs.org'" <openafs-info@openafs.org> 
Subject: Re: [OpenAFS] corruption problems 
On Thu, 25 Nov 2004, Nicolescu, Edward L wrote: 
>> Folks, 
>> 
>> The problem I am about to describe occurs on any afs client/server 
>> combination........
>Can you give a list of specific server versions (platform and 
>version) you tried? 

Hi Derrick,

Servers: IBM AFS 3.6/AIX5.2 and OpenAFS 1.2.11/SunOS5.9
Clients: all of them, whether AIX, SunOS, or RedHat

Meanwhile I have unreplicated the volume (this being its original state; I
had replicated it just to see if it made a difference or not) and split it
in two. It looks a bit better now: I am able to create/delete files in a
continuous "while" loop without corrupting the volumes. The difference made
by splitting the volume is in the number of files per volume. Before, there
were over 32,000 files in one volume. I found an e-mail of yours in the
openafs-info archive saying that this high a number of files could cause
problems (see below). Is this the limit imposed by "max-files-per-dir"?
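
To see how close a given directory is to such a limit, one can count its
entries and the lengths of their names. The following is a minimal Python
sketch; it only reads the directory listing and does not stat individual
files. The slot estimate is an assumption and is not confirmed anywhere in
this thread: it follows the commonly described AFS directory layout, in which
a name of up to 15 bytes occupies one 32-byte slot and longer names occupy
additional slots. Under that assumption a 20-character name costs two slots,
so a directory full of such names would hold about half as many entries as
one with short names. The function names are hypothetical.

```python
import os
from collections import Counter


def name_slots(name: str) -> int:
    """Estimate directory slots used by one entry name.

    ASSUMPTION (not from this thread): one 32-byte slot holds a name of
    up to 15 bytes plus its terminator; each further 32 bytes of name
    needs one more slot.
    """
    n = len(os.fsencode(name))
    return 1 + (n + 16) // 32


def survey_directory(path: str) -> dict:
    """Count entries, estimated slots, and name-length distribution."""
    lengths = Counter()
    entries = 0
    slots = 0
    # os.scandir reads the listing without an explicit stat per entry.
    with os.scandir(path) as it:
        for entry in it:
            entries += 1
            lengths[len(os.fsencode(entry.name))] += 1
            slots += name_slots(entry.name)
    return {
        "entries": entries,
        "estimated_slots": slots,
        "name_lengths": dict(lengths),
    }
```

Calling survey_directory() on the directory in question and comparing
"entries" with "estimated_slots" shows whether long names are what pushes
the directory toward the limit.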

But we still have problems when trying to "move" a file inside the same
directory to a different name. I have appended below a complete
description of the problem as sent to me by a user, this morning. The
corruption of the volumes (now two as opposed to one, before) relates
to,

Vnode "N": version < inode version; fixed (old status)

There are 518 vnodes exhibiting the version mismatch problem in one
volume and 519 in the other one.
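
Counts like these can be taken from the salvager output with a short script.
The sketch below is Python and assumes each affected vnode is reported on its
own line in the form quoted above, with a decimal vnode number in place of
"N" (optionally in quotes); the exact log format may differ on other
versions, and the function names are hypothetical.

```python
import re

# ASSUMPTION: one line per vnode, decimal number, message text as quoted
# in this mail. Adjust the pattern if the real log differs.
VNODE_FIX = re.compile(r'Vnode\s+"?(\d+)"?:\s+version < inode version; fixed')


def version_fix_vnodes(lines):
    """Return the vnode numbers reported with a version mismatch."""
    found = []
    for line in lines:
        match = VNODE_FIX.search(line)
        if match:
            found.append(int(match.group(1)))
    return found


def summarize(lines):
    """Return (number of matching lines, number of distinct vnodes)."""
    vnodes = version_fix_vnodes(lines)
    return len(vnodes), len(set(vnodes))
```

Passing an open log file to summarize() (a file object iterates over its
lines) gives one pair of counts per volume if the log is split per volume
beforehand.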

Thanks. -Edward

---------------------------- original message ----------------------------
>>
>>My thanks go out to Rubino <kb44 @ rz.uni-karlsruhe.de> for serious
>>pointing out the max-files-per-dir info for AFS!  [1]
>>
>>I have identified a directory with exactly 32,000 files...
>>31,998 of them are 20 characters in length.  Crazy physicists.
>>    
>>
> (answer from Derrick)
>We had a usage pattern which would consistently allow 31707 (I think) 
>files and then fail. 
------------------------------------------------------------------------

---------- description of problem as sent by a user ------------------
o I started with a cleanup of the Qt directory 
  % find . -name '*.o' -exec rm {} \;
  this worked (no hang-up) and further reduced the number of files
  (done in two directories icc/gcc). So, removing files works fine.
o I tried again things like date >>blabla  (blabla an existing file)
  and this worked as well.
o Then, went to lynx 'make install'. The first command 
  % mv -f /afs/rhic.bnl.gov/opt/star/sl302_gcc323/bin/lynx \
       /afs/rhic.bnl.gov/opt/star/sl302_gcc323/bin/lynx.old
  freezes. I infer that it is not the process of removing/deleting
  a file or even updating a file which is the problem but the
  process of adding one. While this command hangs, from 
  another client, I can
  % ls -l /afs/rhic.bnl.gov/opt/star/sl302_gcc323/bin/lynx 
  ls: /afs/rhic.bnl.gov/opt/star/sl302_gcc323/bin/lynx: No such file
  or directory
  which confirms that the original file 'lynx' was deleted 
  successfully BUT
  %  ls -l /afs/rhic.bnl.gov/opt/star/sl302_gcc323/bin/lynx.old
  would hang, and this, even from another client. In fact, 
  as described before, any 'stat()' like command (ls -l needing 
  to get file properties) would equally hang at this stage. For
  example, 'ls -F' (a typical alias) on 
  /afs/rhic.bnl.gov/opt/star/sl302_gcc323/bin/ gets the session to
  freeze from ANY client (I won't try them all, rcas6002 and rcas6001
  are busted).
o NOW: if I do the same 'ls' from my desktop machine, RH9, OpenAFS
  1.2.11-rh9.0.1, a similar 'ls' would also freeze my desktop
  % jobs
  [1]  + Running                       ls -alrt  
  /afs/rhic.bnl.gov/opt/star/sl302_gcc323/bin/
  ("Running" state for the past 10 minutes already). I would venture 
  that this is true for any Linux client around and beyond. 
o In fact, to illustrate the above, I went to Stony Brook on a
  Tru64 UNIX client running a Transarc AFS client 3.6 and 
  an 'ls -F' command is frozen
  % jobs
  [1]  + Running                       ls -F
  /afs/rhic.bnl.gov/i386_sl302/opt/star/sl302_gcc323/bin/
  The conclusion being that if such a problem is created on this
  volume, a cascading effect would freeze all clients world-wide
  after any command issued on the directory in question, and this
  for 12 hours or more.
 To date, it is the oddest AFS behavior I have seen.
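
------------------------------------------------------------------------

Since any 'ls -l' on the affected path can freeze the session, it may help to
probe such paths from a script that gives up after a deadline instead of from
an interactive shell. The following Python sketch runs the command in a child
process and polls it. The helper names and the default timeout are
hypothetical, and a child blocked inside the kernel may not be killable, so
the script deliberately does not wait for it after the deadline.

```python
import subprocess
import time


def probe(cmd, timeout=10.0, poll=0.1):
    """Run cmd; return "ok", "error", or "hung" if it exceeds timeout."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        rc = proc.poll()
        if rc is not None:
            return "ok" if rc == 0 else "error"
        time.sleep(poll)
    # Best effort only: a process stuck in a file system call may ignore
    # the signal, so do not block here waiting for it to exit.
    proc.kill()
    return "hung"


def probe_listing(path, timeout=10.0):
    """Probe a path with 'ls -l', the command reported to hang above."""
    return probe(["ls", "-l", path], timeout)
```

For example, probe_listing() on the lynx.old path from the report above
would be expected to return "hung" while the problem persists, and "error"
for a name that no longer exists.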