[OpenAFS] corruption problems
Nicolescu, Edward L
edward@bnl.gov
Fri, 26 Nov 2004 13:06:57 -0500
Date: Thu, 25 Nov 2004 17:41:55 -0500 (EST)
From: Derrick J Brashear <shadow@dementia.org>
To: "'openafs-info@openafs.org'" <openafs-info@openafs.org>
Subject: Re: [OpenAFS] corruption problems
On Thu, 25 Nov 2004, Nicolescu, Edward L wrote:
>> Folks,
>>
>> The problem I am about to describe occurs on any afs client/server
>> combination........
>Can you give a list of specific server versions (platform and
>version) you tried?
Hi Derrick,
Servers: IBM AFS 3.6/AIX5.2 and OpenAFS 1.2.11/SunOS5.9
Clients: all of them, whether AIX, SunOS, or Red Hat
Meanwhile I have unreplicated the volume (this being its original state; I
had replicated it only to see whether it made any difference) and split it
in two. Things look a bit better now: I can create and delete files in a
continuous "while" loop without corrupting the volumes. The difference made
by splitting the volume is in the number of files per volume: before, there
were over 32,000 files in one volume. I found an e-mail of yours in the
openafs-info archive saying that this high a number of files could cause
problems (see below). Is this the limit imposed by "max-files-per-dir"?
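For reference, the stress test was nothing more than a create/delete loop
along these lines (the directory path and file names here are illustrative,
not the real ones):

   #!/bin/sh
   # Repeatedly create a file in an AFS directory and delete it again.
   # DIR is a made-up path; substitute the directory in the volume under test.
   DIR=/afs/rhic.bnl.gov/some/test/dir
   i=0
   while true; do
       i=`expr $i + 1`
       date > $DIR/file.$i     # create
       rm -f $DIR/file.$i      # delete
   done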
But we still have problems when renaming ("mv") a file to a different name
inside the same directory. I have appended below a complete description of
the problem as sent to me by a user this morning. The corruption of the
volumes (now two, as opposed to one before) shows up in the salvager as,
Vnode "N": version < inode version; fixed (old status)
There are 518 vnodes exhibiting the version mismatch problem in one
volume and 519 in the other one.
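These messages are what the salvager logs; for anyone wanting to reproduce
the check, the counts can be obtained with something along these lines
(server, partition and volume names are placeholders, and Transarc-style
log paths are assumed):

   % bos salvage -server <fileserver> -partition /vicepa -volume <volume>
   % grep -c 'version < inode version' /usr/afs/logs/SalvageLog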
Thanks. -Edward
---------------------------- original message ----------------------------
>>
>>My thanks go out to Rubino <kb44 @ rz.uni-karlsruhe.de> for seriously
>>pointing out the max-files-per-dir info for AFS! [1]
>>
>>I have identified a directory with exactly 32,000 files...
>>31,998 of them are 20 characters in length. Crazy physicists.
>>
>>
> (answer from Derrick)
>We had a usage pattern which would consistently allow 31707 (I think)
>files and then fail.
------------------------------------------------------------------------
---------- description of problem as sent by a user ------------------
o I started with a cleanup of the Qt directory
   % find . -name '*.o' -exec rm {} \;
   this worked (no hang-up) and further reduced the number of files
   (done in two directories, icc and gcc). So, removing files works fine.
o I again tried things like 'date >> blabla' (blabla being an existing
   file) and this worked as well.
o Then I went on to lynx's 'make install'. The first command,
   mv -f /afs/rhic.bnl.gov/opt/star/sl302_gcc323/bin/lynx \
      /afs/rhic.bnl.gov/opt/star/sl302_gcc323/bin/lynx.old
   freezes. I infer that the problem is not the process of removing/deleting
   a file, or even of updating one, but the process of adding one. While
   this command hangs, from another client I can do
   % ls -l /afs/rhic.bnl.gov/opt/star/sl302_gcc323/bin/lynx
   ls: /afs/rhic.bnl.gov/opt/star/sl302_gcc323/bin/lynx: No such file
   or directory
   which confirms that the original file 'lynx' was deleted successfully, BUT
   % ls -l /afs/rhic.bnl.gov/opt/star/sl302_gcc323/bin/lynx.old
   hangs, and this even from another client. In fact, as described before,
   any stat()-like command (ls -l needs to get file properties) equally
   hangs at this stage. For example, 'ls -F' (a typical alias) on
   /afs/rhic.bnl.gov/opt/star/sl302_gcc323/bin/ freezes the session from
   ANY client (I won't try them all; rcas6002 and rcas6001 are busted).
o NOW: if I do the same 'ls' from my desktop machine (RH9, OpenAFS
   1.2.11-rh9.0.1), it also freezes the desktop:
   % jobs
   [1]  + Running   ls -alrt /afs/rhic.bnl.gov/opt/star/sl302_gcc323/bin/
   (in the "Running" state for the past 10 minutes already). I would
   venture that this is true for any Linux client around and beyond.
o In fact, to illustrate the above, I went to a Tru64 UNIX client at
   Stony Brook running the Transarc AFS 3.6 client, and an 'ls -F'
   command is frozen there as well:
   % jobs
   [1]  + Running   ls -F
         /afs/rhic.bnl.gov/i386_sl302/opt/star/sl302_gcc323/bin/
The conclusion is that once such a problem is created on this volume, a
cascading effect freezes all clients world-wide after any command issued
on the directory in question, and this for 12 hours or more.
To date, this is the oddest AFS behavior I have seen.