[OpenAFS-devel] Interesting problems/info related to an old recurring problem with fileservers

Nathan Neulinger nneul@umr.edu
Fri, 27 Aug 2004 20:02:09 -0500


I occasionally have a problem with iinc failed messages that I believe are
the result of failed cloning operations due to failures/timeouts during
volser ops related to our backup facility. I'm thinking that something is
causing a ton of clones to get left around and never cleaned up. Be that
as it may, I started doing some testing and reproduced an interesting failure 
case.

Here's an interesting tidbit I found today while testing: I interrupted a
vos dump -clone while it was deleting a clone on a test fileserver.

One of the volume "special" files is missing or is basically empty (8 bytes).
If I remove it, it just gets recreated at that same 8-byte size. This
results in the namei_GetLinkCount call failing, likely because a read after
a seek into the file returns no data.
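
To make the suspected failure mode concrete, here is a minimal standalone
sketch. It is not the namei code: the layout (an 8-byte header followed by
2-byte entries) is an assumption, and read_entry and demo are hypothetical
names used only for illustration. The point is that the seek past the end
of a header-only file succeeds and only the read comes back empty.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

enum { HDR_SIZE = 8 };  /* assumed header size; matches the 8-byte file */

/* Hypothetical reader: returns the entry for `index`, or -1 on failure. */
int read_entry(FILE *fp, long index)
{
    uint16_t row;  /* assumed entry width */

    /* Seeking past end-of-file is allowed on POSIX, so this succeeds. */
    if (fseek(fp, HDR_SIZE + index * (long)sizeof(row), SEEK_SET) != 0)
        return -1;
    /* The read is what fails: a header-only file has no bytes here. */
    if (fread(&row, sizeof(row), 1, fp) != 1)
        return -1;
    return row;
}

/* Builds a scratch table holding `n_entries` entries of value 1, then
 * reads entry 0. Returns -2 if the scratch file could not be set up. */
int demo(int n_entries)
{
    const unsigned char hdr[HDR_SIZE] = {0};
    const uint16_t one = 1;
    FILE *fp = tmpfile();
    int i, result;

    if (fp == NULL)
        return -2;
    if (fwrite(hdr, 1, sizeof(hdr), fp) != sizeof(hdr)) {
        fclose(fp);
        return -2;
    }
    for (i = 0; i < n_entries; i++) {
        if (fwrite(&one, sizeof(one), 1, fp) != 1) {
            fclose(fp);
            return -2;
        }
    }
    result = read_entry(fp, 0);
    fclose(fp);
    return result;
}
```

With a header-only file, demo(0) fails with -1; once a single entry is
present, demo(1) returns its value.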

Salvager is unable to correct this situation.

Adding the call to SetLinkCount below did make the salvager errors go
away, but it's probably not a good idea, since there is no way to distinguish
between "failed because it can't find that file in the linkCount table" and
"failed for some other reason".

    if ((count = namei_GetLinkCount(fdP, ino, 1)) < 0) {
        printf("failed getLinkCount - 1\n");
        if (namei_SetLinkCount(fdP, ino, 1, 0) < 0) {
            printf("also failed setLinkCount\n");
        }
        code = -1;
    }

What concerns me is how this situation is occurring in the first place.
I wouldn't think that interrupting a client operation would be able to
cause the server to corrupt a volume on disk at such a low level.
Obviously it could leave temp clones around and all that, but I
wouldn't think a basic special file would get corrupted.

The main reason this is a problem is that when a volume gets to this state,
the file server is unable to do iinc's on any inode in the volume, returning
errors to the client that there is no space left on device.

I'm wondering if perhaps it would be reasonable to have GetLinkCount return
a special error value telling the caller that it should try to repair the
situation and forcibly set the link count, even though no count could be
returned.
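
A rough sketch of that idea, using an in-memory stand-in for the link
table rather than the real namei functions. Everything here is
hypothetical: the LC_ERR_* codes, struct link_table, and the
get_link_count / set_link_count / get_or_repair names exist only for
illustration. The repair path runs only for the "no entry" case, so other
failures are still reported untouched.

```c
#include <assert.h>
#include <stddef.h>

enum {
    LC_ERR_OTHER = -1,    /* any other failure: leave the table alone */
    LC_ERR_NO_ENTRY = -2  /* hypothetical: table has no entry for this file */
};

#define TABLE_CAP 16

/* In-memory stand-in for the on-disk link table. */
struct link_table {
    int io_broken;        /* simulates a failure unrelated to a missing entry */
    size_t n_entries;     /* number of entries actually present */
    int counts[TABLE_CAP];
};

int get_link_count(const struct link_table *t, size_t index)
{
    if (t->io_broken)
        return LC_ERR_OTHER;
    if (index >= t->n_entries)
        return LC_ERR_NO_ENTRY;  /* distinguishable from other errors */
    return t->counts[index];
}

int set_link_count(struct link_table *t, size_t index, int count)
{
    if (t->io_broken || index >= TABLE_CAP)
        return LC_ERR_OTHER;
    /* Extend the table with zeroed entries up to and including `index`. */
    while (t->n_entries <= index)
        t->counts[t->n_entries++] = 0;
    t->counts[index] = count;
    return 0;
}

/* Caller-side policy: repair only when the entry is known to be missing. */
int get_or_repair(struct link_table *t, size_t index)
{
    int count = get_link_count(t, index);

    if (count == LC_ERR_NO_ENTRY) {
        if (set_link_count(t, index, 0) < 0)
            return LC_ERR_OTHER;
        return 0;
    }
    return count;  /* a valid count, or an error passed through unchanged */
}
```

On an empty table, get_or_repair creates the missing entry with a count of
0; with io_broken set, it returns LC_ERR_OTHER and modifies nothing.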

Another way of looking at it - regardless of whether or not a 0 link count
should mean that the file isn't supposed to exist, if you did an iinc on a
file with no links, a more "useful" result would be a link count of 1, even
though the situation shouldn't occur. It would at least be a better way
of recovering from the error condition in a manner that wouldn't make the
volume unrecoverable. 

If anyone wants to take a closer look at the data, I have a tarball of
the /vicepa contents (only about 2.6 MB compressed, probably could be shrunk
down even further if I truncated some of the actual files). It's a partition
with a single volume in it.

-- Nathan

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-6679
UMR Information Technology             Fax: (573) 341-4216