[OpenAFS-devel] Interesting problems/info related to an old recurring problem with fileservers

Neulinger, Nathan nneul@umr.edu
Fri, 27 Aug 2004 20:51:40 -0500


Another possibility would be to change this:

    if (read(h->fd_fd, (char *)&row, sizeof(row)) != sizeof(row)) {
        goto bad_getLinkByte;
    }

in GetLinkCount, so that "can't get record, i.e. no record exists" is
treated the same as a link count of zero: just return 0 there instead
of treating it as an error condition.
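
A minimal sketch of that idea, not the actual namei code. The function
name, the 16-bit row type, and the row indexing are hypothetical; the
8-byte header is only assumed from the file size mentioned in the
original message. It returns 0 when the read hits end of file, and still
reports a failed seek or a partial row as an error:

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_HEADER_SIZE 8   /* assumed header size, for illustration only */

/* Hypothetical stand-in for GetLinkCount: returns the stored count,
 * 0 if the row lies at or past end of file, or -1 on a real I/O error. */
int sketch_get_link_count(int fd, uint32_t row_index)
{
    uint16_t row;
    off_t offset = SKETCH_HEADER_SIZE + (off_t)row_index * (off_t)sizeof(row);
    ssize_t n;

    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;                /* seek failed: genuine error */

    n = read(fd, &row, sizeof(row));
    if (n == (ssize_t)sizeof(row))
        return (int)row;          /* normal case: the record exists */
    if (n == 0)
        return 0;                 /* end of file: no record, so zero links */
    return -1;                    /* read error or partial row */
}
```

Only the "read returned nothing" case is folded into zero here, so a
failing disk would still surface as an error.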

-- Nathan

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-6679
UMR Information Technology             Fax: (573) 341-4216

> -----Original Message-----
> From: openafs-devel-admin@openafs.org
> [mailto:openafs-devel-admin@openafs.org] On Behalf Of Nathan Neulinger
> Sent: Friday, August 27, 2004 8:02 PM
> To: openafs-devel@openafs.org
> Subject: [OpenAFS-devel] Interesting problems/info related to
> an old recurring problem with fileservers
>
> I occasionally have a problem with iinc failed messages that I believe
> are the result of failed cloning operations due to failures/timeouts
> during volser ops related to our backup facility. I'm thinking that
> something is causing a ton of clones to get left around and never
> cleaned up. Be that as it may, I started doing some testing and
> reproduced an interesting failure case.
>
> Here's an interesting tidbit I found today after some testing by
> interrupting a vos dump -clone in the middle of deleting a clone on a
> test fileserver.
>
> One of the volume "special" files is missing or is basically empty. It
> gets recreated repeatedly with the 8 byte size if I remove it as well.
> This results in the namei_GetLinkCount call failing, likely due to a
> read after a seek into the file returning no data.
>
> Salvager is unable to correct this situation.
>
> Adding the call to SetLinkCount below did allow the salvager errors to
> go away, but it's probably not a good idea since there is no way of
> distinguishing between "failed because it can't find that file in the
> linkCount table" and "failed for some other reason".
>
>     if ((count = namei_GetLinkCount(fdP, ino, 1)) < 0) {
>         printf("failed getLinkCount - 1\n");
>         if (namei_SetLinkCount(fdP, ino, 1, 0) < 0) {
>             printf("also failed setLinkCount\n");
>         }
>         code = -1;
>     }
>
> The part that I am concerned about is: how is this situation occurring
> in the first place? I wouldn't think that interrupting a client
> operation would be able to cause the server to corrupt a volume on disk
> at this low a level. Obviously it could leave temp clones around and
> all that, but I wouldn't think a basic special file would get
> corrupted.
>
> The main reason this is a problem is that when a volume gets to this
> state, the file server is unable to do iinc's on any inode in the
> volume, returning errors to the client that there is no space left on
> device.
>
> I'm wondering if perhaps it would be reasonable to have GetLinkCount
> return a special error value to indicate to the caller that the caller
> should try to repair the situation and forcibly set the link count even
> though one wasn't able to be returned.
>
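
The special-error-value idea could be sketched roughly as follows. The
names LC_NO_RECORD and LC_IO_ERROR, the helper functions, and the 16-bit
row are all hypothetical and are not taken from the namei sources. The
point is only that the caller repairs the "no record" case and nothing
else:

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical return codes: they separate "row is not in the table"
 * from "something else went wrong". */
enum { LC_NO_RECORD = -2, LC_IO_ERROR = -1 };

/* Returns the stored count, LC_NO_RECORD at end of file, else LC_IO_ERROR. */
static int sketch_lookup(int fd, off_t offset)
{
    uint16_t row;
    ssize_t n;

    if (lseek(fd, offset, SEEK_SET) < 0)
        return LC_IO_ERROR;
    n = read(fd, &row, sizeof(row));
    if (n == (ssize_t)sizeof(row))
        return (int)row;
    return (n == 0) ? LC_NO_RECORD : LC_IO_ERROR;
}

/* Writes one row at the given offset; returns 0 or LC_IO_ERROR. */
static int sketch_store(int fd, off_t offset, uint16_t count)
{
    if (lseek(fd, offset, SEEK_SET) < 0)
        return LC_IO_ERROR;
    if (write(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return LC_IO_ERROR;
    return 0;
}

/* Caller-side repair: only a missing record is forcibly set to
 * `fallback`; existing counts and hard errors pass through untouched. */
int sketch_lookup_or_repair(int fd, off_t offset, uint16_t fallback)
{
    int count = sketch_lookup(fd, offset);

    if (count != LC_NO_RECORD)
        return count;
    if (sketch_store(fd, offset, fallback) < 0)
        return LC_IO_ERROR;
    return (int)fallback;
}
```

With this shape, a salvager-style caller could log and repair on
LC_NO_RECORD while still bailing out on LC_IO_ERROR.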
> Another way of looking at it - regardless of whether or not a 0 link
> count should mean that the file isn't supposed to exist, if you did an
> iinc on a file with no links, a more "useful" result would be a link
> count of 1, even though the situation shouldn't occur. It would at
> least be a better way of recovering from the error condition in a
> manner that wouldn't make the volume unrecoverable.
>
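
As an illustration of the "iinc on a file with no links yields 1"
behaviour, here is a small read-modify-write sketch. The function name
and the 16-bit row are hypothetical, and locking is left out entirely,
which real fileserver code would certainly need:

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical iinc-style helper: increments the 16-bit count stored at
 * `offset` and returns the new value, or -1 on an I/O error or overflow.
 * A row at or past end of file is treated as a count of zero, so the
 * first increment yields 1 instead of failing. */
int sketch_inc_link_count(int fd, off_t offset)
{
    uint16_t row = 0;
    ssize_t n;

    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    n = read(fd, &row, sizeof(row));
    if (n == 0)
        row = 0;                  /* no record: start from zero */
    else if (n != (ssize_t)sizeof(row))
        return -1;                /* read error or partial row */

    if (row == UINT16_MAX)
        return -1;                /* refuse to wrap around */
    row++;

    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    if (write(fd, &row, sizeof(row)) != (ssize_t)sizeof(row))
        return -1;
    return (int)row;
}
```

Whether a silent recovery like this is acceptable, or whether it should
at least be logged, is a separate question.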
> If anyone wants to take a closer look at the data, I have a tarball of
> the /vicepa contents (only about 2.6 MB compressed, probably could be
> shrunk down even further if I truncated some of the actual files). It's
> a partition with a single volume in it.
>
> -- Nathan
>