[OpenAFS] AFS CopyOnWrite problem

Marc Santoro msantoro@pobox.com
Thu, 18 Mar 2004 22:56:29 -0600 (CST)


I spent some time with strace, some judicious grepping, and the source,
and found a workaround for the problem, if not a real fix.

Apparently, for some reason, AFS is trying to create new files over
existing ones. I attached strace to all the fileserver processes and
grepped for EEXIST, which is the errno 17 reported in the log. Every time
I tried to create, remove, or rename a file in the directory, something
like this would show up in the strace output:

    open("/vicepa/AFSIDat/4/4+++U/+/+/T5++2kg", O_RDWR|O_CREAT|O_TRUNC|O_EXCL, 0) = -1 EEXIST (File exists)

Once I moved all the files that triggered this error out of the way
(there were only three), all the symptoms went away. One of them held the
contents of a PINE debug file; another was a binary file that looked like
a directory (?). Even so, there seemed to be no data loss after moving
the files, and salvaging the partition turned up no problems.

I don't know how this happened. Maybe AFS thought those file names were
free for new vnodes, tried to use them, and just gave up when they turned
out to be taken. So perhaps this was leftover crud: files no longer
referenced by anything, but still occupying those names? I don't know why
the salvager wouldn't catch that, though...

A kludge solution might be to check for the existence of a file before
opening it with O_EXCL, and move it out of the way if it is there.