[OpenAFS] Re: Crazy DAFS problem (with log)

Andrew Deason adeason@sinenomine.net
Sun, 20 Mar 2011 19:42:12 -0500

On Sun, 20 Mar 2011 16:50:16 -0500
"Ryan C. Underwood" <nemesis-lists@icequake.net> wrote:

> Shortly after the weekly scheduled fileserver restart, things blew up
> in a big way.  My RW root.cell was inaccessible in the end.  No kernel
> messages indicating filesystem or disk problems underneath.  I
> force-fscked the vice partition (ext4) and no problems were found.
> Any speculation on what in the world happened here?  This is running
> -pre3 with the patch from Andrew mentioned earlier.

What's in SalsrvLog and SalsrvLog.old? They should have a bunch of
detail on the salvages here. Also, BosLog for this time period would be
good for completeness.

I am also assuming there are no SalvageLog* files with entries anywhere
near this time. If that's not true, include those.

> Sun Mar 20 04:00:30 2011 File Server started Sun Mar 20 04:00:30 2011
> Sun Mar 20 04:02:02 2011 Scheduling salvage for volume 536870915 on part /vicepa over SALVSYNC

There are not many places where VRequestSalvage_r can get called
without logging something about why. I think the only ones are when the
volume header flags say that the volume needs salvaging.

> Sun Mar 20 04:04:01 2011 VAttachVolume: Error reading diskDataHandle header for vol 536870916; error=101
> Sun Mar 20 04:04:01 2011 Scheduling salvage for volume 536870916 on part /vicepa over SALVSYNC

"volume metadata is corrupt for 536870916"

> Sun Mar 20 04:14:14 2011 CopyOnWrite failed: volume 536871273 in partition /vicepa  (tried reading 8192, read 0, wrote 0, errno 0) volume needs salvage

Someone wrote to a file in 536871273, and we needed to CoW the file. But
we couldn't read all of the data from the file to make a new copy. Based
on those values, I'd guess the file is empty on disk, but it's supposed
to contain some data.
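
To illustrate why those numbers point at an empty file: a copy loop that
is told the file length up front will ask for a full buffer, get nothing
back at end-of-file, and have no OS error to report. The following is
only a minimal Python sketch of that pattern, not the fileserver's
actual CopyOnWrite code; the function name copy_exact and the 8192-byte
buffer size are assumptions chosen to mirror the log line.

```python
import io


def copy_exact(src, dst, length, bufsize=8192):
    """Copy exactly `length` bytes from src to dst (hypothetical helper).

    Raises IOError if src runs out of data before `length` bytes were
    copied, which is the situation the log message describes.
    """
    remaining = length
    while remaining > 0:
        want = min(bufsize, remaining)
        chunk = src.read(want)
        if not chunk:
            # End of file before the expected length: the read returns
            # no data, and there is no OS-level error to go with it.
            copied = length - remaining
            raise IOError(
                f"tried reading {want}, read 0, wrote 0 "
                f"({copied} of {length} bytes copied)"
            )
        dst.write(chunk)
        remaining -= len(chunk)
    return length
```

Calling copy_exact(io.BytesIO(b""), io.BytesIO(), 8192) raises on the
first read, which is the "metadata says 8192 bytes, disk has none" case.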

With a patch on the master branch, such a situation should be caught
earlier and more information logged, though that wouldn't help much if
the file really is empty on disk. You can find files like this by running
'volinfo /vicepa 536871273 -filenames', looking at the alleged length
for each vnode, and checking whether the actual on-disk size of each
reported filename matches it.
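
The length-versus-size comparison can be scripted once the (filename,
alleged length) pairs have been pulled out of the volinfo output. The
following is a rough Python sketch under that assumption; it
deliberately does not parse volinfo output itself, and the function name
find_size_mismatches and its input format are made up for illustration.

```python
import os


def find_size_mismatches(entries):
    """Compare alleged vnode lengths against on-disk file sizes.

    `entries` is an iterable of (path, expected_length) pairs, assumed
    to have been extracted by hand or by a separate script from the
    volinfo output. Returns a list of (path, expected, actual) tuples
    for every file whose size differs; actual is None if the file is
    missing.
    """
    mismatches = []
    for path, expected in entries:
        try:
            actual = os.stat(path).st_size
        except FileNotFoundError:
            actual = None
        if actual != expected:
            mismatches.append((path, expected, actual))
    return mismatches
```

Any entry that comes back with an actual size of 0 and a nonzero
expected length would be a candidate for the file behind the
CopyOnWrite failure.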

However, salvaging is supposed to fix that, so it would be good to see
what the salvage logs say.

Andrew Deason