[OpenAFS] Re: odd problem with RW site after a botched replica

Wed, 31 Oct 2012 10:29:20 -0500

On Tue, 30 Oct 2012 20:07:57 -0700
Timothy Balcer <timothy@telmate.com> wrote:

> In other news, the latest salvage has been running for 12 hours... I
> straced the busiest pid and it is happily verifying all the links and
> contents (open(), close(), pread() ad infinitum), so it's not wedged.
> This volume has literally slightly less than 32k directory entries in
> various places (yes, I made SURE the limits were observed ;-) ) and so
> I imagine it will take a very long time to traverse the entire
> thing... interesting that this is the fourth salvage and it actually
> seems to be working at it this time. Last three times it stopped after
> a bit over an hour.

I am just curious: does the machine seem to be CPU-bound during this
process? There has been some work done to parallelize this, so in the
future this could be faster (if, among other things, it seems CPU-bound
and you have multiple cores).
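
One rough way to check is to sample the CPU time of the busy salvager
pid over a few seconds. The following is only a sketch, assuming a Linux
host with /proc mounted; the function names (cpu_ticks, cpu_fraction,
sample) are made up for illustration and are not part of OpenAFS. A
result near 1.0 means the process is saturating one core; a much lower
value suggests it is mostly waiting on disk I/O.

```python
import os
import sys
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split on the last ')' first.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the state (field 3), so utime (field 14) is fields[11]
    # and stime (field 15) is fields[12].
    return int(fields[11]) + int(fields[12])


def cpu_fraction(ticks_before, ticks_after, seconds, hz):
    """Fraction of a single core used during the sampling interval."""
    return (ticks_after - ticks_before) / (hz * seconds)


def sample(pid, seconds=5.0):
    """Sample one process twice and return its CPU fraction (Linux only)."""
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    path = "/proc/%d/stat" % pid
    with open(path) as f:
        before = cpu_ticks(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = cpu_ticks(f.read())
    return cpu_fraction(before, after, seconds, hz)


# Hypothetical usage: python cpucheck.py <pid of the busy salvager process>
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1].isdigit():
    print("CPU fraction of one core: %.2f" % sample(int(sys.argv[1])))
```

This only looks at a single process, so if several salvager children
are running, each pid would need to be sampled separately.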

> I'll keep you all posted. There wasn't an error in the AFS logs that
> indicated that salvager processes had been killed due to OOM. It was
> only in the kernel logs.

If you started this via 'bos salvage', there should be something in
BosLog to say that it was killed by signal 9.
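
If it helps, a quick way to look for that is to scan BosLog for any
line mentioning signal 9. This is a minimal sketch: the default path
below is an assumption (the traditional Transarc-style location, which
varies by packaging), the helper name killed_lines is made up, and it
matches only the text "signal 9" rather than any exact log format.

```python
import re
import sys

# Match "signal 9" but not, for example, "signal 91".
SIGNAL_9 = re.compile(r"\bsignal 9\b", re.IGNORECASE)

# Assumed default location; pass the real path as the first argument.
DEFAULT_BOSLOG = "/usr/afs/logs/BosLog"


def killed_lines(lines):
    """Return the log lines that mention signal 9, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if SIGNAL_9.search(line)]


if __name__ == "__main__" and len(sys.argv) <= 2:
    path = sys.argv[1] if len(sys.argv) == 2 else DEFAULT_BOSLOG
    try:
        with open(path, errors="replace") as f:
            for hit in killed_lines(f):
                print(hit)
    except OSError as exc:
        print("could not read %s: %s" % (path, exc))
```

No output would mean bosserver did not record a signal 9 in that file,
which would be worth comparing against the kernel's OOM messages.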

-- 
Andrew Deason
adeason@sinenomine.net