[OpenAFS] Re: odd problem with RW site after a botched replica

Tue, 30 Oct 2012 23:11:53 -0400

On Tue, Oct 30, 2012 at 11:07 PM, Timothy Balcer <timothy@telmate.com> wrote:
>
>
> On Tue, Oct 30, 2012 at 7:33 AM, Kim Kimball <kim@thekimballs.com> wrote:
>>
>> If you have access to a recent RO, the quickest fix may be to vos dump it
>> and restore the RW from it.  NB: if only one RO is currently available,
>> dumping it makes it busy, and with no alternate the RO will be
>> unavailable to all clients.
>>
>
> Thanks for that tip; however, in my efforts to get the RW site functioning, I
> removed the RO replica.
>
> In other news, the latest salvage has been running for 12 hours... I straced
> the busiest pid and it is happily verifying all the links and contents
> (open(), close(), pread() ad infinitum), so it's not wedged. This volume has
> literally slightly less than 32k directory entries in various places (yes, I
> made SURE the limits were observed ;-) ) and so I imagine it will take a
> very long time to traverse the entire thing... interesting that this is the
> fourth salvage and it actually seems to be working at it this time. Last
> three times it stopped after a bit over an hour.
>
> I suspect that the resources given to the afs server were too limited to
> actually get the salvage done properly. One thing I did this time was
> increase the memory to the server up to 8GB, and free shows it tooling
> merrily along with plenty of buffers and cache now.
>
> I did THAT because I noticed that the kernel killed the salvage operation
> the first two times due to out-of-memory conditions, something I had not
> checked or expected. So it may be that this is the second "true" salvage,
> and it may succeed.
>
> I'll keep you all posted. There wasn't an error in the AFS logs that
> indicated that salvager processes had been killed due to OOM. It was only in
> the kernel logs.

The OOM killer *is* the kernel, so the AFS logs just know it's dead,
not that the kernel decided "heeeey...."
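
Since the only record of an OOM kill lives in the kernel log, here is a
minimal sketch of scanning kernel log lines for it. Assumptions: the
kernel reports the kill with a "Killed process <pid> (<name>)" message
(the exact wording varies between kernel versions), and the function
name, the sample lines, and the "salvager" process name below are made
up for illustration, not taken from the logs in this thread.

```python
import re

# Assumed message shape: "... Killed process <pid> (<name>) ...".
# Newer kernels prefix it with "Out of memory: "; search() handles both.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return (pid, command) pairs for every OOM-kill message in lines."""
    kills = []
    for line in lines:
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Hypothetical sample lines, invented for this example.
SAMPLE_LOG = [
    "[12345.678] Out of memory: Kill process 4321 (salvager) score 912 or sacrifice child",
    "[12345.679] Killed process 4321 (salvager) total-vm:4194304kB, anon-rss:3145728kB, file-rss:0kB",
    "[12400.000] EXT4-fs (sda1): mounted filesystem with ordered data mode",
]

if __name__ == "__main__":
    for pid, command in find_oom_kills(SAMPLE_LOG):
        print(f"OOM killer terminated pid {pid} ({command})")
```

The same function can be fed the output of dmesg or the lines of
whatever file your syslog daemon writes kernel messages to; that
location differs between distributions, so check yours.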

-- 
Derrick