[OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug

Andy Cobaugh phalenor@gmail.com
Fri, 4 Mar 2011 19:42:04 -0500 (EST)


On 2011-03-04 at 16:30, Andrew Deason ( adeason@sinenomine.net ) said:
> On Fri, 4 Mar 2011 17:20:34 -0500 (EST)
> Andy Cobaugh <phalenor@gmail.com> wrote:
>
>>> The first issue you reported had problems much earlier before the
>>> log messages you gave. Did anything happen to the backup volume
>>> before that?  No messages referencing that volume id? Did you or
>>> someone/thing else remove the backup clone or anything?
>>
>> Nope. We don't even access the backup volume when doing the file-level
>> backups anymore.
>
> Well, _something_ deleted it, unless it didn't exist before 1 mar 2011.
> This message

It certainly did exist before that, and nothing I did and no part of our 
backup system would have deleted it.

> Tue Mar  1 00:02:12 2011 VReadVolumeDiskHeader: Couldn't open header for volume 536871061 (errno 2)
>
> means the volume doesn't exist. It's not that it's corrupt or anything;
> the volume was completely deleted. (or something just deleted the .vol
> header, but the other messages suggest it was deleted normally)

What does 'deleted normally' mean in this context? Nothing has touched the 
volume since the previous night, when the .backup volume was created just 
fine. Unfortunately, those logs have since rolled over, so I don't have 
anything older than when I restarted the fileserver at 16:12 on Mar 1.
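
For reference, the "errno 2" in the VReadVolumeDiskHeader line quoted above 
is ENOENT, which matches the reading that the header file was simply not 
there. A minimal sketch of decoding such a line follows; the regex and the 
decode() helper are illustrative assumptions that cover only this one 
message shape, and they are not part of OpenAFS.

```python
import errno
import os
import re

# Log line copied verbatim from the message quoted above.
LINE = ("Tue Mar  1 00:02:12 2011 VReadVolumeDiskHeader: "
        "Couldn't open header for volume 536871061 (errno 2)")

# Hypothetical pattern: matches only this one message shape.
PATTERN = re.compile(r"Couldn't open header for volume (\d+) \(errno (\d+)\)")


def decode(line):
    """Return (volume_id, errno_name, errno_text), or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    volume_id = int(match.group(1))
    code = int(match.group(2))
    # errno.errorcode maps numeric codes to symbolic names such as ENOENT.
    name = errno.errorcode.get(code, "unknown")
    return volume_id, name, os.strerror(code)


if __name__ == "__main__":
    print(decode(LINE))
```

The description string returned by os.strerror() is platform-dependent, so 
only the symbolic name should be relied on when comparing across servers.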

>> Yes, the zaps were me trying to get the .backup into a usable state.
>> Though, the first string of salvages started in the middle of the
>> afternoon without any intervention - I think the event that caused
>> them is what's missing from the picture.
>
> Well, do you have the messages from around then?

Ugh, no. Hopefully I will if it happens again.

>> I'm still a little hesitant to bos salvage that server - whole reason
>> we're trying to switch to DAFS is to avoid the multi-hour fileserver
>> outages.
>
> Salvaging a single volume is the same as a demand-salvage; it is no
> slower and no more impactful than an automatically-triggered one. But
> you can manually trigger the salvage of a single volume group in cases
> like this (e.g. when the fileserver refuses to because it's been
> salvaged too many times).

Ok, I had to bos salvage the .backup volume directly with -forceDAFS. When 
I did this when this happened on my machine at home, it wasn't so easy. In 
that case, it was with an RO clone. I think I had to remsite, then remove 
or zap or some combination, along with manually deleting the .vol. I wish 
I had paid closer attention then.

I still have no idea what caused the volume to spontaneously need 
salvaging Tuesday afternoon. I did notice that, until I fixed the BK 
volume, running 'vos exam home.gsong.backup' triggered a salvage.

Wish I had more to go on. I'll be working on standardizing our logging 
configuration across servers next week, logging via syslog, etc., so we 
don't lose valuable logs like this.

--andy