[OpenAFS] Salvage and .gconf lock and other problems with OpenAFS 1.2.10 and 1.2.11

Derrick J Brashear shadow@dementia.org
Fri, 19 Mar 2004 18:05:40 -0500 (EST)


On Fri, 19 Mar 2004, Renata Maria Dart wrote:

> We have been happily running OpenAFS fileservers uneventfully for a
> number of months until one of them, afs09, crashed in January.   Our
> fileservers are Sun 280R systems each with 400gb of storage, running
> solaris 9 and a mix of OpenAFS 1.2.9 and 1.2.10, and now 1.2.11 on the
> problem fileserver.  Since the crash, we have had a number of further
> disturbing problems with that same fileserver:
>
> 1.  Afs09 crashed on January 22.  It was running 1.2.9 at that time.
> I have the corefile.fs from that crash but I have not been able to
> figure out the cause.  After it restarted it was then running 1.2.10.
> I have the core and the fileserver in case someone has time to look at it.

As it happens, a problem affecting Solaris fileservers was
introduced in 1.2.9 and fixed in 1.2.10.

> 2.  After that crash, a number of our users found that they had
> an unremovable gnome lock file, <home dir>/.gconfd/lock/ior.  This
> prevented them from starting up a gnome session.   A bos salvage
> of the user's home directory volume fixed the problem (despite the
> fact that salvage already ran as a result of the crash) and left the
> following in SalvageLog:

I find it odd you didn't have this one before. So far no one can offer a
simple test case for making it happen; it's always "run gconf on 2
machines and pray".

> 3.  Last week we had a meltdown on this fileserver....the idle entries
> in the output from the meltdown script dropped to 2 while the wproc
> jumped up to the 7000s.  I could not find any one culprit with snoop
> and eventually tried to restart the system.  After 40 minutes,
> when it didn't restart, I ended up killing the fileserver and
> restarting it.  I unfortunately did not get any showproc output or a
> fileserver core...I will try to get that if we experience another one.
> There was nothing in the FileLog around the time of the meltdown.

With no information I can't really guess at that one.

[]

> So, I did a ctl-C to stop it.  At this point the fileserver process
> jumped up to use 100% of one of our cpus (there are 2 on a 280R).
> I noticed that there was still a salvage-tmp process in the output
> of bos status and there were 2 salvage processes running.  So, I did
> a bos shutdown of salvage-tmp.  But, the salvage processes continued
> to run, accumulating no cpu time.   So I killed these 2 processes.
> But the fileserver continued to take all of one cpu.  Top showed
> it to be all in user, very little in kernel.  While the fileserver
> was using so much cpu, no vos commands would complete.
>
> I ended up doing a bos shutdown, which responded quickly and cleanly.
> I upgraded to 1.2.11 and rebooted the system.

No guesses on this one either, at least not without more information.

> 5.  This week, a few users were still reporting .gconf lock fallout
> and we had a second bos salvage that wouldn't complete.  And again
> the ctl-C caused a high fileserver cpu situation where no vos commands
> could complete.  This time I trussed the two salvage processes:
>
>    wait()                          (sleeping...)

That's the parent, waiting on its child.

> and the other was:
>
>    read(6, 0xFFBFF4D3, 1)          (sleeping...)

what was fd 6 attached to?
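
(On Solaris, running pfiles against the salvager's pid should list what
each descriptor refers to. For anyone who wants to see the kind of check
involved, here is a minimal sketch in Python that classifies a
descriptor with fstat. describe_fd is a hypothetical helper name, not
part of OpenAFS, and it only inspects descriptors open in the calling
process, so it illustrates the idea rather than replacing pfiles.)

```python
import os
import stat


def describe_fd(fd):
    """Return a short label for the kind of object an open descriptor refers to.

    Hypothetical helper for illustration only; works on descriptors that are
    open in the current process.
    """
    mode = os.fstat(fd).st_mode
    if stat.S_ISFIFO(mode):
        return "fifo"        # pipe or named pipe
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISCHR(mode):
        return "chardev"     # e.g. a terminal or /dev/null
    if stat.S_ISBLK(mode):
        return "blockdev"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISREG(mode):
        return "file"
    return "other"


if __name__ == "__main__":
    # Demonstrate on a freshly created pipe.
    r, w = os.pipe()
    print(r, describe_fd(r))
    os.close(r)
    os.close(w)
```

If fd 6 turns out to be a pipe, socket or terminal, a one-byte read that
sleeps would suggest the process is waiting on input from a peer rather
than on disk, but that is only a guess until the descriptor is known.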

> Does anyone have any ideas about the .gconf lock file being unremovable
> after a fileserver crash?  Should that hard link be there in a normal

*If* the problem is caused by a fileserver restart (a crash would be a
subset of that), perhaps it would make sense. People have reported this
problem without a fileserver crash, and I didn't think to ask "did the
fileserver restart?".

Basically, a reproducible test case would go miles towards finding an
answer.