[OpenAFS] Salvage and .gconf lock and other problems with OpenAFS
1.2.10 and 1.2.11
Derrick J Brashear
shadow@dementia.org
Fri, 19 Mar 2004 18:05:40 -0500 (EST)
On Fri, 19 Mar 2004, Renata Maria Dart wrote:
> We have been happily running OpenAFS fileservers uneventfully for a
> number of months until one of them, afs09, crashed in January. Our
> fileservers are Sun 280R systems each with 400gb of storage, running
> solaris 9 and a mix of OpenAFS 1.2.9 and 1.2.10, and now 1.2.11 on the
> problem fileserver. Since the crash, we have had a number of further
> disturbing problems with that same fileserver:
>
> 1. Afs09 crashed on January 22. It was running 1.2.9 at that time.
> I have the corefile.fs from that crash but I have not been able to
> figure out the cause. After it restarted it was then running 1.2.10.
> I have the core and the fileserver in case someone has time to look at it.
As it happens, a problem which would affect Solaris fileservers was
introduced in 1.2.9 and fixed in 1.2.10.
> 2. After that crash, a number of our users found that they had
> an unremovable gnome lock file, <home dir>/.gconfd/lock/ior. This
> prevented them from starting up a gnome session. A bos salvage
> of the user's home directory volume fixed the problem (despite the
> fact that salvage already ran as a result of the crash) and left the
> following in SalvageLog:
I find it odd you didn't have this one before. So far no one can offer a
simple test case for making it happen; it's always "run gconf on 2
machines and pray".
> 3. Last week we had a meltdown on this fileserver....the idle entries
> in the output from the meltdown script dropped to 2 while the wproc
> jumped up to the 7000s. I could not find any one culprit with snoop
> and eventually tried to restart the system. After 40 minutes,
> when it didn't restart, I ended up killing the fileserver and
> restarting it. I unfortunately did not get any showproc output or a
> fileserver core...I will try to get that if we experience another one.
> There was nothing in the FileLog around the time of the meltdown.
With no information I can't really guess at that one.
[]
> So, I did a ctl-C to stop it. At this point the fileserver process
> jumped up to use 100% of one of our cpus (there are 2 on a 280R).
> I noticed that there was still a salvage-tmp process in the output
> of bos status and there were 2 salvage processes running. So, I did
> a bos shutdown of salvage-tmp. But, the salvage processes continued
> to run, accumulating no cpu time. So I killed these 2 processes.
> But the fileserver continued to take all of one cpu. Top showed
> it to be all in user, very little in kernel. While the fileserver
> was using so much cpu, no vos commands would complete.
>
> I ended up doing a bos shutdown, which responded quickly and cleanly.
> I upgraded to 1.2.11 and rebooted the system.
No guesses on this one either, at least not without more information.
> 5. This week, a few users were still reporting .gconf lock fallout
> and we had a second bos salvage that wouldn't complete. And again
> the ctl-C caused a high fileserver cpu situation where no vos commands
> could complete. This time I trussed the two salvage processes:
>
> wait() (sleeping...)
That's the parent.
> and the other was:
>
> read(6, 0xFFBFF4D3, 1) (sleeping...)
What was fd 6 attached to?
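A one-byte read() that sleeps usually means the descriptor is a pipe,
socket or terminal rather than a regular file, which is why the question
matters. On Solaris, running pfiles against the salvager's pid should list
each open descriptor and its type. As a rough illustration of the same
idea, here is a minimal Python sketch that classifies a descriptor with
fstat(). It only inspects descriptors of the process it runs in, and
describe_fd is a hypothetical helper name, not anything from OpenAFS.

```python
import os
import stat


def describe_fd(fd):
    """Return a rough label for what an open descriptor in this process refers to."""
    mode = os.fstat(fd).st_mode
    if stat.S_ISFIFO(mode):
        return "pipe or FIFO"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISCHR(mode):
        return "character device"
    if stat.S_ISBLK(mode):
        return "block device"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"


if __name__ == "__main__":
    # Demonstrate on a freshly created pipe: a read() on the read end
    # would block until the other end writes or closes.
    read_end, write_end = os.pipe()
    print(read_end, describe_fd(read_end))
    os.close(read_end)
    os.close(write_end)
```

If fd 6 turned out to be a pipe to the parent or a terminal, that would
suggest the child was waiting on input rather than doing salvage work.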
> Does anyone have any ideas about the .gconf lock file being unremovable
> after a fileserver crash? Should that hard link be there in a normal
*If* the problem is caused by a fileserver restart (a crash would be a
subset of that), perhaps it would make sense. People have reported this
problem without a fileserver crash, and I didn't think to ask "did the
fileserver restart?".
Basically, a reproducible test case would go miles towards finding an
answer.