[OpenAFS] Salvage and .gconf lock and other problems with OpenAFS 1.2.10 and 1.2.11

Renata Maria Dart <renata@SLAC.Stanford.EDU>
Fri, 19 Mar 2004 13:13:27 -0800 (PST)


We had been running OpenAFS fileservers uneventfully for a 
number of months until one of them, afs09, crashed in January.  Our 
fileservers are Sun 280R systems, each with 400 GB of storage, running 
Solaris 9 and a mix of OpenAFS 1.2.9 and 1.2.10, and now 1.2.11 on the 
problem fileserver.  Since the crash, we have had a number of further 
disturbing problems with that same fileserver:

1.  Afs09 crashed on January 22.  It was running 1.2.9 at that time.
I have the corefile.fs from that crash but I have not been able to 
figure out the cause.  After it restarted it was then running 1.2.10.
I have kept the core and the fileserver binary in case someone has time to look at them.

2.  After that crash, a number of our users found that they had
an unremovable GNOME lock file, <home dir>/.gconfd/lock/ior.  This 
prevented them from starting a GNOME session.  A bos salvage
of the user's home directory volume fixed the problem (even though 
the salvager had already run as a result of the crash) and left the 
following in SalvageLog:

03/19/2004 09:26:46 dir vnode 157: ./.gconfd/lock/ior (vnode 230): unique changed from 1519582 to 1519806
03/19/2004 09:26:46 dir vnode 157: ./.gconfd/lock/ior already claimed by directory vnode 1 (vnode 230, unique 1519582) -- deleted
03/19/2004 09:26:46 dir vnode 845: ./.gconf/%gconf-xml-backend.lock/ior (vnode 1198): unique changed from 1519585 to 1519736
03/19/2004 09:26:46 dir vnode 845: ./.gconf/%gconf-xml-backend.lock/ior already claimed by directory vnode 571 (vnode 1198, unique 1519585) -- deleted
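
In case it helps anyone collecting the affected paths from several
SalvageLogs, here is a minimal Python sketch that picks out the
"unique changed" entries.  It assumes each log entry sits on a single
line (not wrapped as a terminal paste may show it) and that the message
format matches the entries quoted above.  The function name
unique_changes is my own, not part of OpenAFS.

```python
import re

# Matches the "unique changed" SalvageLog entries quoted above.
# Assumption: one log entry per line, in exactly this message format.
UNIQUE_CHANGED = re.compile(
    r"dir vnode (?P<dir_vnode>\d+): (?P<path>.+?) \(vnode (?P<vnode>\d+)\): "
    r"unique changed from (?P<old>\d+) to (?P<new>\d+)"
)


def unique_changes(lines):
    """Yield (path, vnode, old_unique, new_unique) for each matching line."""
    for line in lines:
        m = UNIQUE_CHANGED.search(line)
        if m:
            yield (
                m.group("path"),
                int(m.group("vnode")),
                int(m.group("old")),
                int(m.group("new")),
            )
```

Feeding it the lines of a SalvageLog (for example, an open file object)
gives one tuple per affected directory entry; the "already claimed"
lines are skipped because they do not match the pattern.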

When I look at a "normal" .gconfd directory today I see a hard link:

renata@afs09 $ 9:46 cd .gconfd/lock
renata@afs09 $ 9:46 ls -lAi
total 0
 694881134 -rwx------   2 ljm      sf             0 Mar 19 09:31 .__afsE87F*
 694881134 -rwx------   2 ljm      sf             0 Mar 19 09:31 ior*
 
Is this part of the problem?
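
To check other users' lock directories for the same pattern, something
like the following Python sketch could be used.  It simply groups the
names in one directory by inode number and reports the groups with more
than one name.  The function name names_by_inode is hypothetical, and it
assumes the inode numbers reported by the AFS client are meaningful for
comparing entries within a single directory, as the ls -i output above
suggests.

```python
import os
from collections import defaultdict


def names_by_inode(directory):
    """Return {inode: sorted names} for names in `directory` that share an inode."""
    groups = defaultdict(list)
    for entry in os.scandir(directory):
        # Do not follow symlinks: we want the entry's own inode.
        st = entry.stat(follow_symlinks=False)
        groups[st.st_ino].append(entry.name)
    # Keep only inodes that appear under more than one name.
    return {ino: sorted(names) for ino, names in groups.items() if len(names) > 1}
```

Run against a directory like the one listed above, it would be expected
to report ior together with its .__afs* sibling under one inode number.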
 
 
3.  Last week we had a meltdown on this fileserver: the idle entries
in the output from the meltdown script dropped to 2 while wproc
jumped into the 7000s.  I could not find any one culprit with snoop
and eventually tried to restart the system.  After 40 minutes,
when it still had not restarted, I ended up killing the fileserver and 
restarting it.  Unfortunately I did not get any showproc output or a 
fileserver core; I will try to get them if we experience another meltdown.
There was nothing in the FileLog around the time of the meltdown.

4.  Again there were .gconf lock problems following this incident.  While 
repairing some of the .gconf lock problems, one of our bos salvages refused 
to complete.  It went on for 2 pages of 

   bos: waiting for salvage to complete.

The SalvageLog showed:

@(#) OpenAFS 1.2.11 built  2004-01-10 
03/18/2004 16:40:58 STARTING AFS SALVAGER 2.4 (/usr/afs/bin/salvager /vicepf 
536874785)

So I pressed Ctrl-C to stop it.  At this point the fileserver process
jumped to 100% of one of our CPUs (there are 2 on a 280R).
I noticed that there was still a salvage-tmp process in the output
of bos status and that 2 salvage processes were running.  So I did 
a bos shutdown of salvage-tmp, but the salvage processes continued
to run, accumulating no CPU time, so I killed these 2 processes.
The fileserver still continued to take all of one CPU.  Top showed
the time to be almost all in user, very little in kernel.  While the 
fileserver was using so much CPU, no vos commands would complete.  

I ended up doing a bos shutdown, which responded quickly and cleanly.
I upgraded to 1.2.11 and rebooted the system.


5.  This week, a few users were still reporting .gconf lock fallout
and we had a second bos salvage that wouldn't complete.  Again,
the Ctrl-C caused a high fileserver CPU situation in which no vos commands 
could complete.  This time I ran truss on the two salvage processes.  One was in:

   wait()                          (sleeping...)

and the other was:

   read(6, 0xFFBFF4D3, 1)          (sleeping...)

I also have showProcInfo output; again, if anyone has the time or 
inclination to look at it, please let me know.

Does anyone have any ideas about why the .gconf lock file is unremovable
after a fileserver crash?  Should that hard link be there in a normal
.gconfd/lock directory?  The salvager problem sounds like a bug.  Is this 
a known problem?  If not, is there some information that I can gather that 
would help you to debug it?

Thanks for any help you can provide.  

Renata




 Renata Dart                         | renata@SLAC.Stanford.edu  
 Stanford Linear Accelerator Center  |    
 2575 Sand Hill Road, MS 97          | (650) 926-2848 (office)
 Stanford, California   94025        | (650) 926-3329 (fax)
