[OpenAFS] File server, bos salvage hang

Miles Davis miles@cs.stanford.edu
Fri, 5 Nov 2004 20:14:12 -0800


Over the past couple of days, one of my file servers (RedHat Linux 9, 
openafs 1.2.11, nothing custom, LD_ASSUME_KERNEL=2.4.1 is set) has 
developed an annoying problem.

On occasion, we hit the classic gconf problem where, for reasons I don't 
know (but have heard are fixed in 1.3.X), a user's gconf lock file 
.gconfd/lock/ior becomes corrupt and/or unusable, requiring a salvage of 
the volume. Normally this is not a big deal; it happens only rarely. 
However, I've got a file server on which I can no longer salvage volumes: 
running bos salvage <server> <part> <vol> never finishes, and the file 
server is never quite the same again until a restart (killing the file 
server) or reboot. By "never quite the same" I mean things like 'vos 
listvol' fail, though the file server is still working for volumes other 
than the one being salvaged. I haven't seen this behaviour with any of our 
other file servers, ever.
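
To make the "never finishes" symptom easier to detect, a wrapper could run 
the salvage under a deadline rather than waiting on it by hand. The 
following is only a rough sketch: the helper names (run_with_deadline, 
salvage_command) and the 15-minute deadline are made up for illustration, 
and it assumes the bos binary is on the PATH. Note that the timeout only 
kills the local bos client; it probably does not stop a salvager that is 
already running on the server, so it is a way to notice the hang, not a 
fix for it.

```python
import subprocess
import sys


def salvage_command(server, partition, volume):
    """Build the argument list for a single-volume bos salvage."""
    return ["bos", "salvage",
            "-server", server,
            "-partition", partition,
            "-volume", volume]


def run_with_deadline(cmd, seconds):
    """Run cmd and wait at most `seconds` for it to exit.

    Returns (finished, returncode). If the deadline passes, the local
    child process is killed and (False, None) is returned.
    """
    try:
        proc = subprocess.run(cmd, timeout=seconds)
        return True, proc.returncode
    except subprocess.TimeoutExpired:
        return False, None


if __name__ == "__main__":
    # Hypothetical invocation: script.py <server> <part> <vol>
    server, partition, volume = sys.argv[1:4]
    finished, rc = run_with_deadline(
        salvage_command(server, partition, volume), 15 * 60)
    if not finished:
        print("bos salvage did not return within the deadline")
    else:
        print("bos salvage exited with status", rc)
```

If the deadline is hit, that would be the moment to collect the FileLog 
and SalvageLog from the server before restarting anything.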

Before starting another salvage, I turned on logging via kill -TSTP 
<fileserver>, but I don't see anything standing out. Maybe somebody else 
does. I let it run in debug for about 10 minutes and then turned it off 
again. The result is at http://cs.stanford.edu/people/miles/FileLog.


-- 
// Miles Davis - miles@cs.stanford.edu - http://www.cs.stanford.edu/~miles
// Computer Science Department - Computer Facilities
// Stanford University