[OpenAFS] clients hang when server crashes

Steve Devine sdevine@msu.edu
Wed, 28 Sep 2005 16:11:25 -0400


Today we had a weird failure that ended up affecting many of our campus 
services.
A fileserver that holds nothing but user volumes became unresponsive.
This fileserver is physically in another building, on a different subnet.
Our web servers that mount /afs/msu/web-volumes went south. An ls of 
/afs/ came back empty.
The volumes that the web servers mount are located on totally different 
fileservers.
Rebooting this remote fileserver restored normal operation. In fact, as 
soon as the server had been hard reset, the cell came back to normal, 
even before this server was back online.

I think this is related to a problem we had about 6 months ago.
In an effort to provide disaster recovery, we put a root.cell.readonly 
and a root.afs.readonly on this remote fileserver.
This proved troublesome because of network issues, so we moved 
root.afs.readonly and root.cell.readonly back onto a server within our 
own building. This was done over 6 months ago.
So after this long story here's my question:
Can I query the local cache and find out where the client thinks 
root.cell.readonly is? My theory is that the clients (mostly Solaris) 
think the root.* volumes are still on this remote fileserver, and when 
the server gets wedged the clients hang. Why these clients can't find 
the real volumes is beyond me.
vos exam root.cell tells me that these volumes are not on the affected 
fileserver and that they are where I expect them to be.
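
To make the comparison concrete, here is a rough sketch of the check I
have in mind: the VLDB's view of the root volumes next to one client's
view. It assumes the standard OpenAFS vos and fs commands are on the
PATH and that /afs is mounted; the function name is made up for
illustration, and I have not verified that this exposes stale location
data on our Solaris clients.

```shell
#!/bin/sh
# Sketch only: compare the VLDB's view of the root volumes with what
# this client's cache manager reports. compare_root_locations is a
# hypothetical helper name, not part of OpenAFS.

compare_root_locations() {
    # Bail out early when the OpenAFS tools are not installed.
    if ! command -v vos >/dev/null 2>&1 || ! command -v fs >/dev/null 2>&1; then
        echo "vos/fs not found; run this on an AFS client" >&2
        return 2
    fi

    # Server-side view: where the VLDB says each root volume lives,
    # including its read-only sites.
    for vol in root.afs root.cell; do
        vos examine "$vol"
    done

    # Client-side view: which fileservers this cache manager reports
    # for the top of the tree and for our cell's root.
    fs whereis /afs /afs/msu
}
```

Calling compare_root_locations on an affected client and on a healthy
one should show whether the two disagree about where root.afs and
root.cell are served from. If they do, fs checkvolumes is documented to
make the cache manager refresh its volume information, though whether
that would have prevented this hang is only a guess.
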
Any thoughts on why this is happening?

-- 
Steve Devine
Storage Systems
Academic Computing & Network Services
Michigan State University

506 Computer Center
East Lansing, MI 48824-1042
1-517-432-7327

Baseball is ninety percent mental; the other half is physical.
- Yogi Berra