[OpenAFS] clients hang when server crashes
Steve Devine
sdevine@msu.edu
Wed, 28 Sep 2005 16:11:25 -0400
Today we had a weird failure that ended up affecting many of our campus
services.
A fileserver that holds nothing but user volumes became unresponsive.
This fileserver is physically in another building, on a different subnet.
Our web servers that mount /afs/msu/web-volumes went south. An ls of
/afs/ came back empty.
The volumes that the web servers mount are located on totally different
fileservers.
Rebooting this remote fileserver restored normal operation. In fact, as
soon as the server had been hard reset, the cell came back to normal,
even before this server was back online.
I think this is related to a problem we had about 6 months ago.
In an effort to provide disaster recovery, we put a root.cell.readonly
and a root.afs.readonly on this remote fileserver.
This proved troublesome due to network issues, so we moved
root.afs.readonly and root.cell.readonly back onto a server within our
own building. This was done over 6 months ago.
So, after this long story, here's my question:
Can I query the local cache and find out where the client thinks
root.cell.readonly is? My theory is that the clients (mostly Solaris)
think the root.* volumes are still on this remote fileserver, and when
the server gets wedged the clients hang. Why these clients can't find
the real volumes is beyond me.
vos exam root.cell tells me that these volumes are not on the affected
fileserver and that they are where I expect them to be.
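For reference, the client-side checks I am aware of look roughly like
the sketch below. It assumes /afs/msu is where root.cell is mounted
(that path is a guess for illustration), and report_locations is just a
made-up helper name. fs whereis reports the server the cache manager
believes holds a path, fs checkvolumes forces the cache manager to
refresh its volume information, and fs checkservers lists fileservers
the client considers down.

```shell
#!/bin/sh
# Sketch only: ask the local cache manager where it thinks paths live.
# The paths passed in below are assumptions; adjust them for your cell.

report_locations() {
    # Print the fileserver(s) the cache manager associates with each path.
    for path in "$@"; do
        fs whereis "$path"
    done
}

if command -v fs >/dev/null 2>&1; then
    # /afs is backed by root.afs; /afs/msu is assumed to be root.cell.
    report_locations /afs /afs/msu
    # Force the cache manager to refresh its volume information.
    fs checkvolumes
    # List any fileservers this client currently believes are down.
    fs checkservers
else
    echo "fs command not found; run this on an AFS client"
fi
```

Comparing the fs whereis output on a client against what vos exam
reports should show whether the client's view is stale.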
Any thoughts on why this is happening?
--
Steve Devine
Storage Systems
Academic Computing & Network Services
Michigan State University
506 Computer Center
East Lansing, MI 48824-1042
1-517-432-7327
Baseball is ninety percent mental; the other half is physical.
- Yogi Berra