[OpenAFS] RO failover issue

Brian Davidson bdavids1@gmu.edu
Tue, 15 Mar 2005 11:49:00 -0500


We're running OpenAFS 1.2.13 on our servers, with 1.3.77 Windows 
clients and 1.2.13 clients on Linux (kernel 2.4.26) and OS X.  Last 
week the storage unit behind one of our AFS fileservers failed, which 
caused a read-only volume on that server to be taken offline.  The RW 
copy was on our other fileserver, and an RO copy was also present on 
that other fileserver.  None of our clients used the available RO 
volume; instead they timed out trying to read from the one volume 
which was offline.

Naturally, the volume in question was the one which holds the mount 
points for all user home volumes...  I removed the mount point, and 
re-added it, with the -rw flag.  That got things back up and running, 
and I was able to resolve our other issues.  We're back to using the RO 
copy at this point.
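
For reference, the remount was essentially the following (the cell, 
volume, and directory names here are illustrative, not our exact 
layout; note that mount points have to be edited via the read-write 
path to the parent volume):

```shell
# Remove the existing (regular) mount point, then re-add it with
# -rw so it always resolves to the read-write volume, bypassing
# the offline RO copy.  Paths and volume name are placeholders.
fs rmmount -dir /afs/.example.edu/user
fs mkmount -dir /afs/.example.edu/user -vol user -rw
```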

So, my question is:  why didn't the clients just use the other RO 
copy?  I thought that's what they were supposed to do.  I'm sure I 
missed some critical troubleshooting step which would have provided 
the answer to what caused this.  If so -- my bad.  I probably won't 
be able to gather that information at this point, as our lab systems 
are re-imaged every night.


My logs (entries combined from a few different log files) look like:

Mar  9 17:47:15 afs1 kernel: SCSI disk error : host 1 channel 0 id 0 
lun 1 return code = 10000
Mar 9 17:47:16 2005 VGetVnode: Couldn't read vnode 3579, volume 
536870922 (user.readonly); volume needs salvage
Mar 9 17:47:16 2005 Volume 536870922 forced offline: it needs salvaging!
Mar  9 17:47:16 afs1 kernel: SCSI disk error : host 1 channel 0 id 0 
lun 1 return code = 10000
Mar  9 17:47:17 afs1 kernel: SCSI disk error : host 1 channel 0 id 0 
lun 1 return code = 10000

Any thoughts on why it didn't fail over to the other RO copy would be 
appreciated, as would any advice on what troubleshooting should have 
been done.

Thanks in advance,

Brian Davidson
George Mason University