[OpenAFS] RO failover issue
Brian Davidson
bdavids1@gmu.edu
Tue, 15 Mar 2005 11:49:00 -0500
We're running OpenAFS 1.2.13 on our servers, with 1.3.77 Windows
clients and 1.2.13 Linux (kernel 2.4.26) and OS X clients. Last week we had a
failure in the storage unit for one of our AFS fileservers, which
caused a read-only volume on that server to be taken offline. The RW
copy was on our other fileserver, and a RO copy was also present on
that other fileserver. None of our clients used the available RO
copy; instead they timed out trying to read from the one that
was offline.
Naturally, the volume in question was the one which holds the mount
points for all user home volumes... I removed the mount point and
re-added it with the -rw flag. That got things back up and running,
and I was able to resolve our other issues. We're back to using the RO
copy at this point.
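In case it helps anyone else, the workaround amounted to something like
this (the path is illustrative, not our real mount point):

```shell
# Hypothetical mount point -- substitute the real path in your cell.
# Remove the regular mount point (which resolves to user.readonly):
fs rmmount /afs/gmu.edu/user

# Re-add it with -rw so clients go straight to the RW volume:
fs mkmount /afs/gmu.edu/user user -rw
```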
So, my question is: why didn't the clients just use the other RO
copy? I thought that's what they were supposed to do. I'm sure I
missed some critical troubleshooting step which would have provided the
answer to what caused this. If so -- my bad. I probably won't be able
to provide such information at this point, as our lab systems are
re-imaged every night.
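For what it's worth, the checks I'd expect to run next time (assuming both
RO sites are actually registered in the VLDB) are roughly:

```shell
# Confirm the VLDB lists both RO sites for the volume (name from the logs):
vos examine user.readonly

# On a client, force a re-check of volume mappings and server reachability:
fs checkvolumes
fs checkservers
```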
My logs (a combination of a few of them) look like:
Mar 9 17:47:15 afs1 kernel: SCSI disk error : host 1 channel 0 id 0
lun 1 return code = 10000
Mar 9 17:47:16 2005 VGetVnode: Couldn't read vnode 3579, volume
536870922 (user.readonly); volume needs salvage
Mar 9 17:47:16 2005 Volume 536870922 forced offline: it needs salvaging!
Mar 9 17:47:16 afs1 kernel: SCSI disk error : host 1 channel 0 id 0
lun 1 return code = 10000
Mar 9 17:47:17 afs1 kernel: SCSI disk error : host 1 channel 0 id 0
lun 1 return code = 10000
Any thoughts on why the clients didn't fail over to the other RO copy
would be appreciated, as would any advice on what troubleshooting
should have been done.
Thanks in advance,
Brian Davidson
George Mason University