[OpenAFS] Linux client connection timed out after server failure

Bob Hoffman hoffman@cs.pitt.edu
Thu, 16 Aug 2012 10:22:27 -0400

One of our file servers died and I restored all of its volumes to
another file server.  I've searched through the openafs-info archives
and tried everything I could find but nothing seems to help.

A number of clients persist in saying "Connection timed out" even after
the volumes were brought on-line on the new server.

Both the server that failed and the new server run OpenAFS 1.6.1 on
CentOS 6.2.

Three of the clients that are getting timeouts are:

OpenAFS 1.4.11 on RHEL4
OpenAFS 1.4.12 on CentOS 5.5
OpenAFS 1.4.14 on CentOS 5.5

The volume that's timing out is 'cast', which is mounted in a volume
called 'projects'.  The 'projects' volume is replicated.  'fs
checkservers' reports that the failed server is not responding, which
is correct since the machine is completely down.  No volumes are listed
in the VLDB as being on that dead server.
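
For reference, the checks described above can be sketched roughly as the
script below.  It is only a sketch under stated assumptions: the host name
dead-fs.example.edu is a hypothetical placeholder (the failed server is not
named here), and by default the script only prints the commands instead of
running them.  Set DRY_RUN=0 on a client with tokens to actually run them.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands unless DRY_RUN=0 is set.
# DEAD_SERVER is a hypothetical placeholder, not the real host name.
DRY_RUN="${DRY_RUN:-1}"
DEAD_SERVER="${DEAD_SERVER:-dead-fs.example.edu}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

check_state() {
    # Ask the cache manager which file servers it considers down.
    run fs checkservers
    # List VLDB entries that still reference the dead server (expected: none).
    run vos listvldb -server "$DEAD_SERVER"
    # Show the VLDB entries for the two volumes involved.
    run vos listvldb -name cast
    run vos listvldb -name projects
    # Show which server the client believes holds each path.
    run fs whereis /afs/cs.pitt.edu/projects/cast
    run fs whereis /afs/cs.pitt.edu/projects
}

check_state
```

On an affected client the 'fs whereis' calls may themselves time out, which
would at least confirm that the client is still pointing at the old server.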

Here is what I've tried so far with no effect whatsoever:

fs flushmount /afs/cs.pitt.edu/projects/cast
fs flushmount /afs/cs.pitt.edu/projects
fs flushmount /afs/.cs.pitt.edu/projects/cast
fs flushmount /afs/.cs.pitt.edu/projects
fs flushvolume /afs/cs.pitt.edu/projects/cast
fs flushvolume /afs/cs.pitt.edu/projects
fs flushvolume /afs/.cs.pitt.edu/projects/cast
fs flushvolume /afs/.cs.pitt.edu/projects
vos release projects
ls -l /afs/cs.pitt.edu/projects
ls -l /afs/.cs.pitt.edu/projects

Is there anything I can do, short of a client reboot, to fix this?
Shouldn't AFS have a more graceful recovery in this kind of situation?
Why doesn't the client see that the volume has moved to a new server?

      Thanks in advance,