[OpenAFS] Re: Linux client connection timed out after server failure

Andrew Deason adeason@sinenomine.net
Thu, 16 Aug 2012 10:03:25 -0500


On Thu, 16 Aug 2012 10:22:27 -0400
Bob Hoffman <hoffman@cs.pitt.edu> wrote:

> A number of clients persist in saying "Connection timed out" even after
> the volumes were brought on-line on the new server.

Clients cache volume location information for 2 hours. They will
continue to think that the volumes are on the old server until you
invalidate the cache, or they receive a certain type of error. "The
server is not responding" is not one of those errors.

> Here is what I've tried so far with no effect whatsoever:
> 
> fs flushmount /afs/cs.pitt.edu/projects/cast
> fs flushmount /afs/cs.pitt.edu/projects
> fs flushmount /afs/.cs.pitt.edu/projects/cast
> fs flushmount /afs/.cs.pitt.edu/projects
> fs flushvolume /afs/cs.pitt.edu/projects/cast
> fs flushvolume /afs/cs.pitt.edu/projects
> fs flushvolume /afs/.cs.pitt.edu/projects/cast
> fs flushvolume /afs/.cs.pitt.edu/projects
> vos release projects
> ls -l /afs/cs.pitt.edu/projects
> ls -l /afs/.cs.pitt.edu/projects

Try 'fs checkvolumes'.
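
As a rough sketch of how that could be scripted: the function below runs
'fs checkvolumes' and then 'fs whereis' on each path, so you can see which
file server the client now believes holds it. The function name
refresh_and_locate and the FS override variable are just illustrative
assumptions, not anything that ships with OpenAFS.

```shell
#!/bin/sh
# Sketch only: refresh cached volume information, then report locations.
# FS is a hypothetical override so the 'fs' binary can be swapped out.
FS=${FS:-fs}

refresh_and_locate() {
    # Ask the cache manager to drop its cached volume mappings so the
    # next access looks the volume up in the VLDB again.
    "$FS" checkvolumes || return 1
    for p in "$@"; do
        # Show the file server the client thinks this path lives on.
        "$FS" whereis "$p" || return 1
    done
}
```

You would call it as 'refresh_and_locate /afs/cs.pitt.edu/projects/cast'
and check that the reported host is the new server.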

> Is there anything I can do, short of a client reboot, to fix this?
> Shouldn't AFS have a more graceful recovery in this kind of situation?
> Why doesn't the client see that the volume has moved to a new server?

The client could in theory recheck the VLDB in this scenario, but doing
that has its own problems: most of the time these errors are encountered
when the volume has not moved at all.

If this situation lasted for more than 2 hours, or survived an
'fs checkvolumes', that's a problem. For that, you can capture some
debug data like so:

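# Start from an empty trace log and set the log buffer size.
fstrace clear cm
fstrace setlog cmfx -buffers 1024
# Turn on tracing for the cache manager event set.
fstrace sets cm -active
# Reproduce the failure in the background and print its PID.
ls /afs/cs.pitt.edu/projects/cast &
echo $!
wait
# Save the trace, then turn tracing off again.
fstrace dump cm > /tmp/fstrace.log
fstrace sets cm -inactive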

-- 
Andrew Deason
adeason@sinenomine.net