[OpenAFS] Fileserver loses contact with itself

Tom Fitzgerald tfitz@MIT.EDU
Tue, 18 Nov 2003 18:05:19 -0500


I've got a situation where, under heavy load (backups plus
normal use), a fileserver loses contact with itself,
syslogging:

     Nov 18 15:55:44 cashel kernel: afs: Lost contact with file server 18.89.2.206 in cell soap.mit.edu (all multi-homed ip addresses down for the server)

... logged on 18.89.2.206 itself.  At that point, all access
to rw volumes on that fileserver becomes flaky, both local
and remote.  ls shows the mount points, but ls'ing a mount
point fails with "No such file or directory".  "vos exa"
reports that the volume is fine and not locked, and
"fs checks" reports "All servers are running".  "fs flushvol"
and "vos unlock" have no effect.

The problem persists for 5 minutes after the fileserver
process is restarted, then goes away with no further action.
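
To line the outages up against the backup window, it may help to pull
every "Lost contact" message out of syslog.  The following is only a
minimal sketch: the pattern is derived from the single log line quoted
above, so other client versions may word the message differently, and
the names LOST_CONTACT and find_lost_contact are made up for
illustration.

```python
import re

# Pattern based on the one syslog line quoted in this message; the exact
# wording in other AFS client versions is an assumption.
LOST_CONTACT = re.compile(
    r"afs: Lost contact with file server\s+(?P<ip>\d{1,3}(?:\.\d{1,3}){3})"
    r"\s+in cell\s+(?P<cell>\S+)"
)


def find_lost_contact(lines):
    """Return an (ip, cell) pair for each 'Lost contact' message found."""
    hits = []
    for line in lines:
        match = LOST_CONTACT.search(line)
        if match:
            hits.append((match.group("ip"), match.group("cell")))
    return hits


# The log line from this report, joined back onto a single line.
sample = (
    "Nov 18 15:55:44 cashel kernel: afs: Lost contact with file server "
    "18.89.2.206 in cell soap.mit.edu (all multi-homed ip addresses down "
    "for the server)"
)

print(find_lost_contact([sample]))
```

Feeding it the lines of the syslog file instead of the sample gives
one entry per outage, which can then be compared with the backup
schedule.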

This is with OpenAFS 1.2.9 on a heavily modified Red Hat 9
system running a Linux 2.4.20 kernel.

Any help would be appreciated.  If upgrading to 1.2.10
has a definite chance of fixing this, I'll do it, but
I was hoping to avoid that.