[OpenAFS] Timeouts and odd behavior with 1.6.0 file servers

Russ Allbery rra@stanford.edu
Wed, 25 Jan 2012 14:22:26 -0800

Jack Neely <jjneely@pams.ncsu.edu> writes:

> We are working our way through a migration from old Sun AFS hardware
> running OpenAFS 1.4.11 to HP Blades running RHEL 6 with OpenAFS 1.6.0.
> At this point we've completed most of our file servers.

Don't use the 1.6.0 file server.  It has a data corruption problem when
you have an inode clone (such as a backup volume or a migration clone) and
directories are moved with mv.  This is fixed in 1.6.1pre1 and in the
Debian 1.6.0-3 packages.  This may be what you're running into.

> RHEL 6 / 1.6.0 clients wired into the network occasionally have long
> pauses when doing AFS operations, such as running ls.  It may take 30
> seconds to a minute for the AFS server (the datacenter is downstairs) to
> respond.  We are not seeing high load or any signs on the server that
> something is wrong.

> The above applies as well to our web servers that are RHEL 6 / 1.6.0.
> Several times a week load on the web servers will suddenly spike and
> rxdebug tells us that RX calls to one of the AFS servers are all/mostly
> in the reader_wait state.  Just as suddenly as it starts, it's over.

>     call 0: # 5231, state active, mode: receiving, flags: reader_wait

> Our cron job that mirrors CPAN to AFS space now often fails with
> timeout errors.

>     readlink_stat("/afs/...") failed: Connection timed out (110)

Yes, this is consistent with the problems that we're seeing on our web
servers with OpenAFS 1.4 as well, which are probably due at least in part
to the pathological idledead interactions with the way that server threads
can back up waiting for vnode locks.  1.6.1pre2 (coming shortly) has both
client and server fixes for the idledead part of this.
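
If you want to see how many calls are stuck in reader_wait during one of
these spikes, something like the following rough sketch could tally saved
rxdebug output.  It assumes only the line shape shown in the sample quoted
above; the function name summarize_calls and the regular expression are
illustrative and not part of OpenAFS, and other rxdebug versions may format
these lines differently.

```python
import re
from collections import Counter

# Assumed line shape, taken from the sample quoted in this thread:
#   call 0: # 5231, state active, mode: receiving, flags: reader_wait
# The mode and flags fields are treated as optional.
CALL_RE = re.compile(
    r"call\s+(?P<slot>\d+):\s+#\s*(?P<number>\d+),\s+state\s+(?P<state>[^,]+)"
    r"(?:,\s+mode:\s+(?P<mode>[^,]+))?"
    r"(?:,\s+flags:\s+(?P<flags>.+))?"
)


def summarize_calls(text):
    """Return (states, flags) Counters for call lines in rxdebug-style text."""
    states = Counter()
    flags = Counter()
    for line in text.splitlines():
        match = CALL_RE.search(line)
        if match is None:
            continue  # not a call line; skip it
        states[match.group("state").strip()] += 1
        raw_flags = match.group("flags") or ""
        # Accept flags separated by commas or whitespace.
        for flag in re.split(r"[,\s]+", raw_flags.strip()):
            if flag:
                flags[flag] += 1
    return states, flags


if __name__ == "__main__":
    sample = "    call 0: # 5231, state active, mode: receiving, flags: reader_wait\n"
    states, flags = summarize_calls(sample)
    print("states:", dict(states))
    print("flags:", dict(flags))
```

Running this against output captured every few seconds would show whether
the reader_wait count climbs while the load spike is in progress.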

Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>