[OpenAFS] Timeouts and odd behavior with 1.6.0 file servers

Jack Neely jjneely@pams.ncsu.edu
Wed, 25 Jan 2012 17:16:25 -0500


We are working our way through a migration from old Sun AFS hardware
running OpenAFS 1.4.11 to HP blades running RHEL 6 with OpenAFS 1.6.0.
At this point we have migrated most of our file servers.

With most of our volumes on 1.6.0 DAFS servers we have started to see
some odd behavior.  Our Subversion servers have kept SVN repos in AFS
for years, and we have not upgraded the Subversion software.  (The SVN
servers are RHEL 5 with OpenAFS 1.4.11.)  But now SVN often tells us:

    Transmitting file data ...svn: Commit failed (details follow):
    svn: database disk image is malformed

At this point we know that the SQLite databases in Subversion's fsfs
backend have become corrupt.
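
The "database disk image is malformed" text is SQLite's own corruption
error, so one way to confirm the damage independently of svn is to run
SQLite's integrity check directly.  Below is a minimal sketch in Python.
The function name and the idea of pointing it at a copy of the
repository's SQLite file (assumed here to be db/rep-cache.db) are
illustrative assumptions, not something taken from our setup:

```python
import sqlite3


def sqlite_integrity(db_path):
    """Run SQLite's PRAGMA integrity_check on db_path.

    Returns ["ok"] for a healthy database, a list of problem
    descriptions for a damaged one, or a single "error: ..." entry
    if SQLite cannot read the file as a database at all.
    Note: sqlite3.connect() creates an empty file if db_path is missing.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
        return [row[0] for row in rows]
    except sqlite3.DatabaseError as exc:
        return ["error: %s" % exc]
    finally:
        conn.close()
```

Running this against a copy of the file, rather than the live one in AFS,
avoids adding more locking traffic while a commit may be in progress.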

RHEL 6 / 1.6.0 clients on the wired network occasionally have long pauses
when doing AFS operations, such as running ls.  It may take 30 seconds
to a minute for the AFS server to respond, even though the datacenter is
just downstairs.  We are not seeing high load or any other sign on the
server that something is wrong.
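
To put timestamps on these pauses so they can be lined up against server
logs, a small probe that times a directory listing is enough.  This is
only a sketch: the function names, the threshold, and the polling
interval are assumptions chosen for illustration.

```python
import os
import time


def timed_listdir(path):
    """List a directory and return (elapsed_seconds, entry_count)."""
    start = time.monotonic()
    entries = os.listdir(path)
    return time.monotonic() - start, len(entries)


def watch(path, threshold=5.0, interval=60.0, rounds=10):
    """Probe path `rounds` times, sleeping `interval` seconds between probes.

    Returns a list of (timestamp, elapsed_seconds, entry_count) tuples,
    one for each probe that took at least `threshold` seconds.
    """
    stalls = []
    for i in range(rounds):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        elapsed, count = timed_listdir(path)
        if elapsed >= threshold:
            stalls.append((stamp, elapsed, count))
        if i + 1 < rounds:
            time.sleep(interval)
    return stalls
```

Calling watch() on a directory under /afs from an affected client would
give a list of stall times to compare with the file server's logs.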

The above applies as well to our web servers, which are RHEL 6 / 1.6.0.
Several times a week load on the web servers will suddenly spike and
rxdebug tells us that the RX calls to one of the AFS servers are all or
mostly in the reader_wait state.  Just as suddenly as it starts, it's over.

    call 0: # 5231, state active, mode: receiving, flags: reader_wait
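
To see how many calls are stuck this way during a spike, the rxdebug
output can be captured to text and tallied.  The sketch below is
hypothetical: the line layout it expects is taken only from the sample
line above, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches lines shaped like the sample:
#   call 0: # 5231, state active, mode: receiving, flags: reader_wait
CALL_RE = re.compile(
    r"call\s+(?P<slot>\d+):\s*#\s*(?P<number>\d+),\s*"
    r"state\s+(?P<state>[^,]+),\s*"
    r"mode:\s*(?P<mode>[^,]+)"
    r"(?:,\s*flags:\s*(?P<flags>.*))?$"
)


def parse_calls(text):
    """Return one dict per 'call N: ...' line found in text."""
    calls = []
    for line in text.splitlines():
        m = CALL_RE.search(line.strip())
        if not m:
            continue  # skip headers and anything not shaped like a call line
        raw_flags = m.group("flags") or ""
        calls.append({
            "slot": int(m.group("slot")),
            "number": int(m.group("number")),
            "state": m.group("state").strip(),
            "mode": m.group("mode").strip(),
            "flags": [f for f in re.split(r"[,\s]+", raw_flags) if f],
        })
    return calls


def flag_counts(calls):
    """Count how many calls carry each flag."""
    counts = Counter()
    for call in calls:
        counts.update(call["flags"])
    return counts
```

Comparing flag_counts(calls)["reader_wait"] with len(calls) across
several captures shows whether the stuck calls really are all or mostly
reader_wait.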

Our cron job that mirrors CPAN to AFS space now often fails with
timeout errors:

    readlink_stat("/afs/...") failed: Connection timed out (110)

Any one of these by itself could be just a fluke or a network glitch.  But
as time progresses we are starting to see a pattern emerge.  Any clues as
to what may be happening?

