[OpenAFS] AFS client hangs indefinitely following backup

Fri, 18 Jun 2004 13:54:42 -0400

We have encountered problems with our clients hanging on AFS accesses.
Yesterday I think I engineered a more-or-less reproducible set of
circumstances that triggers this.

Configuration:

  RH Linux 9 (clients and servers)
  openafs 1.2.11-rh9

Symptom

  During and following a full backup of large (~30GB) volumes, clients
  hang (e.g., in 'ls').

  After the backup

    bos status
    fs checkserv

  both indicate that the servers are up and accessible.

  Clients unfreeze when the server with the large volumes is restarted
  (bos restart...)

Details

We run a small AFS cell with ~30 volumes, some of which are large (5GB to
30GB).  We have 3 servers, several client machines, and two human AFS
users.  We do backups in the "approved" way, namely

    vos backupsys ...

followed by

    backup dump

of the volumes *.backup.  During the backup of the largest volumes I see
several messages of the form

Thu Jun 17 20:05:08 2004 trans 24 on volume 536871469 is older than 1200 seconds

in VolserLog, where 536871469 is the ID of the BK version of one of the
large volumes.  There are no other indications of problems in the server
logs.  During the backup there was no AFS client activity.

I let the backup run to completion.  At this point, some clients
freeze on accessing AFS.

I can get the clients to unfreeze by restarting the AFS server with
the large volumes (bos restart server -all).

Notes:

  The freeze was associated with accessing RW and RO volumes (not
  necessarily the recently locked AFS volumes).

  No "lost contact with file server" messages in log files on client.

  fs checkserv says "All servers are running".

  bos status shows all 3 AFS servers are OK.

  No "connection timed out" message on the client.

It seems that one or more of the servers have ended up in an
"unresponsive" mode during the backup, even though all the normal
diagnostics claim that they are running OK.
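
Since "bos status" and "fs checkserv" both report healthy servers while
clients hang, one way to catch the hang itself is to time a real file
access.  The following is only a minimal sketch: the function name
probe_path, the choice of 'ls' as the probe, and the 10-second timeout
are all assumptions for illustration, and the path to test (for example
a directory in one of the affected volumes) has to be supplied by the
caller.

```python
import subprocess


def probe_path(path, timeout_s=10.0):
    """Run 'ls' on path and report 'ok', 'error', or 'hung'.

    probe_path and timeout_s are hypothetical names for this sketch.
    """
    proc = subprocess.Popen(
        ["ls", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked inside the kernel may not
        # exit on SIGKILL, so we deliberately do not wait for it here.
        proc.kill()
        return "hung"
    return "ok" if rc == 0 else "error"
```

Popen with wait(timeout=...) is used rather than subprocess.run because
run waits for the child to exit after killing it, which could itself
block if the child is stuck in an uninterruptible AFS access.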

Other information:

  This isn't an easy problem to diagnose since the full backup takes ~3
  hours and I don't like to endlessly clobber our AFS setup.

  Sometimes in similar circumstances, I DO get the "lost contact with
  file server" message but I don't get the "back up" message.  In this
  case "fs checkserv" agrees that one of the servers is down, but "bos
  status" claims that it's up.  Again, restarting some or all of the
  servers appears to be necessary.

  Similar circumstances = full backups, moving a large volume, "vos
  backup" on a large volume.  The common thread appears to be the
  presence of the 

    trans xx on volume nnnnnn is older than yyyy seconds

  messages in VolserLog.
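
  To correlate hangs with these messages, a small script can pull them
  out of VolserLog.  This is a minimal sketch based only on the single
  log line quoted earlier; the function name stale_transactions is
  hypothetical, and the regular expression assumes every such message
  has the same layout as that line.

```python
import re

# Matches lines such as:
# Thu Jun 17 20:05:08 2004 trans 24 on volume 536871469 is older than 1200 seconds
STALE_TRANS = re.compile(
    r"^(?P<stamp>.+?) trans (?P<trans>\d+) on volume (?P<volume>\d+) "
    r"is older than (?P<age>\d+) seconds"
)


def stale_transactions(lines):
    """Yield (timestamp, trans_id, volume_id, age_seconds) per matching line.

    stale_transactions is a hypothetical helper for this sketch.
    """
    for line in lines:
        m = STALE_TRANS.match(line.strip())
        if m:
            yield (
                m.group("stamp"),
                int(m.group("trans")),
                int(m.group("volume")),
                int(m.group("age")),
            )
```

  It can be fed an open log file directly, e.g.
  "for rec in stale_transactions(open(path)): print(rec)", where path is
  the location of VolserLog on the server in question.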

  We have iptables firewalling in effect.  On the clients

    [0:0] -A trust -p udp -m udp --sport afs3-fileserver
          --dport afs3-callback -j ACCEPT
    [0:0] -A trust -p tcp -m tcp --sport afs3-fileserver
          --dport afs3-callback -j ACCEPT

  On the servers

    [0:0] -A trust -p udp -m udp --dport 88 -j ACCEPT
    [0:0] -A trust -p tcp -m tcp --dport 88 -j ACCEPT
    [0:0] -A trust -p udp -m udp --dport 750:751 -j ACCEPT
    [0:0] -A trust -p tcp -m tcp --dport 750:751 -j ACCEPT
    [0:0] -A trust -p udp -m udp --dport 7000:7009 -j ACCEPT
    [0:0] -A trust -p tcp -m tcp --dport 7000:7009 -j ACCEPT
    [0:0] -A trust -p udp -m udp --dport 7021 -j ACCEPT
    [0:0] -A trust -p tcp -m tcp --dport 7021 -j ACCEPT
    [0:0] -A trust -p udp -m udp --dport 7025:7027 -j ACCEPT
    [0:0] -A trust -p tcp -m tcp --dport 7025:7027 -j ACCEPT

Any advice on how to cure or to diagnose this problem would be
appreciated.  Thanks.

-- 
Charles Karney                  Email:  ckarney@sarnoff.com
201 Washington Rd               URL:    http://charles.karney.info
Sarnoff Corporation             Phone:  +1 609 734 2312
Princeton, NJ 08543-5300        Fax:    +1 609 734 2323