[OpenAFS] AFS client hangs indefinitely following backup

Sat, 19 Jun 2004 22:43:02 -0400 (EDT)

On Fri, 18 Jun 2004, Charles Karney wrote:

> We have encountered problems with our clients hanging on AFS accesses.
> Yesterday I think I engineered a more-or-less reproducible set of
> circumstances that triggers this.

I'll guess it's the same problem the person with the SuSE 9 machine
complained of, namely that NPTL and the fileserver are feuding.

Standard answers apply:
try LD_ASSUME_KERNEL=2.4.1, or failing that, try the LWP fileserver from
src/viced/fileserver.

> Configuration:
>
>   RH Linux 9 (clients and servers)
>   openafs 1.2.11-rh9
>
> Symptom
>
>   During and following a full backup of large (~30GB) volumes, clients
>   hang (e.g., in 'ls').
>
>   After the backup
>
>     bos status
>     fs checkserv
>
>   both indicate that the servers are up and accessible.
>
>   Clients unfreeze when the server with the large volumes is restarted
>   (bos restart...)
>
> Details
>
> We run a small AFS cell with ~30 volumes some of which are large (5GB to
> 30GB).  We have 3 servers, several client machines, and two human AFS
> users.  We do backups in the "approved" way, namely
>
>     vos backupsys ...
>
> followed by
>
>     backup dump
>
> of the volumes *.backup.  During the course of the backup of the largest
> volumes I see several messages of form
>
> Thu Jun 17 20:05:08 2004 trans 24 on volume 536871469 is older than 1200 seconds
>
> in VolserLog, where 536871469 is the ID of the BK version of one of the
> large volumes.  There are no other indications of problems in the server
> logs.  During the backup there was no AFS client activity.
>
> I let the backup run to completion.  At this point, some clients now
> freeze on accessing AFS.
>
> I can get the clients to unfreeze by restarting the AFS server with
> the large volumes (bos restart server -all).
>
> Notes:
>
>   The freeze was associated with accessing RW and RO volumes (not
>   necessarily the recently locked AFS volumes).
>
>   No "lost contact with file server" messages in log files on client.
>
>   fs checkserv says "All servers are running".
>
>   bos status shows all 3 AFS servers are OK.
>
>   No "connection timed out" message on the client.
>
> It seems that one or more of the servers have ended up in an
> "unresponsive" mode during the backup, even though all the normal
> diagnostics claim that they are all running OK.
>
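
Since bos status and fs checkserv both report success while clients are
stuck, a probe that times a real AFS access can record when the hang
begins. What follows is only a minimal sketch, not an OpenAFS tool: the
function name probe, the default timeout, and whatever path gets passed
to ls are assumptions. The child is deliberately not waited for after
the kill, because a process blocked inside the kernel on AFS may not die
promptly.

```python
import subprocess
import time


def probe(argv, timeout=10.0):
    """Run argv and classify the result as "ok", "failed", or "hung".

    Returns (state, seconds). "hung" means the command had not finished
    within `timeout` seconds.
    """
    start = time.monotonic()
    child = subprocess.Popen(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        status = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: do not wait for the child afterwards, since a
        # process stuck in the kernel could block this script as well.
        child.kill()
        return ("hung", timeout)
    elapsed = time.monotonic() - start
    return ("ok" if status == 0 else "failed", elapsed)
```

For example, probe(["ls", "/afs/your.cell/some/dir"]) run once a minute
during the backup (the path here is a placeholder) would show at what
point accesses stop returning.
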
> Other information:
>
>   This isn't an easy problem to diagnose since the full backup takes ~3
>   hours and I don't like to endlessly clobber our AFS setup.
>
>   Sometimes in similar circumstances, I DO get the "lost contact with
>   file server" but I don't get the "back up" message.  In this case "fs
>   checkserv" agrees that one of the servers is down, but "bos status"
>   claims that it's up.  Again restarting some or all of the servers
>   appears to be necessary.
>
>   Similar circumstances = full backups, moving a large volume, "vos
>   backup" on a large volume.  The common thread appears to be the
>   presence of the
>
>     trans xx on volume nnnnnn is older than yyyy seconds
>
>   messages in VolserLog.
>
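
To see which volumes and transactions those messages point at, the log
can be scanned mechanically. A rough sketch follows, assuming every such
line has exactly the format quoted above; the function names are made up
for illustration, and the log path is left to the caller because it
depends on how the server was installed.

```python
import re

# Matches lines such as:
# Thu Jun 17 20:05:08 2004 trans 24 on volume 536871469 is older than 1200 seconds
STALE_TRANS = re.compile(
    r"^(?P<when>\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}) "
    r"trans (?P<trans>\d+) on volume (?P<volume>\d+) "
    r"is older than (?P<age>\d+) seconds"
)


def stale_transactions(lines):
    """Yield (timestamp, trans_id, volume_id, age_seconds) per matching line."""
    for line in lines:
        m = STALE_TRANS.match(line.strip())
        if m:
            yield (m["when"], int(m["trans"]), int(m["volume"]), int(m["age"]))


def worst_age_by_volume(lines):
    """Map each volume ID to the largest reported transaction age."""
    worst = {}
    for _when, _trans, volume, age in stale_transactions(lines):
        worst[volume] = max(age, worst.get(volume, 0))
    return worst


def report(path):
    """Print one summary line per volume found in the log file at `path`."""
    with open(path) as log:
        for volume, age in sorted(worst_age_by_volume(log).items()):
            print(f"volume {volume}: transaction older than {age} seconds")
```

Calling report() on the server's VolserLog after a backup lists each
affected volume once, with the largest age reported for it.
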
>   We have iptables firewalling in effect.  On the clients
>
>     [0:0] -A trust -p udp -m udp --sport afs3-fileserver
>           --dport afs3-callback -j ACCEPT
>     [0:0] -A trust -p tcp -m tcp --sport afs3-fileserver
>           --dport afs3-callback -j ACCEPT
>
>   On the servers
>
>     [0:0] -A trust -p udp -m udp --dport 88 -j ACCEPT
>     [0:0] -A trust -p tcp -m tcp --dport 88 -j ACCEPT
>     [0:0] -A trust -p udp -m udp --dport 750:751 -j ACCEPT
>     [0:0] -A trust -p tcp -m tcp --dport 750:751 -j ACCEPT
>     [0:0] -A trust -p udp -m udp --dport 7000:7009 -j ACCEPT
>     [0:0] -A trust -p tcp -m tcp --dport 7000:7009 -j ACCEPT
>     [0:0] -A trust -p udp -m udp --dport 7021 -j ACCEPT
>     [0:0] -A trust -p tcp -m tcp --dport 7021 -j ACCEPT
>     [0:0] -A trust -p udp -m udp --dport 7025:7027 -j ACCEPT
>     [0:0] -A trust -p tcp -m tcp --dport 7025:7027 -j ACCEPT
>
> Any advice on how to cure or to diagnose this problem would be
> appreciated.  Thanks.
>
> --
> Charles Karney                  Email:  ckarney@sarnoff.com
> 201 Washington Rd               URL:    http://charles.karney.info
> Sarnoff Corporation             Phone:  +1 609 734 2312
> Princeton, NJ 08543-5300        Fax:    +1 609 734 2323