[OpenAFS] AFS client hangs indefinitely following backup
Derrick J Brashear
shadow@dementia.org
Sat, 19 Jun 2004 22:43:02 -0400 (EDT)
On Fri, 18 Jun 2004, Charles Karney wrote:
> We have encountered problems with our clients hanging on AFS accesses.
> Yesterday I think I engineered a more-or-less repeatable set of
> circumstances to reproduce this.
I'll guess it's the same problem the person with the SuSE 9 machine
complained of, namely that NPTL and the fileserver are feuding.
Standard answers apply:
try LD_ASSUME_KERNEL=2.4.1, or failing that, try the LWP fileserver from
src/viced/fileserver.
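
For the first suggestion, the variable has to be in the environment of the
server processes themselves, so it should be set before bosserver starts
(the processes it launches should inherit it). A minimal sketch in Python,
assuming a hypothetical start-up wrapper; the function names are
illustrative and not part of OpenAFS:

```python
import os
import subprocess

# 2.4.1 is the value suggested above. Where the installed glibc honors
# LD_ASSUME_KERNEL, it selects the older LinuxThreads libraries over NPTL.
ASSUMED_KERNEL = "2.4.1"


def env_with_assumed_kernel(base_env=None, version=ASSUMED_KERNEL):
    """Return a copy of base_env with LD_ASSUME_KERNEL set to version."""
    env = dict(os.environ if base_env is None else base_env)
    env["LD_ASSUME_KERNEL"] = version
    return env


def start_with_assumed_kernel(argv):
    """Start argv (for example the bosserver command line) with the variable set."""
    return subprocess.Popen(argv, env=env_with_assumed_kernel())
```

Something like start_with_assumed_kernel(["/usr/afs/bin/bosserver"]) would
then replace the plain start-up command; that path is an assumption and
depends on how your packages lay things out.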
> Configuration:
>
> RH Linux 9 (clients and servers)
> openafs 1.2.11-rh9
>
> Symptom
>
> During and following a full backup of large (~30GB) volumes, clients
> hang (e.g., in 'ls').
>
> After the backup
>
> bos status
> fs checkserv
>
> both indicate that the servers are up and accessible.
>
> Clients unfreeze when server with the large volumes is restarted
> (bos restart...)
>
> Details
>
> We run a small AFS cell with ~30 volumes some of which are large (5GB to
> 30GB). We have 3 servers, several client machines, and two human AFS
> users. We do backups in the "approved" way, namely
>
> vos backupsys ...
>
> followed by
>
> backup dump
>
> of the volumes *.backup. During the course of the backup of the largest
> volumes I see several messages of form
>
> Thu Jun 17 20:05:08 2004 trans 24 on volume 536871469 is older than 1200 seconds
>
> in VolserLog, where 536871469 is the ID of the BK version of one of the
> large volumes. There are no other indications of problems in the server
> logs. During the backup there was no AFS client activity.
>
> I let the backup run to completion. At this point, some clients
> freeze on accessing AFS.
>
> I can get the clients to unfreeze by restarting the AFS server with
> the large volumes (bos restart server -all).
>
> Notes:
>
> The freeze was associated with accessing RW and RO volumes (not
> necessarily the recently locked AFS volumes).
>
> No "lost contact with file server" messages in log files on client.
>
> fs checkserv says "All servers are running".
>
> bos status shows all 3 AFS servers are OK.
>
> No "connection timed out" message on the client.
>
> It seems that one or more of the servers have ended up in an
> "unresponsive" mode during the backup, even though all the normal
> diagnostics claim that they are all running OK.
>
> Other information:
>
> This isn't an easy problem to diagnose since the full backup takes ~3
> hours and I don't like to endlessly clobber our AFS setup.
>
> Sometimes in similar circumstances, I DO get the "lost contact with
> file server" but I don't get the "back up" message. In this case "fs
> checkserv" agrees that one of the servers is down, but "bos status"
> claims that it's up. Again restarting some or all of the servers
> appears to be necessary.
>
> Similar circumstances = full backups, moving a large volume, "vos
> backup" on a large volume. The common thread appears to be the
> presence of the
>
> trans xx on volume nnnnnn is older than yyyy seconds
>
> messages in VolserLog.
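
Since those messages are the common thread, it may help to pull them out of
VolserLog and see which volumes are involved and how old the transactions
get. A rough sketch in Python, assuming the log lines always have the shape
quoted above; the function name is made up for illustration:

```python
import re
from collections import defaultdict

# Matches lines shaped like the one quoted above:
#   Thu Jun 17 20:05:08 2004 trans 24 on volume 536871469 is older than 1200 seconds
STALE_TRANS = re.compile(
    r"^(?P<when>.+?) trans (?P<trans>\d+) on volume (?P<volume>\d+) "
    r"is older than (?P<age>\d+) seconds"
)


def stale_transactions(lines):
    """Map volume ID -> largest reported transaction age in seconds."""
    oldest = defaultdict(int)
    for line in lines:
        match = STALE_TRANS.match(line.strip())
        if match:
            volume = int(match.group("volume"))
            oldest[volume] = max(oldest[volume], int(match.group("age")))
    return dict(oldest)
```

Fed the single line quoted above, this returns {536871469: 1200}. In
practice you would pass it the open VolserLog file, wherever your
installation keeps its server logs.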
>
> We have iptables firewalling in effect. On the clients
>
> [0:0] -A trust -p udp -m udp --sport afs3-fileserver --dport afs3-callback -j ACCEPT
> [0:0] -A trust -p tcp -m tcp --sport afs3-fileserver --dport afs3-callback -j ACCEPT
>
> On the servers
>
> [0:0] -A trust -p udp -m udp --dport 88 -j ACCEPT
> [0:0] -A trust -p tcp -m tcp --dport 88 -j ACCEPT
> [0:0] -A trust -p udp -m udp --dport 750:751 -j ACCEPT
> [0:0] -A trust -p tcp -m tcp --dport 750:751 -j ACCEPT
> [0:0] -A trust -p udp -m udp --dport 7000:7009 -j ACCEPT
> [0:0] -A trust -p tcp -m tcp --dport 7000:7009 -j ACCEPT
> [0:0] -A trust -p udp -m udp --dport 7021 -j ACCEPT
> [0:0] -A trust -p tcp -m tcp --dport 7021 -j ACCEPT
> [0:0] -A trust -p udp -m udp --dport 7025:7027 -j ACCEPT
> [0:0] -A trust -p tcp -m tcp --dport 7025:7027 -j ACCEPT
>
> Any advice on how to cure or to diagnose this problem would be
> appreciated. Thanks.
>
> --
> Charles Karney Email: ckarney@sarnoff.com
> 201 Washington Rd URL: http://charles.karney.info
> Sarnoff Corporation Phone: +1 609 734 2312
> Princeton, NJ 08543-5300 Fax: +1 609 734 2323