[OpenAFS] One DB server of three going offline kills 1.6 clients ("Waiting for busy volume")

Jeff White jaw171@pitt.edu
Mon, 30 Jul 2012 13:46:10 -0400


I recently built two RHEL 6.3 x64 systems with 1.6.1-1 (compiled from 
the src.rpm) and they consistently have issues when one of our DB 
servers (running 1.2.11) is brought down for a cold backup of the AFS 
databases.  Our older clients (1.4.14.1-1 and below) do not have this issue.

We have three DB servers (afs09, afs10, afs11) with afs09 as the 
master.  Sunday at 4:05 AM a script run to stop the AFS DB processes on 
afs11 and tar the DB files then start the processes again.  When this 
happens our new 1.6.1 clients hang and begin spewing a large number of 
these errors:

Jul 29 04:00:27 <kern.warning> ewi-afs-prod0 kernel: afs: Waiting for 
busy volume 1937412136 () in cell pitt.edu

Sometimes it is able to determine the volume name, sometimes not.  When 
this happen I cannot access anything in our AFS cell on the failing 
client, even after a reboot.  The one DB server is down only for a 
minute yet the issues continue after the DB server is back up.

So, a few questions:

Has anyone seen this behavior before when one DB server becomes 
inaccessible but other DB servers are available?
Is there anything I can do to troubleshoot the issue to help determine 
what is casing it?
If a client is talking to a particular DB server and the remote system 
stops responding, will the client silently move on to trying a different 
DB server or is it sticky to the same server and keep trying to talk to it?

I would hope that the last part of that is not true.  It should work 
like DNS by trying every DB server in sequence and only returning an 
error once all servers have failed.




Jul 29 04:22:31 <kern.err> ewi-afs-prod0 kernel: INFO: task httpd:1542 
blocked for more than 120 seconds.
Jul 29 04:22:31 <kern.err> ewi-afs-prod0 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 29 04:22:31 <kern.info> ewi-afs-prod0 kernel: httpd         D 
0000000000000000     0  1542   1535 0x00000080
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  ffff88013b8f3ba8 
0000000000000082 ffff88013b8f3c38 ffff880139bc1000
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  ffff88013b8f3b68 
ffffffffa02db742 ffff880137a9eae0 ffff880137a9eae0
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  ffff880137a9f098 
ffff88013b8f3fd8 000000000000fb88 ffff880137a9f098
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel: Call Trace:
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffffa02db742>] ? afs_FindVCache+0xe2/0x5b0 [openafs]
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffff814fee9e>] __mutex_lock_slowpath+0x13e/0x180
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffffa02e1a81>] ? afs_access+0x181/0x730 [openafs]
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffff814fed3b>] mutex_lock+0x2b/0x50
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffff8118957b>] do_lookup+0x11b/0x230
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffff8118999d>] __link_path_walk+0x20d/0x1030
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffff8105b483>] ? perf_event_task_sched_out+0x33/0x80
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffff8118aa4a>] path_walk+0x6a/0xe0
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffff8118ac1b>] do_path_lookup+0x5b/0xa0
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffff8118b887>] user_path_at+0x57/0xa0
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffff81148641>] ? unlink_anon_vmas+0x71/0xd0
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffff811804bc>] vfs_fstatat+0x3c/0x80
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffff8118062b>] vfs_stat+0x1b/0x20
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffff81180654>] sys_newstat+0x24/0x50
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffff810d69f2>] ? audit_syscall_entry+0x272/0x2a0
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

-- 
Jeff White - GNU+Linux Systems Engineer
University of Pittsburgh - CSSD