[OpenAFS] One DB server of three going offline kills 1.6 clients
("Waiting for busy volume")
Jeff White
jaw171@pitt.edu
Mon, 30 Jul 2012 13:46:10 -0400
I recently built two RHEL 6.3 x64 systems with 1.6.1-1 (compiled from
the src.rpm) and they consistently have issues when one of our DB
servers (running 1.2.11) is brought down for a cold backup of the AFS
databases. Our older clients (1.4.14.1-1 and below) do not have this issue.
We have three DB servers (afs09, afs10, afs11) with afs09 as the
master. Sunday at 4:05 AM a script run to stop the AFS DB processes on
afs11 and tar the DB files then start the processes again. When this
happens our new 1.6.1 clients hang and begin spewing a large number of
these errors:
Jul 29 04:00:27 <kern.warning> ewi-afs-prod0 kernel: afs: Waiting for
busy volume 1937412136 () in cell pitt.edu
Sometimes it is able to determine the volume name, sometimes not. When
this happen I cannot access anything in our AFS cell on the failing
client, even after a reboot. The one DB server is down only for a
minute yet the issues continue after the DB server is back up.
So, a few questions:
Has anyone seen this behavior before when one DB server becomes
inaccessible but other DB servers are available?
Is there anything I can do to troubleshoot the issue to help determine
what is casing it?
If a client is talking to a particular DB server and the remote system
stops responding, will the client silently move on to trying a different
DB server or is it sticky to the same server and keep trying to talk to it?
I would hope that the last part of that is not true. It should work
like DNS by trying every DB server in sequence and only returning an
error once all servers have failed.
Jul 29 04:22:31 <kern.err> ewi-afs-prod0 kernel: INFO: task httpd:1542
blocked for more than 120 seconds.
Jul 29 04:22:31 <kern.err> ewi-afs-prod0 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 29 04:22:31 <kern.info> ewi-afs-prod0 kernel: httpd D
0000000000000000 0 1542 1535 0x00000080
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel: ffff88013b8f3ba8
0000000000000082 ffff88013b8f3c38 ffff880139bc1000
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel: ffff88013b8f3b68
ffffffffa02db742 ffff880137a9eae0 ffff880137a9eae0
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel: ffff880137a9f098
ffff88013b8f3fd8 000000000000fb88 ffff880137a9f098
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel: Call Trace:
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffffa02db742>] ? afs_FindVCache+0xe2/0x5b0 [openafs]
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffff814fee9e>] __mutex_lock_slowpath+0x13e/0x180
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffffa02e1a81>] ? afs_access+0x181/0x730 [openafs]
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffff814fed3b>] mutex_lock+0x2b/0x50
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffff8118957b>] do_lookup+0x11b/0x230
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffff8118999d>] __link_path_walk+0x20d/0x1030
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffff8105b483>] ? perf_event_task_sched_out+0x33/0x80
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffff8118aa4a>] path_walk+0x6a/0xe0
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffff8118ac1b>] do_path_lookup+0x5b/0xa0
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffff8118b887>] user_path_at+0x57/0xa0
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffff81148641>] ? unlink_anon_vmas+0x71/0xd0
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffff811804bc>] vfs_fstatat+0x3c/0x80
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffff8118062b>] vfs_stat+0x1b/0x20
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffff81180654>] sys_newstat+0x24/0x50
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffff810d69f2>] ? audit_syscall_entry+0x272/0x2a0
Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
--
Jeff White - GNU+Linux Systems Engineer
University of Pittsburgh - CSSD