[OpenAFS] One DB server of three going offline kills 1.6 clients ("Waiting for busy volume")

Derrick Brashear shadow@gmail.com
Mon, 6 Aug 2012 11:05:04 -0400


Resolved: the actual issue was the weekly fileserver restart; the
dbservers were not relevant.
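
One detail in the quoted report fits this: the sample "Waiting for busy
volume" message is stamped 04:00:27, while the backup script is described
as running at 4:05 AM. Below is a minimal sketch of that comparison. The
year 2012 and the assumption that both times come from comparable clocks
are mine, not from the report, and the helper name is hypothetical.

```python
from datetime import datetime

# Month abbreviations as they appear in syslog stamps (avoids locale-dependent parsing).
MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

def parse_syslog_stamp(stamp: str, year: int) -> datetime:
    """Parse a 'Mon DD HH:MM:SS' syslog stamp; syslog omits the year, so it is passed in."""
    month, day, clock = stamp.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    return datetime(year, MONTHS[month], int(day), hour, minute, second)

# Stamp copied from the quoted kernel message; year 2012 is an assumption.
sample_error = parse_syslog_stamp("Jul 29 04:00:27", 2012)
# Backup time as described in the report (Sunday, 4:05 AM).
backup_start = datetime(2012, 7, 29, 4, 5)

error_precedes_backup = sample_error < backup_start
minutes_before = (backup_start - sample_error).total_seconds() / 60
```

If the stamps are trustworthy, the client was already waiting on a busy
volume before afs11 was stopped, which points away from the DB servers.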

On Mon, Jul 30, 2012 at 1:46 PM, Jeff White <jaw171@pitt.edu> wrote:
> I recently built two RHEL 6.3 x64 systems with 1.6.1-1 (compiled from the
> src.rpm) and they consistently have issues when one of our DB servers
> (running 1.2.11) is brought down for a cold backup of the AFS databases.
> Our older clients (1.4.14.1-1 and below) do not have this issue.
>
> We have three DB servers (afs09, afs10, afs11) with afs09 as the master.
> Sunday at 4:05 AM a script runs to stop the AFS DB processes on afs11, tar
> the DB files, and then start the processes again.  When this happens our new
> 1.6.1 clients hang and begin spewing a large number of these errors:
>
> Jul 29 04:00:27 <kern.warning> ewi-afs-prod0 kernel: afs: Waiting for busy
> volume 1937412136 () in cell pitt.edu
>
> Sometimes it is able to determine the volume name, sometimes not.  When this
> happens I cannot access anything in our AFS cell on the failing client, even
> after a reboot.  The one DB server is down for only a minute, yet the issues
> continue after the DB server is back up.
>
> So, a few questions:
>
> Has anyone seen this behavior before when one DB server becomes inaccessible
> but other DB servers are available?
> Is there anything I can do to troubleshoot the issue to help determine what
> is causing it?
> If a client is talking to a particular DB server and the remote system stops
> responding, will the client silently move on to trying a different DB server,
> or is it sticky to the same server and keeps trying to talk to it?
>
> I would hope that the last part of that is not true.  It should work like
> DNS by trying every DB server in sequence and only returning an error once
> all servers have failed.
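
The behaviour hoped for above (try each DB server in turn, report an error
only once all have failed) can be sketched as follows. This is an
illustration of the idea only, not the OpenAFS client's actual code; the
function names, the exception class, and the stand-in lookup are all
hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")

class AllServersFailed(Exception):
    """Raised only after every server in the list has been tried."""

def query_with_failover(servers: Sequence[str], query: Callable[[str], T]) -> T:
    """Try each server in order; return the first successful answer."""
    errors: List[Tuple[str, Exception]] = []
    for server in servers:
        try:
            return query(server)
        except OSError as exc:  # covers ConnectionError and TimeoutError
            errors.append((server, exc))  # remember the failure, move on
    raise AllServersFailed(errors)

def fake_lookup(server: str) -> str:
    """Hypothetical stand-in for a DB query: afs11 is down, the others answer."""
    if server == "afs11":
        raise ConnectionError("afs11 is down for backup")
    return f"answer from {server}"

# afs11 is tried first and fails; the lookup falls through to afs09.
result = query_with_failover(["afs11", "afs09", "afs10"], fake_lookup)
```

With the server list ordered this way the caller never sees the afs11
failure; an error surfaces only if every entry in the list fails.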
>
>
>
>
> Jul 29 04:22:31 <kern.err> ewi-afs-prod0 kernel: INFO: task httpd:1542
> blocked for more than 120 seconds.
> Jul 29 04:22:31 <kern.err> ewi-afs-prod0 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jul 29 04:22:31 <kern.info> ewi-afs-prod0 kernel: httpd         D
> 0000000000000000     0  1542   1535 0x00000080
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  ffff88013b8f3ba8
> 0000000000000082 ffff88013b8f3c38 ffff880139bc1000
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  ffff88013b8f3b68
> ffffffffa02db742 ffff880137a9eae0 ffff880137a9eae0
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  ffff880137a9f098
> ffff88013b8f3fd8 000000000000fb88 ffff880137a9f098
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel: Call Trace:
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffffa02db742>] ?
> afs_FindVCache+0xe2/0x5b0 [openafs]
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffff814fee9e>]
> __mutex_lock_slowpath+0x13e/0x180
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffffa02e1a81>] ?
> afs_access+0x181/0x730 [openafs]
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffff814fed3b>]
> mutex_lock+0x2b/0x50
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffff8118957b>]
> do_lookup+0x11b/0x230
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffff8118999d>]
> __link_path_walk+0x20d/0x1030
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffff8105b483>] ?
> perf_event_task_sched_out+0x33/0x80
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffff8118aa4a>]
> path_walk+0x6a/0xe0
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffff8118ac1b>]
> do_path_lookup+0x5b/0xa0
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffff8118b887>]
> user_path_at+0x57/0xa0
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffff81148641>] ?
> unlink_anon_vmas+0x71/0xd0
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffff811804bc>]
> vfs_fstatat+0x3c/0x80
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffff8118062b>]
> vfs_stat+0x1b/0x20
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffff81180654>]
> sys_newstat+0x24/0x50
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffff810d69f2>] ?
> audit_syscall_entry+0x272/0x2a0
> Jul 29 04:22:31 <kern.warning> ewi-afs-prod0 kernel:  [<ffffffff8100b0f2>]
> system_call_fastpath+0x16/0x1b
>
> --
> Jeff White - GNU+Linux Systems Engineer
> University of Pittsburgh - CSSD
>
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info



-- 
Derrick