[OpenAFS] AFS client hanged

Benjamin Kaduk kaduk@mit.edu
Sun, 22 Dec 2019 15:36:55 -0800


Hi Qiulan,

Can you please confirm that there are no other "INFO: task ... blocked for
more than 120 seconds" messages?  This is rather curious, since of the
three tasks listed here, one is waiting trying to acquire the AFS_GLOCK,
and the other two are sleeping (having dropped the AFS_GLOCK in order to
sleep_.  So if something is holding on to the AFS_GLOCK that keeps the
afs_callback task from acquiring it, that task would be expected to also
show up in the list of blocked tasks as output by the kernel.

Thanks,

Ben

On Mon, Dec 16, 2019 at 02:53:43PM +0800, huangql wrote:
> Dear all,
> 
> Recently, I'm stuck with some AFS issues.
> 
> AFS client hanged with the following log message. In this case, the AFS instance blocked and jobs failed to access any files located in AFS. I have to reboot the work node to recover service.
> 
> Dec  6 15:03:18 bws0825 kernel: INFO: task afs_callback:19124 blocked for more than 120 seconds.
> Dec  6 15:03:18 bws0825 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec  6 15:03:18 bws0825 kernel: afs_callback    D ffff9860d826e180     0 19124      2 0x00000000
> Dec  6 15:03:18 bws0825 kernel: Call Trace:
> Dec  6 15:03:18 bws0825 kernel: afs_callback    D ffff9860d826e180     0 19124      2 0x00000000
> Dec  6 15:03:18 bws0825 kernel: Call Trace:
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa2169df9>] schedule_preempt_disabled+0x29/0x70
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa2167d77>] __mutex_lock_slowpath+0xc7/0x1d0
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa216715f>] mutex_lock+0x1f/0x2f
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc084dff4>] SRXAFSCB_InitCallBackState+0x34/0x470 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc0898047>] ? afs_xdr_vector+0x57/0x90 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc084f19e>] SRXAFSCB_InitCallBackState3+0xe/0x10 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08b6f43>] RXAFSCB_ExecuteRequest+0x6f3/0x8a0 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa1b028ae>] ? getnstimeofday64+0xe/0x30
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08ae589>] ? afs_mutex_exit+0x29/0x40 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08a6a5d>] rxi_ServerProc+0xcd/0x1e0 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08c74c0>] ? afs_shutdown_pagecopy+0x20/0x20 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08af017>] rx_ServerProc+0x87/0xe0 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc084eedd>] afs_RXCallBackServer+0x3d/0x50 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08c76a5>] afsd_thread+0x1e5/0x730 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08c74c0>] ? afs_shutdown_pagecopy+0x20/0x20 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa1ac1da1>] kthread+0xd1/0xe0
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa1ac1cd0>] ? insert_kthread_work+0x40/0x40
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa2175c1d>] ret_from_fork_nospec_begin+0x7/0x21
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa1ac1cd0>] ? insert_kthread_work+0x40/0x40
> Dec  6 15:03:18 bws0825 kernel: INFO: task afs_rxevent:19127 blocked for more than 120 seconds.
> Dec  6 15:03:18 bws0825 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec  6 15:03:18 bws0825 kernel: afs_rxevent     D ffff9860cbbf6180     0 19127      2 0x00000000
> Dec  6 15:03:18 bws0825 kernel: Call Trace:
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa1aaa2d2>] ? del_timer_sync+0x52/0x60
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa2169df9>] schedule_preempt_disabled+0x29/0x70
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa2167d77>] __mutex_lock_slowpath+0xc7/0x1d0
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa216715f>] mutex_lock+0x1f/0x2f
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08bdb58>] afs_osi_TimedSleep+0x118/0x210 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa1ad6b60>] ? wake_up_state+0x20/0x20
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08bdce8>] afs_osi_Wait+0x98/0xd0 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08c74c0>] ? afs_shutdown_pagecopy+0x20/0x20 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08af575>] afs_rxevent_daemon+0x95/0x140 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08c7af6>] afsd_thread+0x636/0x730 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08c74c0>] ? afs_shutdown_pagecopy+0x20/0x20 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa1ac1da1>] kthread+0xd1/0xe0
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa1ac1cd0>] ? insert_kthread_work+0x40/0x40
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa2175c1d>] ret_from_fork_nospec_begin+0x7/0x21
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa1ac1cd0>] ? insert_kthread_work+0x40/0x40
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa2175c1d>] ret_from_fork_nospec_begin+0x7/0x21
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa1ac1cd0>] ? insert_kthread_work+0x40/0x40
> Dec  6 15:03:18 bws0825 kernel: INFO: task afs_checkserver:19870 blocked for more than 120 seconds.
> Dec  6 15:03:18 bws0825 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec  6 15:03:18 bws0825 kernel: afs_checkserver D ffff9860c7811040     0 19870      2 0x00000000
> Dec  6 15:03:18 bws0825 kernel: Call Trace:
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa1aaa2d2>] ? del_timer_sync+0x52/0x60
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa2169df9>] schedule_preempt_disabled+0x29/0x70
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa2167d77>] __mutex_lock_slowpath+0xc7/0x1d0
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa216715f>] mutex_lock+0x1f/0x2f
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08bdb58>] afs_osi_TimedSleep+0x118/0x210 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa1ad6b60>] ? wake_up_state+0x20/0x20
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08bdce8>] afs_osi_Wait+0x98/0xd0 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc0853b08>] afs_CheckServerDaemon+0x118/0x1a0 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08c74c0>] ? afs_shutdown_pagecopy+0x20/0x20 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08c7930>] afsd_thread+0x470/0x730 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffc08c74c0>] ? afs_shutdown_pagecopy+0x20/0x20 [openafs]
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa1ac1da1>] kthread+0xd1/0xe0
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa1ac1cd0>] ? insert_kthread_work+0x40/0x40
> Dec  6 15:03:18 bws0825 kernel: [<ffffffffa2175c1d>] ret_from_fork_nospec_begin+0x7/0x21
> 
> 
> The following message is the specification of AFS cluster at IHEP. 
> 
> AFS clinet--sl7
> 
> [root@bws0825 ~]# rpm -qa|grep openafs
> openafs-1.6-sl-client-1.6.23-289.sl7.x86_64
> openafs-1.6-sl-authlibs-1.6.23-289.sl7.x86_64
> openafs-1.6-sl-devel-1.6.23-289.sl7.x86_64
> openafs-1.6-sl-module-tools-1.6.23-289.sl7.x86_64
> openafs-1.6-sl-krb5-1.6.23-289.sl7.x86_64
> openafs-1.6-sl-1.6.23-289.sl7.x86_64
> openafs-1.6-sl-authlibs-devel-1.6.23-289.sl7.x86_64
> kmod-openafs-1.6-sl-1.6.23-1.SL76.el7.noarch
> kmod-openafs-1.6-sl-957-1.6.23-289.sl7.957.x86_64
> 
> AFS client-SL6:
> openafs-krb5-1.6.23-289.sl6.x86_64
> openafs-client-1.6.23-289.sl6.x86_64
> openafs-1.6.23-289.sl6.x86_64
> openafs-kpasswd-1.6.23-289.sl6.x86_64
> openafs-module-tools-1.6.23-289.sl6.x86_64
> openafs-kernel-source-1.6.23-289.sl6.x86_64
> openafs-firstboot-1.6-1.sl6.noarch
> openafs-authlibs-1.6.23-289.sl6.x86_64
> kmod-openafs-696-1.6.20-257.sl6.696.x86_64
> kmod-openafs-1.6.22.3-1.SL610.el6.noarch
> openafs-compat-1.6.23-289.sl6.x86_64
> 
> AFS server side:
> 
> openafs-authlibs-1.6.20-256.sl6.x86_64
> openafs-server-1.6.11-1.el6.x86_64
> openafs-kernel-source-1.6.11-1.el6.x86_64
> openafs-docs-1.6.11-1.el6.x86_64
> openafs-authlibs-devel-1.6.20-256.sl6.x86_64
> openafs-1.6.11-1.el6.x86_64
> openafs-compat-1.6.11-1.el6.x86_64
> openafs-devel-1.6.20-256.sl6.x86_64
> openafs-client-1.6.11-1.el6.x86_64
> kmod-openafs-1.6.11-1.2.6.32_504.3.3.el6.x86_64
> openafs-krb5-1.6.11-1.el6.x86_64
> 
> 
> Does the 1.6.23 is not compatible with the linux kernel or AFS server version?
> 
> Any information you provided would be appreciated. Thanks.
> 
> 
> Regards,
> Qiulan
> 
> 
> 
> 
> huangql
> ====================================================================
> Computing center,the Institute of High Energy Physics, CAS, China
> Qiulan Huang                       Tel: (+86) 10 8823 6087
> P.O. Box 918-7                       Fax: (+86) 10 8823 6839
> Beijing 100049  P.R. China           Email: huangql@ihep.ac.cn
> ===================================================================