[OpenAFS] AFS mount hangs when internet connection is lost?
   
    Derrick Brashear
     
    shadow@gmail.com
       
    Thu, 12 Aug 2010 17:05:28 -0400
    
    
  
On Thu, Aug 12, 2010 at 4:54 PM, Ryan C. Underwood
<nemesis-lists@icequake.net> wrote:
>
> I have a system which acts as a NAT router (Ethernet) to share a CDMA
> modem (USB). The same system runs the AFS client which talks to AFS
> fileservers over the internet.
>
> Occasionally the modem is knocked offline, and when this happens the
> Linux USB driver resets the modem. Whenever the modem is knocked
> offline temporarily even once, the /afs mount and all processes that
> were accessing it at the time that it was disconnected permanently hang
> until the system is rebooted.
>
> The kernel logs show hung_task messages always similar to the following,
> always hanging in afs_PutVCache on each process accessing AFS at the
> time:
>
> [ 4440.472856] INFO: task perl:21072 blocked for more than 120 seconds.
> [ 4440.472861] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 4440.472866] perl          D ffff88008dc8d0b8     0 21072  21071 0x00000000
> [ 4440.472877]  ffff8800921b1b58 0000000000000086 ffff880000000000 0000000000015900
> [ 4440.472887]  ffff8800921b1fd8 0000000000015900 ffff8800921b1fd8 ffff8800902196e0
> [ 4440.472897]  0000000000015900 0000000000015900 ffff8800921b1fd8 0000000000015900
> [ 4440.472907] Call Trace:
> [ 4440.472962]  [<ffffffffa0638089>] ? afs_PutVCache+0x79/0x140 [openafs]
> [ 4440.472973]  [<ffffffff8158730f>] __mutex_lock_slowpath+0xff/0x190
> [ 4440.472982]  [<ffffffff815871eb>] mutex_lock+0x2b/0x50
> [ 4440.472991]  [<ffffffff8115d7b7>] do_lookup+0x107/0x280
> [ 4440.473000]  [<ffffffff8115e1de>] link_path_walk+0x12e/0xab0
> [ 4440.473009]  [<ffffffff8115e613>] link_path_walk+0x563/0xab0
> [ 4440.473016]  [<ffffffff8115ecc7>] path_walk+0x67/0xe0
> [ 4440.473023]  [<ffffffff8115ee9b>] do_path_lookup+0x5b/0xa0
> [ 4440.473031]  [<ffffffff8115fb67>] user_path_at+0x57/0xa0
> [ 4440.473039]  [<ffffffff81155c4c>] vfs_fstatat+0x3c/0x80
> [ 4440.473047]  [<ffffffff81155d6b>] vfs_stat+0x1b/0x20
> [ 4440.473054]  [<ffffffff81155d94>] sys_newstat+0x24/0x50
> [ 4440.473063]  [<ffffffff8158c46e>] ? do_page_fault+0x15e/0x350
> [ 4440.473071]  [<ffffffff81588fb5>] ? page_fault+0x25/0x30
> [ 4440.473080]  [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
>
> Kernel is 2.6.35-13 and OpenAFS is 1.5.75 from the ubuntu repository.
>
> I don't know if it helps, but here is the output of cmdebug -long some
> time after the hang:
>
> $ cmdebug localhost -long
> Lock afs_xvcache status: (none_waiting)
> Lock afs_xdcache status: (none_waiting)
> Lock afs_xserver status: (none_waiting)
> Lock afs_xvcb status: (none_waiting)
> Lock afs_xbrs status: (none_waiting)
> Lock afs_xcell status: (none_waiting)
> Lock afs_xconn status: (none_waiting)
> Lock afs_xuser status: (none_waiting)
> Lock afs_xvolume status: (none_waiting)
> Lock puttofile status: (none_waiting)
> Lock afs_ftf status: (none_waiting)
> Lock afs_xcbhash status: (none_waiting)
> Lock afs_xaxs status: (none_waiting)
> Lock afs_xinterface status: (none_waiting)
> Lock afs_xosi status: (none_waiting)
> Lock afs_xsrvAddr status: (none_waiting)
> Lock afs_xvreclaim status: (none_waiting)
> Lock afsdb_client_loc status: (none_waiting)
> Lock afsdb_req_lock status: (none_waiting)
> Lock afs_discon_lock status: (none_waiting, 1 read_locks(pid:0))
> Lock afs_disconDirtyL status: (none_waiting)
> Lock afs_discon_vc_di status: (none_waiting)
> Lock dynroot status: (none_waiting)
> Lock icequake.net status: (none_waiting)
> ** Cache entry @ 0x8dc8c000 for 0.1.1.1 [dynroot]
>            2048 bytes  DV            3  refcnt     3
>    callback 00000000   expires 0
>    0 opens     0 writers
>    volume root
>    states (0x5), stat'd, read-only
> ** Cache entry @ 0x8dc8d400 for 2.536870916.1.1 [icequake.net]
>    locks: (writer_waiting, write_locked(pid:18986 at:54), 1 waiters)
ok, but what was this pid?
you'll want 1.5.76 shortly, for other reasons.
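
For the pid question, here is a rough sketch of one way to check. It is not part of OpenAFS: the script, its function names, and the regex are assumptions based only on the write_locked(pid:N at:M) form visible in the cmdebug output above. It pulls the lock-holding pids out of saved cmdebug output and looks each one up under /proc, which only helps while the process still exists and if the reported number really is a userspace pid.

```python
import re
import sys
from pathlib import Path

# Matches only the lock-holder form seen in the quoted cmdebug output,
# e.g. "write_locked(pid:18986 at:54)". Other forms are not handled.
LOCK_RE = re.compile(r"write_locked\(pid:(\d+) at:(\d+)\)")


def find_write_lockers(cmdebug_text):
    """Return (pid, at) pairs for every write_locked entry in the text."""
    return [(int(pid), int(at)) for pid, at in LOCK_RE.findall(cmdebug_text)]


def describe_pid(pid, proc_root="/proc"):
    """Return the command name for pid from procfs, or None if unavailable."""
    try:
        return (Path(proc_root) / str(pid) / "comm").read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # Read cmdebug output from stdin and print one line per lock holder.
    for pid, at in find_write_lockers(sys.stdin.read()):
        print(pid, at, describe_pid(pid) or "(no such process)")
```

Saved under a hypothetical name such as lockers.py, it could be fed directly from the command already used above: cmdebug localhost -long | python3 lockers.py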
-- 
Derrick