[OpenAFS-devel] 1.4.1-rc10 client hung in afs_WriteDCache() on Linux kernel 2.6.9-34

Rainer Toebbicke rtb@pclella.cern.ch
Wed, 05 Apr 2006 14:07:33 +0200


Hello,


I've got a one-liner here that hangs the AFS client. It causes the 
afs_WriteThroughDSlots process to take the afs_xdcache lock at 620 
and then call afs_WriteDCache(), which appears to start looping inside 
the generic_file_aio_write() stack, in the routine 
balance_dirty_pages(). The exact traceback is:


ESP        EIP        Function (args)
0xda38fbb4 0xc02db9d5 schedule+0x83d (0xc14089a8, 0xc14089a8, 0x390937, 0x1, 0xdead4ead)
0xda38fc1c 0xc02dc271 schedule_timeout+0xd3 (0x64)
0xda38fc58 0xc02dc194 io_schedule_timeout+0x26 (0x0, 0xda28cd10, 0xc01202af, 0xda38fc90, 0xda38fc90)
0xda38fc64 0xc022a9db blk_congestion_wait+0x64 (0xc1642f08, 0x3198, 0x7210, 0xc1642f08, 0x0)
0xda38fcb4 0xc0144777 balance_dirty_pages+0xbe (0xe087f820, 0xc24, 0x2c)
0xda38fd14 0xc0144828 balance_dirty_pages_ratelimited+0x53 (0xc011e7bf, 0x2c, 0xbf8, 0xe0b51280, 0x2c)
0xda38fd28 0xc0141b57 generic_file_buffered_write+0x39a (0x8c24, 0x0, 0xda38feb8, 0x0, 0x2c)
0xda38fdb8 0xc0141fc2 __generic_file_aio_write_nolock+0x389 (0xda38feb8, 0x1, 0xda38fe40, 0x8bf8, 0x0)
0xda38fe10 0xc0142029 generic_file_aio_write_nolock+0x39 (0xda38feb8, 0xd43ddf18, 0xe0b51254, 0x2c, 0xd43dde68)
0xda38fe38 0xc0142213 generic_file_aio_write+0x72 (0x8bf8, 0x0)
0xda38fe5c 0xe0865d76 [ext3]ext3_file_write+0x19 (0x8bf8, 0x0, 0xda38fe9c, 0xc0339cd0, 0x0)
0xda38fe74 0xc015a988 do_sync_write+0x97 (0xd4bb21c4, 0x2c, 0x0, 0xffffffff, 0xffffffff)
0xda38ff10 0xe140c7f5 [libafs]osi_rdwr+0xde (0xe0b51254, 0x2c, 0xda38ff3c, 0x1, 0x8bf8)
0xda38ff3c 0xe140c4bc [libafs]afs_osi_Write+0xb5 (0x2c)
0xda38ff78 0xe13d369a [libafs]afs_WriteDCache+0x3e (0xe0b08c68, 0xe0b08a88, 0x4, 0x44339257, 0xe1438f68)
0xda38ff84 0xe13d2eb5 [libafs]afs_WriteThroughDSlots+0x1b3 (0xd730ca, 0x4433920e, 0x443389a2, 0x44338766, 0x443391d7)
0xda38ffb0 0xe13cce84 [libafs]afs_Daemon+0x12e (0xda28cf36, 0xe1424805, 0xe14115b9, 0x0)
0xda38ffdc 0xe141179d [libafs]afsd_thread+0x1e4


I say "looping" because I was unable to catch it in kdb anywhere else. 
That is weak evidence, since the bulk of the time is actually spent in 
blk_congestion_wait(). Slightly stronger evidence is that a breakpoint 
subsequently placed on osi_rdwr never triggered, and that 
single-stepping with ssb in kdb never appeared to leave the routine.

The machine still runs, but everything that needs AFS hangs 
(presumably because the afs_xdcache lock is still held). If that lock 
*did* ever get released, I did not manage to see it.
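
To illustrate the pattern I suspect, here is a toy user-space sketch in 
Python. It is not OpenAFS code and makes no claim about the real 
locking implementation: the names demo, write_through_slots and 
try_afs_operation are made up for the example, and an Event stands in 
for the write that never seems to return. It only shows that one 
thread blocking while it holds a lock stalls every other thread that 
needs the same lock.

```python
import threading


def demo(timeout=0.2):
    """Toy model: a lock holder stuck in I/O stalls everyone else."""
    xdcache_lock = threading.Lock()  # stand-in for afs_xdcache (assumption)
    holding = threading.Event()      # set once the daemon owns the lock
    io_done = threading.Event()      # stand-in for the cache write returning

    def write_through_slots():
        # Models the daemon thread: take the lock, then block in the write.
        with xdcache_lock:
            holding.set()
            io_done.wait()

    def try_afs_operation():
        # Models any other request that needs the same lock.
        got = xdcache_lock.acquire(timeout=timeout)
        if got:
            xdcache_lock.release()
        return got

    daemon = threading.Thread(target=write_through_slots, daemon=True)
    daemon.start()
    holding.wait()                       # daemon now holds the lock

    blocked = not try_afs_operation()    # times out while the write is stuck

    io_done.set()                        # pretend the write finally completes
    daemon.join()
    recovered = try_afs_operation()      # lock is free again
    return blocked, recovered
```

Calling demo() returns a pair of booleans: whether a second request 
timed out while the "write" was stuck, and whether it succeeded once 
the write was allowed to finish. In the hang I see, the second step 
apparently never happens.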

As this is so easy to reproduce and does not require privileges, I 
won't publish the "how" here. If anybody volunteers, and does not 
bluntly say he or she is going to hang all our clients, I'll send the 
details in a private mail.

BTW: the 2.4.21 kernel hangs as well, but I haven't checked yet where exactly.


-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985       Fax: +41 22 767 7155