[OpenAFS] Re: 1.5.77 debian client hangs at new ip address

Gémes Géza geza@kzsdabas.hu
Mon, 11 Oct 2010 05:50:35 +0200

On 2010-10-11 04:03, Andrew Deason wrote:
> On Sun, 10 Oct 2010 19:37:54 +0200
> Gémes Géza <geza@kzsdabas.hu> wrote:
>> I have a Debian squeeze box with 1.5.75 (only the client parts from
>> debs) installed. Everything works fine until I give the box a new IP
>> address. Then afsd seems to hang (after about 20-30 seconds /afs
>> becomes unavailable, e.g. ls /afs never returns). What may be worth
>> mentioning is that the box's IP address (the fixed one, not the newly
>> assigned one) is a member of a group which has some rights defined.
> Run 'cmdebug <client>' while hanging; it will tell you what locks are
> being held and waited on. If you could see if the same thing happens
> with and 1.5.77 clients, it would also help.
> Of course, the problem may not be on the client side: the fileserver may
> just not be talking to you. Do you know the versions of the fileservers
> involved?
I've now upgraded to 1.5.77, and the problem persists. But it is a little
"lighter": the afsd process doesn't block completely. I'm able to ls
/afs, and after 1-2 minutes even the Apache httpd loads some pages
out of /afs, but the system log shows:

INFO: task apache2:2766 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
apache2       D ffff880003449780     0  2764   2758 0x00000000
ffff88003dcf6350 0000000000000282 0000000000000200 ffff8800326efc00
0000000000000021 ffffffff810fd611 000000000000f9e0 ffff88003e70bfd8
0000000000015780 0000000000015780 ffff880002945c40 ffff880002945f38
Call Trace:
 [<ffffffff810fd611>] ? __d_instantiate+0x54/0xbd
 [<ffffffff810fdcbd>] ? d_splice_alias+0xc1/0xc9
 [<ffffffffa020d227>] ? afs_linux_lookup+0x1a3/0x1da [openafs]
 [<ffffffff8100e1e2>] ? check_events+0x12/0x20
 [<ffffffff812fa07a>] ? __mutex_lock_common+0x122/0x192
 [<ffffffffa01ea955>] ? afs_access+0x551/0x5f0 [openafs]
 [<ffffffff812fa1a2>] ? mutex_lock+0x1a/0x31
 [<ffffffff810f5b54>] ? do_lookup+0x80/0x15d
 [<ffffffff810f65d4>] ? __link_path_walk+0x5a5/0x6f5
 [<ffffffff8100e1e2>] ? check_events+0x12/0x20
 [<ffffffff810f6952>] ? path_walk+0x66/0xc9
 [<ffffffff810f7dbc>] ? do_path_lookup+0x20/0x77
 [<ffffffff810f7f48>] ? do_filp_open+0xe5/0x94b
 [<ffffffff8100e1cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81101351>] ? alloc_fd+0x67/0x10c
 [<ffffffff810ec987>] ? do_sys_open+0x55/0xfc
 [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b

and cmdebug reports:
Lock afsdb_client_loc status: (none_waiting, write_locked(pid:2766 at:685))
Lock afs_discon_lock status: (none_waiting, 1 read_locks(pid:0))
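To line up lock holders with the task the kernel reports as blocked, the cmdebug lock lines can be pulled apart mechanically. Below is a minimal Python sketch, assuming every lock line follows the same shape as the two sample lines above. The regular expressions and the names `LockStatus` and `parse_cmdebug` are hypothetical helpers, not part of OpenAFS, and other cmdebug versions may print lock states this sketch does not handle.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Patterns derived only from the two sample lines in this post (assumption).
LOCK_RE = re.compile(r"^Lock (?P<name>\S+) status: \((?P<waiting>\w+), (?P<state>.*)\)$")
WRITE_RE = re.compile(r"write_locked\(pid:(?P<pid>\d+) at:(?P<at>\d+)\)")
READ_RE = re.compile(r"(?P<count>\d+) read_locks\(pid:(?P<pid>\d+)\)")


@dataclass
class LockStatus:
    name: str
    waiting: str
    write_pid: Optional[int] = None  # pid holding the write lock, if any
    write_at: Optional[int] = None   # the "at:" value printed by cmdebug
    read_count: int = 0              # number of read locks held


def parse_cmdebug(text: str) -> List[LockStatus]:
    """Extract lock name, waiter state and holders from cmdebug output."""
    locks = []
    for line in text.splitlines():
        m = LOCK_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a "Lock ... status:" line
        status = LockStatus(name=m["name"], waiting=m["waiting"])
        w = WRITE_RE.search(m["state"])
        if w:
            status.write_pid = int(w["pid"])
            status.write_at = int(w["at"])
        r = READ_RE.search(m["state"])
        if r:
            status.read_count = int(r["count"])
        locks.append(status)
    return locks


if __name__ == "__main__":
    sample = (
        "Lock afsdb_client_loc status: (none_waiting, write_locked(pid:2766 at:685))\n"
        "Lock afs_discon_lock status: (none_waiting, 1 read_locks(pid:0))\n"
    )
    # Print only the locks that are currently write-locked, with their holder.
    for lock in parse_cmdebug(sample):
        if lock.write_pid is not None:
            print(lock.name, lock.write_pid, lock.write_at)
```

Run against the output above, it reports afsdb_client_loc as write-locked by pid 2766, the same pid as the apache2 task in the "blocked for more than 120 seconds" message.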

The fileserver is (still) 1.4.7 from Debian stable. I plan to upgrade
the server to 1.5.77 as well (it's part of a bigger migration plan, and
I've gotten stuck on this box, which is going to be a webserver).

Thanks for any advice