[OpenAFS] 1.6 clients stuck after timeout writing to an overloaded server

Stephan Wiesand stephan.wiesand@desy.de
Tue, 15 Nov 2011 16:34:36 +0100


Hi All,

for the first time, we had 1.6 clients participating in our server stress test procedure, and it didn't work well.

Workload:
 - 128 concurrent jobs on 66 client machines
 - each job writes four files of ~700 MB, sequentially, using cp (a rough sketch follows below)
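
As an illustration only, here is a minimal sketch of what each job effectively does; the target path and filename pattern are placeholders, and the real jobs copy existing ~700 MB files with cp rather than generating data:

  #!/usr/bin/env python
  # Rough per-job equivalent of the workload, for illustration only:
  # four ~700 MB files written sequentially into an AFS directory.
  # TARGET and the filename pattern are placeholders, not the real setup.
  import os

  TARGET = "/afs/example.org/scratch/stress"   # placeholder path
  SIZE = 700 * 1024 * 1024                     # ~700 MB per file
  CHUNK = b"\0" * (1024 * 1024)                # write in 1 MiB pieces

  for i in range(1, 5):
      path = os.path.join(TARGET, "iso_%04d_%08d" % (1, i))
      with open(path, "wb") as f:
          for _ in range(SIZE // len(CHUNK)):
              f.write(CHUNK)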

Server: OpenAFS 1.4.14, Scientific Linux 5.7, amd64, two bonded GbE links
Clients:
 -  64 x OpenAFS 1.4.14 (+ gerrit 3286), Scientific Linux 5.7, amd64, cache on ext3
 -   2 x OpenAFS 1.6.0  (unpatched), Scientific Linux 6.1, amd64, cache on ext4
     (kernel 2.6.32-131.17.1.el6, module was built against GA kernel 2.6.32-71.el6, kABI deps are met)

Results:
 - the 126 jobs running on the 64 1.4.14 clients finished successfully
 - the two jobs on the two 1.6.0 clients timed out while writing the first file, and both clients are now stuck

Error messages from client 2 (695 MB of the first file made it to the server):

cp: writing `iso_0001_00000001': No such file or directory
cp: closing `iso_0001_00000001': Connection timed out
failed to write iso_0001_00000001

Corresponding syslog messages:

10:02:11 Lost contact with file server
10:02:11 Lost contact with file server
10:02:12 failed to store file (110)
10:02:22 file server is back up
10:02:22 file server is back up
10:02:32 Lost contact with file server
10:02:32 Lost contact with file server
10:02:52 file server is back up
10:02:52 file server is back up

Error messages from client 1 (335 MB of the first file made it to the server):

cp: closing `iso_0002_00000001': Connection timed out
failed to write iso_0002_00000001

Corresponding syslog messages:

09:56:49 Lost contact with file server
09:56:49 Lost contact with file server
09:56:53 blade80 kernel: afs: failed to store file (110)
09:57:18 file server is back up
09:57:18 file server is back up
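
For reference, the (110) in "failed to store file (110)" is the Linux ETIMEDOUT errno, consistent with cp's "Connection timed out" above. A quick check on Linux:

  import errno, os
  print(errno.errorcode[110])   # ETIMEDOUT
  print(os.strerror(110))       # Connection timed out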


Alt-SySRq-T logs this for a hanging /bin/ls /afs:

ls            S 0000000000000007     0  5518   5333 0x00000080
ffff88041afdbc08 0000000000000082 0000000000000000 ffffffff8111fb61
ffff880000029d80 ffff880400000002 ffff88041afdbc48 00000001681563f8
ffff8804169c3af8 ffff88041afdbfd8 000000000000f598 ffff8804169c3af8
Call Trace:
[<ffffffff8111fb61>] ? __alloc_pages_nodemask+0x111/0x8b0
[<ffffffffa0403c16>] afs_osi_SleepSig+0xf6/0x1a0 [openafs]
[<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
[<ffffffffa0403d31>] afs_osi_Sleep+0x71/0xc0 [openafs]
[<ffffffffa0415a43>] afs_GetDCache+0x1923/0x2260 [openafs]
[<ffffffff811377b7>] ? handle_pte_fault+0xf7/0xb50
[<ffffffffa045af19>] afs_linux_readdir+0x169/0xa60 [openafs]
[<ffffffff811860f0>] ? filldir+0x0/0xe0
[<ffffffff811860f0>] ? filldir+0x0/0xe0
[<ffffffff81186370>] vfs_readdir+0xc0/0xe0
[<ffffffff811864f9>] sys_getdents+0x89/0xf0
[<ffffffff8100b172>] system_call_fastpath+0x16/0x1b

There are also many of these (from a system maintenance task scheduled by cron):

perl          S 0000000000000004     0  2607   2606 0x00000080
ffff88041704b718 0000000000000086 0000000000000000 0000000000012b40
000000001704a000 ffff88041704b6a8 ffffffff8112eff9 000000016707a731
ffff88041cbd1af8 ffff88041704bfd8 000000000000f598 ffff88041cbd1af8
Call Trace:
[<ffffffff8112eff9>] ? zone_statistics+0x99/0xc0
[<ffffffffa0403c16>] afs_osi_SleepSig+0xf6/0x1a0 [openafs]
[<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
[<ffffffffa0403d31>] afs_osi_Sleep+0x71/0xc0 [openafs]
[<ffffffffa0415a43>] afs_GetDCache+0x1923/0x2260 [openafs]
[<ffffffffa043cdaf>] afs_lookup+0x3df/0x1cf0 [openafs]
[<ffffffffa042ce4f>] ? afs_FindVCache+0x2ff/0x5b0 [openafs]
[<ffffffffa0430f0e>] ? afs_GetVCache+0x7e/0x590 [openafs]
[<ffffffffa045dc5a>] afs_linux_dentry_revalidate+0x3ba/0x540 [openafs]
[<ffffffffa042cc32>] ? afs_FindVCache+0xe2/0x5b0 [openafs]
[<ffffffff814dc34e>] ? mutex_lock+0x1e/0x50
[<ffffffffa042d269>] ? afs_PutVCache+0x79/0x140 [openafs]
[<ffffffff8118a7cd>] ? __d_lookup+0xad/0x150
[<ffffffff811806aa>] do_lookup+0x5a/0x220
[<ffffffff81180dd9>] __link_path_walk+0x569/0x820
[<ffffffff81181228>] do_follow_link+0x198/0x440
[<ffffffff81180a7b>] __link_path_walk+0x20b/0x820
[<ffffffff8118175a>] path_walk+0x6a/0xe0
[<ffffffff8118192b>] do_path_lookup+0x5b/0xa0
[<ffffffff81173e61>] ? get_empty_filp+0xa1/0x170
[<ffffffff8118259b>] do_filp_open+0xfb/0xd90
[<ffffffff81041594>] ? __do_page_fault+0x1e4/0x480
[<ffffffff811911f0>] ? mntput_no_expire+0x30/0x110
[<ffffffff8118f1e2>] ? alloc_fd+0x92/0x160
[<ffffffff8116f989>] do_sys_open+0x69/0x140
[<ffffffff8116faa0>] sys_open+0x20/0x30
[<ffffffff8100b172>] system_call_fastpath+0x16/0x1b

Sometimes afs_osi_SleepSig is the first line in the call trace, but in most cases it's preceded by "? zone_statistics" as above. (The "?" entries are addresses the unwinder found on the stack but could not confirm as part of the actual call chain, so they are probably just stale frames.)
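
For the record, the traces above are SysRq "t" task dumps; the same dump can also be triggered without the keyboard (as root, with sysrq enabled), e.g. from Python:

  # Equivalent to Alt-SysRq-T: dump all task states to the kernel log.
  # Requires root and the sysrq facility enabled (kernel.sysrq).
  with open("/proc/sysrq-trigger", "w") as f:
      f.write("t")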

The client still responds to rxdebug and cmdebug.

This is probably easy to reproduce in our environment, but I'm leaving the clients in this state for the moment, just in case there's more data to collect.

Any ideas?

Thanks,
	Stephan

bcc: to openafs-bugs

-- 
Stephan Wiesand
DESY -DV-
Platanenallee 6
15738 Zeuthen, Germany