[OpenAFS] httpd blocked

Jonathan Nilsson jnilsson@uci.edu
Mon, 20 Jun 2011 16:19:01 -0700


hello,

this past weekend our webserver, which serves pages from AFS, crashed and I 
found several messages like the following in /var/log/messages:

Jun 18 13:19:51 web1 kernel: INFO: task httpd:26383 blocked for more than 120 seconds.
Jun 18 13:19:51 web1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 18 13:19:51 web1 kernel: httpd         D 0001B845  2032 26383  32143   26384 26382 (NOTLB)
Jun 18 13:19:51 web1 kernel:        c7449e48 00000082 1e778e40 0001b845 00000046 00000002 f887e080 00000007
Jun 18 13:19:51 web1 kernel:        dff56000 1e7e1fb4 0001b845 00069174 00000000 dff5610c c3012900 f3e77740
Jun 18 13:19:51 web1 kernel:        f24491e0 00000000 00000000 ea22cb80 00000000 00000040 00000000 ea22cb80
Jun 18 13:19:51 web1 kernel: Call Trace:
Jun 18 13:19:51 web1 kernel:  [<f964f78d>] afs_access+0x320/0x337 [openafs]
Jun 18 13:19:51 web1 kernel:  [<c061d975>] __mutex_lock_slowpath+0x4d/0x7c
Jun 18 13:19:51 web1 kernel:  [<c061d9b3>] .text.lock.mutex+0xf/0x14
Jun 18 13:19:51 web1 kernel:  [<c048219b>] do_lookup+0x7a/0x174
Jun 18 13:19:51 web1 kernel:  [<c0483fc8>] __link_path_walk+0x87a/0xd4b
Jun 18 13:19:51 web1 kernel:  [<c04844d1>] link_path_walk+0x38/0x95
Jun 18 13:20:24 web1 kernel:  [<c0484892>] do_path_lookup+0x219/0x27f
Jun 18 13:20:24 web1 kernel:  [<c0484fec>] __user_walk_fd+0x29/0x3a
Jun 18 13:20:24 web1 kernel:  [<c0474e92>] sys_faccessat+0x93/0x126
Jun 18 13:20:24 web1 kernel:  [<c044bf62>] audit_syscall_entry+0x15a/0x18c
Jun 18 13:20:24 web1 kernel:  [<c0474f34>] sys_access+0xf/0x13
Jun 18 13:20:24 web1 kernel:  [<c0404f17>] syscall_call+0x7/0xb
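
to get a feel for how many tasks were reported as blocked, and which programs they belonged to, the "INFO: task ... blocked" lines can be tallied. this is only a rough sketch: it assumes Python 3 (not what CentOS 5 ships by default), and the function name count_hung_tasks is made up for illustration. the pattern is based solely on the message format shown in the excerpt above.

```python
import re
from collections import Counter

# Pattern taken from the kernel line in the log excerpt:
#   "INFO: task httpd:26383 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+)"
)


def count_hung_tasks(lines):
    """Count hung-task reports per program name in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            counts[match.group("comm")] += 1
    return counts
```

it could be run over the log with something like `count_hung_tasks(open("/var/log/messages", errors="replace"))`; a count that climbs steadily for httpd would fit the picture of more and more workers getting stuck.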

this system is 32-bit CentOS 5.5 (so several packages are quite out of date) 
running OpenAFS 1.4.14. other AFS clients did not have any problems that we 
are aware of, but this web server is under the heaviest load.

i suspect that the system kept spawning httpd processes as old ones got blocked, 
and eventually it ran out of memory and became unresponsive. after a reboot it 
works fine. so the question is: what caused the afs cache manager to respond so 
slowly?
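
if it happens again, one way to check the pile-up theory before the box dies would be to count processes stuck in uninterruptible sleep (state "D", as in the httpd line of the trace). the following is a minimal sketch, assuming Python 3 and the usual Linux /proc/<pid>/stat layout of "pid (comm) state ..."; parse_stat and count_uninterruptible are hypothetical helper names, not part of any existing tool.

```python
import os
from collections import Counter


def parse_stat(text):
    """Return (comm, state) from the contents of a /proc/<pid>/stat file."""
    # comm sits in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    start = text.index("(")
    end = text.rindex(")")
    comm = text[start + 1:end]
    state = text[end + 1:].split()[0]
    return comm, state


def count_uninterruptible(proc_root="/proc"):
    """Count processes in state 'D' per program name."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the line was unparsable
        if state == "D":
            counts[comm] += 1
    return counts
```

running count_uninterruptible() every minute or so and logging the result would show whether the number of blocked httpd workers grows over time.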

has anyone seen kernel messages like this? how can i tell whether the problem 
is with the client or the server? i see no error messages in BosLog, FileLog, 
or VolserLog on our servers...

i may need to adjust the afsd or fileserver/volserver arguments.

the client's /etc/sysconfig/openafs has:
AFSD_ARGS="-dynroot -fakestat-all -daemons 6 -volumes 500 -chunksize 20 -blocks 5242880"
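
to spell out what those cache numbers mean, here is a small sketch of the arithmetic. it assumes Python 3, and it relies on my reading of the afsd man page that -blocks is given in 1 KB blocks and -chunksize is the base-2 logarithm of the chunk size in bytes. cache_summary is a made-up helper name. since -files is not set, the number of cache files afsd actually creates is whatever it picks by default and is not computed here.

```python
import shlex


def cache_summary(afsd_args):
    """Derive cache size and chunk size from an AFSD_ARGS-style string."""
    tokens = shlex.split(afsd_args)
    opts = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # An option takes a value if the next token does not look like an option.
        if i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
            opts[tok] = tokens[i + 1]
            i += 2
        else:
            opts[tok] = None
            i += 1

    cache_bytes = int(opts["-blocks"]) * 1024    # assumed: 1 KB blocks
    chunk_bytes = 2 ** int(opts["-chunksize"])   # assumed: log2 of bytes
    return {
        "cache_gib": cache_bytes / 2**30,
        "chunk_mib": chunk_bytes / 2**20,
        "max_full_chunks": cache_bytes // chunk_bytes,
    }
```

under those assumptions the settings above work out to a 5 GiB cache with 1 MiB chunks, i.e. room for at most 5120 full chunks.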

our servers' BosConfig lines for fileserver and volserver:
parm /usr/afs/bin/fileserver -L
parm /usr/afs/bin/volserver -p 128

i saw Russ Allbery's recent message on another thread saying that he uses these 
parameters on the fileserver, so i can try them:

/usr/lib/openafs/fileserver -L -l 1000 -s 1000 -vc 1000 -cb 200000 \
     -rxpck 800 -udpsize 1048576 -busyat 200 -vattachpar 4

thanks,

--Jonathan




-- 
Jonathan.Nilsson@uci.edu
Computing Services
School of Social Sciences
SSPA 4110 | 949.824.1536