[OpenAFS] Help: intermittent fileservice hangs

Tracy Di Marco White afs-info@gendalia.org
Tue, 03 Feb 2004 16:09:41 -0600


In message <Pine.GSO.4.58-035.0401271302440.29100@johnstown.andrew.cmu.edu>, De
rrick J Brashear writes:
>On Tue, 27 Jan 2004 cball@bu.edu wrote:
>
>> Over the past weekend we had numerous, intermittent AFS access problems.
>
>Guess: you were running out of threads.
>
>rxdebug (fileserver) 7000
>when it's happening.

So we had this happen on our busiest file server last Wednesday.

grep wait /tmp/rx10 | wc -l
       551
...
   call 0: # 5127, state precall, mode: eof, flags: waiting_for_process receive_done, has_input_packets

This happened a little before 5pm, and gradually got better, by 5:30 it
was down to 243.  The fileserver is currently running Transarc, but I'm
in the process of upgrading our file servers to OpenAFS.

I turned up fileserver logging, and that was rather verbose but I don't
know what I'm looking for.  I don't think the logging was turned up
soon enough to be of much use, since it was more or less when things were
getting better.

Before I turned up logging, possibly when things were getting bad:
Wed Jan 28 16:28:16 2004 CB: RCallBackConnectBack (host.c) failed for host 81badf78.7001
Wed Jan 28 16:29:12 2004 CB: RCallBackConnectBack (host.c) failed for host 81ba91af.62671
Wed Jan 28 16:29:44 2004 CB: WhoAreYou failed for 81ba8c1e.7001, error -3
Wed Jan 28 16:30:08 2004 CB: RCallBackConnectBack (host.c) failed for host 81ba0ab4.7001
Wed Jan 28 16:30:20 2004 CB: WhoAreYou failed for 81ba8c1d.7001, error -3
Wed Jan 28 16:30:28 2004 CB: WhoAreYou failed for 81ba8c15.7001, error -3
Wed Jan 28 16:31:04 2004 CB: RCallBackConnectBack (host.c) failed for host 81ba0ac1.7001
Wed Jan 28 16:32:00 2004 CB: RCallBackConnectBack (host.c) failed for host 81ba9399.7001
Wed Jan 28 16:32:57 2004 CB: RCallBackConnectBack (host.c) failed for host 81ba9327.7001
Wed Jan 28 16:33:01 2004 CB: WhoAreYou failed for 81ba8c1c.7001, error -3
Wed Jan 28 16:33:02 2004 CB: WhoAreYou failed for 81ba8c1d.7001, error -3
Wed Jan 28 16:33:40 2004 CB: WhoAreYou failed for 81ba8c16.7001, error -3
Wed Jan 28 16:33:53 2004 ProbeUuid failed for host 81ba3e8b.7001
Wed Jan 28 16:33:59 2004 CB: WhoAreYou failed for 81ba8c15.7001, error -3
Wed Jan 28 16:34:02 2004 ProbeUuid failed for host 81ba424a.7001
Wed Jan 28 16:34:58 2004 ProbeUuid failed for host 81ba3480.7001
Wed Jan 28 16:35:02 2004 CB: WhoAreYou failed for 81ba8c27.7001, error -3
Wed Jan 28 16:35:54 2004 ProbeUuid failed for host 81bac07f.7001
Wed Jan 28 16:36:50 2004 ProbeUuid failed for host 81ba932f.7001
Wed Jan 28 16:37:06 2004 CB: WhoAreYou failed for 81ba8c1e.7001, error -3
Wed Jan 28 16:37:22 2004 CB: WhoAreYou failed for 81ba8c25.7001, error -3
Wed Jan 28 16:37:46 2004 ProbeUuid failed for host 81ba0ac2.7001
Wed Jan 28 16:39:01 2004 ProbeUuid failed for host 81ba909b.7001
Wed Jan 28 16:39:53 2004 CB: WhoAreYou failed for 81ba8c28.7001, error -3
Wed Jan 28 16:39:57 2004 ProbeUuid failed for host 81baf640.7001
Wed Jan 28 16:40:00 2004 CB: WhoAreYou failed for 81ba8c1d.7001, error -3
Wed Jan 28 16:40:05 2004 ProbeUuid failed for host 81ba8ee4.7001
Wed Jan 28 16:40:07 2004 CB: WhoAreYou failed for 81ba8c19.7001, error -3
Wed Jan 28 16:40:12 2004 CB: WhoAreYou failed for 81ba8c29.7001, error -3


a little bit later, when no one could do things if their space was on afs-10:
Wed Jan 28 16:55:31 2004 CB: XCallBackBulk failed, host=81ba8c2a.7001; callback list follows:
Wed Jan 28 16:55:31 2004 CB: Host 81ba8c2a.7001, file 537818307.412.3015 (part of bulk callback)
Wed Jan 28 16:55:31 2004 CB: Host 81ba8c2a.7001, file 537947901.1.1 (part of bulk callback)
Wed Jan 28 16:55:31 2004 BreakDelayedCallbacks FAILED for host 81ba8c2a which IS UP.  Possible network or routing failure.
Wed Jan 28 16:55:31 2004 MultiProbe failed to find new address for host81ba8c2a.7001
Wed Jan 28 16:55:32 2004 CB: XCallBackBulk failed, host=81ba8c1d.7001; callback list follows:
Wed Jan 28 16:55:32 2004 CB: Host 81ba8c1d.7001, file 537925346.698.2591 (part of bulk callback)
Wed Jan 28 16:55:32 2004 BreakDelayedCallbacks FAILED for host 81ba8c1d which IS UP.  Possible network or routing failure.
Wed Jan 28 16:55:32 2004 MultiProbe failed to find new address for host81ba8c1d.7001
Wed Jan 28 16:56:02 2004 CB: RCallBackConnectBack (host.c) failed for host cd828c1.3264
Wed Jan 28 16:56:58 2004 CB: RCallBackConnectBack (host.c) failed for host 81ba4677.7001

>> read-write below the WWW root directory.  Switching the WWW root into
>> "maintenance mode" (an alternate root directory volume with read-only
>> mounts) solved the problem.  Monday the original scenario was restored;
>> we've gone 30 hours without a relapse.
>
>Too many callbacks being broken, perhaps.

I'm sure this is at least part of my problem, is there a way to fix it
or raise the limits?

>> Fileserver and database server logs showed nothing out of the ordinary.
>> This is a fundamental concern; while different options may be appropriate
>> it is quite disturbing to transition into a non-functional state with
>> nothing in /usr/afs/logs [that I understand] to indicate a problem.
>
>kill -TSTP fileserver-pid
>
>turns up the logging, which goes in /usr/afs/logs/FileLog
>
>
>> "-p <#processes>" options which appear to be interesting.  Is there a way
>> to query or log utilization levels or to get an indication when limits are
>> exceeded?
>
>Look at the threads as above. the xstat_fs_test program also exposes some
>useful data
>
>> What can or should be monitored to expose (and log) activity levels,
>> timeouts, etc.

I use big brother to monitor stuff, I'm writing a little module to check
for the number of waiting_for_process and alert me if it goes above some
arbitrary number.

-Tracy