[OpenAFS] fs: server not responding promptly

John Tang Boyland boyland@cs.uwm.edu
Wed, 09 Feb 2011 22:44:30 -0600

Since the start of the semester, OpenAFS has occasionally seemed to hang
for a few seconds (5? 10?) during operations such as writing files.
I finally had it happen while running a script that was making fs calls,
and got the message:
fs:'path-to-directory-in-afs': server not responding promptly

The FileLog on the server (jeremiah.cs.uwm.edu) for that time period shows:
Wed Feb  9 22:14:06 2011 CB: ProbeUuid for 999.102.202.55:2841 failed -01
Wed Feb  9 22:15:07 2011 CheckHost_r: Probing all interfaces of host 999.35.48.249:56648 failed, code -01
Wed Feb  9 22:15:09 2011 CB: ProbeUuid for 999.131.13.134:7001 failed -01
Wed Feb  9 22:16:04 2011 CB: ProbeUuid for 999.59.5.145:63713 failed -01
Wed Feb  9 22:16:05 2011 CB: WhoAreYou failed for host gge5870 (999.30.179.54:7001), error -01
Wed Feb  9 22:16:12 2011 CB: ProbeUuid for 999.100.203.66:53467 failed -01
Wed Feb  9 22:16:36 2011 CB: ProbeUuid for 999.229.195.248:49341 failed -01
Wed Feb  9 22:18:12 2011 CB: ProbeUuid for 999.102.202.55:2846 failed -01
Wed Feb  9 22:19:14 2011 CB: ProbeUuid for 999.131.13.134:12627 failed -01

(I have obscured the first byte in each network id.)
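
To see whether a handful of clients account for most of these failures,
the entries can be tallied per address.  The following is only a minimal
sketch: the regular expression is inferred from the excerpt above rather
than from the fileserver source, and tally_failures is a hypothetical
helper name, not part of OpenAFS.

```python
import re
import sys
from collections import Counter

# Pattern inferred from the FileLog excerpt above (an assumption, not taken
# from the fileserver source): a probe name followed later by address:port.
FAILURE = re.compile(
    r"(?P<kind>ProbeUuid|WhoAreYou|CheckHost_r).*?"
    r"(?P<addr>\d+\.\d+\.\d+\.\d+):(?P<port>\d+)"
)


def tally_failures(lines):
    """Count failed callback probes per client address, ignoring the port."""
    counts = Counter()
    for line in lines:
        if "failed" not in line:
            continue
        match = FAILURE.search(line)
        if match:
            counts[match.group("addr")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python tally.py <path to a copy of FileLog>
    with open(sys.argv[1]) as log:
        for addr, count in tally_failures(log).most_common():
            print(count, addr)
```

Run over the excerpt above, this counts two failures each for the
addresses ending in .102.202.55 and .131.13.134 and one for each of the
others.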

% rxdebug jeremiah.cs.uwm.edu -version
Trying (port 7000):
AFS version:  OpenAFS 1.4.12 built  2010-03-09 
% fs --version
openafs 1.4.3
% rxdebug localhost -version -port 7001
Trying (port 7001):
AFS version:  OpenAFS 1.4.11 built  2009-07-13 

People notice the delays on Windows, Mac OS X, and Solaris machines.
(The machine I caught it on above was running Solaris 10.)

On Mac OS X and Windows, the delays are particularly troublesome
because they are long enough for the OS to time out
and give the application an I/O error.  The application then
reports that the files are no longer there, which is highly disturbing
to my students.

I'm running the file server with all default values.
Perhaps I need to tune the number of daemons?

My original guess was that the server hangs while waiting to break
callbacks to clients that are behind firewalls and not responding.  But
even running 'fs checks' from every client that might be accessing the
volume doesn't seem to help; operations still take a few seconds longer
than they should.  In any case, this sort of behavior would presumably
drive everyone mad and would have been fixed before 1.4.12, so now I'm
at a loss.
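
One way to test the callback guess would be to timestamp the stalls on a
client and compare them with the failure times in the FileLog.  The
sketch below is only an illustration under stated assumptions: the
function names, the two-second threshold, and the one-second interval
are all made up for the example, and it assumes that an fsync on a file
in AFS forces the data out to the file server.

```python
import os
import time


def timed_write(path, payload=b"x"):
    """Write and fsync a small file; return the elapsed time in seconds."""
    start = time.monotonic()
    with open(path, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())
    return time.monotonic() - start


def probe(path, rounds=10, interval=1.0, threshold=2.0, report=print):
    """Repeat timed_write and report every round that takes >= threshold."""
    slow = []
    for _ in range(rounds):
        elapsed = timed_write(path)
        if elapsed >= threshold:
            # time.ctime() uses the same layout as the FileLog timestamps.
            report(f"{time.ctime()} write took {elapsed:.1f}s")
            slow.append(elapsed)
        time.sleep(interval)
    return slow
```

Calling probe() with a scratch file inside the affected volume and a
large rounds value should leave a list of stall times that can be lined
up against the FileLog entries by hand.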

Suggestions always appreciated,