[OpenAFS] fs: server not responding promptly
John Tang Boyland
Wed, 09 Feb 2011 22:44:30 -0600
Since the start of the semester, OpenAFS seems to occasionally hang
for a few seconds (5? 10?) when trying to do things like write files.
I finally had it happen while running a script that was doing fs calls,
and got the message:
fs:'path-to-directory-in-afs': server not responding promptly
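One way to pin down exactly when the hangs occur, so they can be lined up against server-side timestamps, is a small probe that repeatedly times a write in the affected directory. The following is only a sketch: the function names, the probe file name, and the 2-second threshold are all made up for illustration, and it assumes the stall shows up when a small file is written, synced, and closed.

```python
import os
import time

def time_write(directory, payload=b"x" * 4096):
    """Write, fsync, and close a small file in `directory`; return elapsed seconds."""
    # Hypothetical probe file name; anything unique to this process would do.
    path = os.path.join(directory, ".latency-probe-%d" % os.getpid())
    start = time.monotonic()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    # The timer stops after the file is closed, so a stall on close is included.
    elapsed = time.monotonic() - start
    os.remove(path)
    return elapsed

def watch(directory, rounds, interval=1.0, threshold=2.0):
    """Run `rounds` probes and return (local timestamp, seconds) for each slow one."""
    stalls = []
    for _ in range(rounds):
        elapsed = time_write(directory)
        if elapsed >= threshold:
            # Same layout as the timestamps in the server's FileLog.
            stalls.append((time.strftime("%a %b %d %H:%M:%S %Y"), elapsed))
        time.sleep(interval)
    return stalls
```

Calling `watch()` on the AFS directory for a few hundred rounds would give a list of stall times to compare with the server log, provided the client and server clocks agree.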
The FileLog for the server (jeremiah.cs.uwm.edu) from the appropriate time has:
Wed Feb 9 22:14:06 2011 CB: ProbeUuid for 9188.8.131.52:2841 failed -01
Wed Feb 9 22:15:07 2011 CheckHost_r: Probing all interfaces of host 9184.108.40.206:56648 failed, code -01
Wed Feb 9 22:15:09 2011 CB: ProbeUuid for 9220.127.116.11:7001 failed -01
Wed Feb 9 22:16:04 2011 CB: ProbeUuid for 918.104.22.168:63713 failed -01
Wed Feb 9 22:16:05 2011 CB: WhoAreYou failed for host gge5870 (922.214.171.124:7001), error -01
Wed Feb 9 22:16:12 2011 CB: ProbeUuid for 9126.96.36.199:53467 failed -01
Wed Feb 9 22:16:36 2011 CB: ProbeUuid for 9188.8.131.52:49341 failed -01
Wed Feb 9 22:18:12 2011 CB: ProbeUuid for 9184.108.40.206:2846 failed -01
Wed Feb 9 22:19:14 2011 CB: ProbeUuid for 9220.127.116.11:12627 failed -01
(I have obscured the first byte in each network id.)
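To see whether the client-side hangs coincide with bursts of these failed probes, the FileLog entries can be bucketed by minute. This is a rough sketch, not a tool that ships with OpenAFS: the function name and the list of marker strings are my own, taken only from the messages in the excerpt above, and the timestamp parsing assumes the log uses English day and month names.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp prefix of a FileLog line, e.g. "Wed Feb 9 22:14:06 2011 ".
LINE_RE = re.compile(r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) (.*)$")

# Message fragments seen in the excerpt above (not an exhaustive list).
FAILURE_MARKERS = ("ProbeUuid", "WhoAreYou", "CheckHost_r")

def failures_per_minute(lines):
    """Count failed callback-probe messages per minute of log time."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        stamp, message = m.groups()
        if "failed" not in message:
            continue
        if not any(marker in message for marker in FAILURE_MARKERS):
            continue
        # Collapse any padding spaces before parsing the timestamp.
        when = datetime.strptime(" ".join(stamp.split()), "%a %b %d %H:%M:%S %Y")
        counts[when.strftime("%Y-%m-%d %H:%M")] += 1
    return counts

# Three lines copied from the excerpt, addresses left obscured as posted.
SAMPLE = [
    "Wed Feb 9 22:14:06 2011 CB: ProbeUuid for 9188.8.131.52:2841 failed -01",
    "Wed Feb 9 22:15:07 2011 CheckHost_r: Probing all interfaces of host 9184.108.40.206:56648 failed, code -01",
    "Wed Feb 9 22:15:09 2011 CB: ProbeUuid for 9220.127.116.11:7001 failed -01",
]

if __name__ == "__main__":
    for minute, n in sorted(failures_per_minute(SAMPLE).items()):
        print(minute, n)
```

On the server, the same function could be fed the lines of the real FileLog instead of `SAMPLE`; minutes with several failures would be the ones to compare against the times at which clients reported hangs.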
% rxdebug jeremiah.cs.uwm.edu -version
Trying 18.104.22.168 (port 7000):
AFS version: OpenAFS 1.4.12 built 2010-03-09
% fs --version
% rxdebug localhost -version -port 7001
Trying 127.0.0.1 (port 7001):
AFS version: OpenAFS 1.4.11 built 2009-07-13
People notice the delays on Windows, Mac OS X, and Solaris machines.
(The machine I caught it on above was running Solaris 10.)
On Mac OS X and Windows, the delays are particularly troublesome
because they are long enough for the OS to time out
and give the application an I/O error. This causes the application
to say the files aren't there anymore, which is highly disturbing
to my students.
I'm running the file server with all default values.
Perhaps I need to tune the number of daemons?
My original guess was that the server is hanging while waiting to break callbacks
to clients that are behind firewalls and not responding. But even
running 'fs checks' from all possible clients that are accessing the
volume doesn't seem to help; operations still hang for a few seconds.
But this sort of behavior presumably would drive everyone mad and would
have been fixed before 1.4.12, so now I'm at a loss.
Suggestions always appreciated,