[OpenAFS] Re: fs: server not responding promptly

John Tang Boyland boyland@cs.uwm.edu
Thu, 10 Feb 2011 15:07:49 -0600


I wrote:
] Since the start of the semester, OpenAFS seems to occasionally hang
] for a few seconds (5? 10?) when trying to do things like write files.
] I finally had it happen while running a script that was doing fs calls,
] and got the message:
] fs:'path-to-directory-in-afs': server not responding promptly

Other people have asked whether this happens on reads.  I haven't
noticed it on reads.  It seems to happen only on writes (and mkdir, etc.)
and only on the first write after a period of inactivity.  After waiting
10 seconds for one write, later writes are fast.  This seems consistent
with the server waiting for callback breaks to complete.
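
To put numbers on "first write is slow, later ones are fast", something
like the following rough Python sketch could be run against a directory
in AFS.  The function names (time_write, probe) and the parameters are
made up for illustration; it only times a create/write/fsync/close/unlink
cycle and assumes nothing about AFS itself.

```python
import os
import tempfile
import time


def time_write(directory, payload=b"x"):
    """Time one create/write/fsync/close/unlink cycle in `directory`.

    Hypothetical helper: returns elapsed wall-clock seconds.
    """
    start = time.monotonic()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # push the data out rather than leaving it cached
    finally:
        os.close(fd)
        os.unlink(path)
    return time.monotonic() - start


def probe(directory, idle_seconds=0.0, repeats=3):
    """Wait `idle_seconds`, then time `repeats` back-to-back writes.

    If only the first write after an idle period stalls, the first
    entry of the returned list should stand out from the rest.
    """
    time.sleep(idle_seconds)
    return [time_write(directory) for _ in range(repeats)]
```

Calling probe() on the affected directory with an idle_seconds of a
minute or more, and comparing the first timing against the others,
would show whether the pattern holds.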

Someone else suggested I use rxdebug.  I ran rxdebug twice while an AFS
write was hanging (the results were printed long before the hang finally
cleared).  Neither time was there any indication of a shortage of
threads.  Here's one example:

% rxdebug jeremiah.cs.uwm.edu
Trying 129.89.143.70 (port 7000):
Free packets: 370, packet reclaims: 331, calls: 409011, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
10 threads are idle
Connection ... (lots of connections to report).
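
For anyone who wants to check this repeatedly, here is a rough Python
sketch that pulls the two thread counters out of rxdebug output like the
above.  It only parses text; it does not run rxdebug.  The function
names and dictionary keys are my own invention, and the sample is just
the summary lines shown above.

```python
import re


def parse_rxdebug_summary(text):
    """Extract thread counters from rxdebug summary text.

    Hypothetical helper: returns a dict holding whichever of
    'calls_waiting' and 'idle_threads' were found in the text.
    """
    summary = {}
    match = re.search(r"(\d+) calls waiting for a thread", text)
    if match:
        summary["calls_waiting"] = int(match.group(1))
    match = re.search(r"(\d+) threads are idle", text)
    if match:
        summary["idle_threads"] = int(match.group(1))
    return summary


def threads_exhausted(summary):
    """True if calls are queued while no thread is idle."""
    return (summary.get("calls_waiting", 0) > 0
            and summary.get("idle_threads", 0) == 0)


SAMPLE = """\
Free packets: 370, packet reclaims: 331, calls: 409011, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
10 threads are idle
"""

if __name__ == "__main__":
    counters = parse_rxdebug_summary(SAMPLE)
    print(counters, "exhausted:", threads_exhausted(counters))
```

On the sample, no calls are waiting and 10 threads are idle, so the
check reports no thread shortage.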

] My original guess is that the server is hanging while waiting to break callbacks
] from clients that are behind firewalls and not responding.  But even
] running 'fs checks' from all possible clients that are accessing the
] volume doesn't seem to work; at least it still takes a few seconds more.
] But this sort of behavior presumably would drive everyone mad and would
] have been fixed before 1.4.12, so now I'm at a loss.

Someone pointed out that one cannot be sure that all possible clients
have been contacted.  That's true.

But all the evidence points back to this being a known problem: the
server hangs on writes to a volume while waiting to break callbacks.
The resulting behavior is very annoying and has bad effects (if the
hang is long enough, there are I/O errors and applications start to fail).

Are we really the only place with a significant number of AFS clients
behind poorly behaved NAT routers?  That seems hard to believe.  We are
a tiny cell with only two file servers and about 100 users at most
(though 90% of them are behind NATs).  On the other hand, the fact that
I keep discovering new ways in which Windows 7 doesn't support AFS may
mean that we have a relatively large group of naive AFS users, who
aren't configuring their NAT routers to support AFS, hence the callback
problems mentioned.

Best regards,

John