[OpenAFS] Re: fs: server not responding promptly

Andrew Deason adeason@sinenomine.net
Wed, 9 Feb 2011 23:47:37 -0600

On Wed, 09 Feb 2011 22:44:30 -0600
John Tang Boyland <boyland@pabst.cs.uwm.edu> wrote:

> Since the start of the semester, OpenAFS seems to occasionally hang
> for a few seconds (5? 10?) when trying to do things like write files.

Just writes? Reads are okay?

> The FileLog for the server (jeremiah.cs.uwm.edu) from the appropriate time has:
> ...
> Wed Feb  9 22:14:06 2011 CB: ProbeUuid for 999.102.202.55:2841 failed -01
> Wed Feb  9 22:15:07 2011 CheckHost_r: Probing all interfaces of host 999.35.48.249:56648 failed, code -01
> Wed Feb  9 22:15:09 2011 CB: ProbeUuid for 999.131.13.134:7001 failed -01
> Wed Feb  9 22:16:04 2011 CB: ProbeUuid for 999.59.5.145:63713 failed -01
> Wed Feb  9 22:16:05 2011 CB: WhoAreYou failed for host gge5870 (999.30.179.54:7001), error -01
> Wed Feb  9 22:16:12 2011 CB: ProbeUuid for 999.100.203.66:53467 failed -01
> Wed Feb  9 22:16:36 2011 CB: ProbeUuid for 999.229.195.248:49341 failed -01
> Wed Feb  9 22:18:12 2011 CB: ProbeUuid for 999.102.202.55:2846 failed -01
> Wed Feb  9 22:19:14 2011 CB: ProbeUuid for 999.131.13.134:12627 failed -01

You probably have many clients that are not reachable and/or are behind
NATs. That will cause writes to be delayed, since any client that has
read that file recently will need to be contacted by the fileserver. If
that client is gone or for some reason cannot be contacted, it's going
to take a little time for the request to time out.
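
To get a feel for how many distinct clients are failing probes, the
quoted FileLog lines can be tallied by address. This is only a minimal
sketch, assuming the log lines look like the excerpt above; the function
name count_failed_probes and the address pattern are illustrative and
not part of OpenAFS.

```python
import re
from collections import Counter

# Matches the client "address:port" in a log line. The pattern is an
# assumption based on the FileLog excerpt quoted above.
ADDR_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def count_failed_probes(log_text):
    """Return a Counter mapping client address -> number of 'failed' lines."""
    failures = Counter()
    for line in log_text.splitlines():
        # Only consider lines that report a failure of some kind.
        if "failed" not in line:
            continue
        match = ADDR_RE.search(line)
        if match:
            # Count by address only; the source port changes between probes.
            failures[match.group(1)] += 1
    return failures
```

Addresses that show up repeatedly are good candidates for clients that
are gone or sitting behind a NAT.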

> I'm running the file server with all default values.
> Perhaps I need to tune the number of daemons?

I'd start with '-L -cb 640000', which should handle the most likely
issues you're hitting. If you run 'rxdebug <server>' you should see a
couple of lines like

X calls waiting for a thread
Y threads are idle

Generally you want X to be 0. If it's not 0 around when you're seeing
delays, that would certainly explain it. -L will increase the number of
threads to make that less likely to happen.
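
If you want to watch those two numbers over time, something like the
following could pull them out of the rxdebug text. It is a rough sketch
that assumes the output contains lines worded exactly like the two shown
above; parse_rxdebug and threads_exhausted are hypothetical helper names.

```python
import re


def parse_rxdebug(output):
    """Extract the waiting-call and idle-thread counts from rxdebug text.

    Returns None for a value whose line is not present in the output.
    """
    waiting = re.search(r"(\d+) calls waiting for a thread", output)
    idle = re.search(r"(\d+) threads are idle", output)
    return {
        "calls_waiting": int(waiting.group(1)) if waiting else None,
        "threads_idle": int(idle.group(1)) if idle else None,
    }


def threads_exhausted(output):
    """True if at least one call is waiting for a thread."""
    stats = parse_rxdebug(output)
    # None (line missing) and 0 both count as "not exhausted".
    return bool(stats["calls_waiting"])
```

Feed it the text printed by 'rxdebug <server>' (captured by hand or with
subprocess.run) at the moments when you notice a delay.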

> My original guess is that the server is hanging while waiting to break
> callbacks from clients that are behind firewalls and not responding.
> But even running 'fs checks' from all possible clients that are
> accessing the volume doesn't seem to work; at least it still takes a
> few seconds more.

Unless your cell is in a closed environment, or you're writing to an
area where you're the only one who has any access, you haven't really
checked "all possible clients".

However, do you mean that it takes several seconds to write a small
amount of data... and if you immediately write some more data to the
same file again, there is _still_ a delay? In that situation, it is
unlikely (but possible!) to be a problem with breaking callbacks, since
once you break a callback, it's gone until the client accesses the file
again.

Andrew Deason