[OpenAFS] Re: OpenAFS freeze problems

Mon, 13 Feb 2012 13:03:15 -0600

On Mon, 13 Feb 2012 12:31:53 -0600
John Tang Boyland <boyland@pabst.cs.uwm.edu> wrote:

> This cell has about 40 students on it accessing files on three servers
> using their laptops which probably have firewalls causing them to
> ignore callback requests.  Unless the OpenAFS installation process
> opens up 7001 to outside access

What platform are the clients running? I believe the Windows installer
does indeed (try to) open 7001.

> there's basically nothing I can do about this bad behavior.

Well, "nothing" is going a bit far. On the fileserver side, you can
restrict access to those files to people who use non-broken clients. If
you know which IPs the clients that ignore callback breaks are coming
from, you can also block them at the network level.

From a more general standpoint, you can fix the clients. If they're
ignoring callback breaks, they are going to continue to cause problems
like this, and the clients themselves are likely to see stale/incorrect
data.

> My guess is that the server's threads all get used up waiting for
> callback breaks to be ack'ed and so the fileserver stops responding.
> But is there something more I can do to find out why the freeze is
> happening?  Is there some rxdebug command that I can run when a freeze
> happens?

I'm a little confused by this... by 'freeze', do you mean that
everything on the server is inaccessible? Or just that the write takes
over 30 seconds to do anything, and requests to the same file stall? If
the latter, there's not much you can do about that; we must break
callbacks before the write completes, and if a client is not responding
to a callback break, we need to wait some seconds to ensure we've tried
hard enough to inform it.

To see if you're running out of threads, running
'rxdebug <fileserver> -noconn' will tell you how many threads are idle
and how many requests are waiting for a free thread. If you want to see
in general what the fileserver is blocked on, you can look at a core of
the fileserver process. However, if you just think that it's a host
ignoring callback breaks... that seems pretty likely to be all that it
is.
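
If you want to watch those two numbers over time, a rough sketch along
these lines could run rxdebug and pull them out. The wording of the
rxdebug summary lines ("threads are idle", "calls waiting for a thread")
is written from memory and should be treated as an assumption, so check
it against your version. The function names and the sample text are made
up for illustration.

```python
import re
import subprocess

# Assumed wording of the two rxdebug summary lines; adjust the patterns
# if your rxdebug prints them differently.
IDLE_RE = re.compile(r"(\d+)\s+threads are idle")
WAITING_RE = re.compile(r"(\d+)\s+calls waiting for a thread")


def parse_rxdebug(text):
    """Return (idle_threads, waiting_calls); an entry is None if not found."""
    idle = IDLE_RE.search(text)
    waiting = WAITING_RE.search(text)
    return (
        int(idle.group(1)) if idle else None,
        int(waiting.group(1)) if waiting else None,
    )


def check_fileserver(host, port=7000):
    """Run 'rxdebug <host> <port> -noconn' and parse its summary output."""
    result = subprocess.run(
        ["rxdebug", host, str(port), "-noconn"],
        capture_output=True, text=True, timeout=30, check=True,
    )
    return parse_rxdebug(result.stdout)


# Hypothetical sample output, for illustration only.
SAMPLE = """Trying 192.0.2.10 (port 7000):
Free packets: 330, packet reclaims: 0, calls: 0, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
11 threads are idle
Done.
"""
```

If the idle count sits at 0 and the waiting count keeps growing during a
freeze, that points at thread exhaustion; if there are still idle
threads, the problem is somewhere else.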

> Is there a simple solution -- like tuning a parameter (more threads?)
> that could make this behavior less common?

If you're using -L or '-p 128', the thread count is already as high as
it can go for a 1.4 fileserver. For a 1.6 fileserver you can go to...
256, was it?

But that's not going to help if the problem is unrelated to running out
of threads.

-- 
Andrew Deason
adeason@sinenomine.net