[OpenAFS] Transarc AFS 3.6 2.5 server release AAARRRGGGHHH!!!!

Jared Spencer jareds@us.ibm.com
Fri, 22 Dec 2000 09:28:08 -0500

Here is some more information on this problem:

First, it appears that the scenario of NT clients being unable to access
volumes on a fileserver is specific to a 3.6 Patch 1 fileserver and a *GA
release* NT client.  Other NT clients (and UNIX clients, for that matter)
could become internally marked down by a fileserver, but only under
specific circumstances, and that situation is recoverable.  Read on....

AFS 3.6 Patch 1 contains a delta to address Denial of Service types of
attacks from rogue clients.  The release notes describe it generically as
follows:

     This fix resolves a problem that occurred when an AFS File Server
     received requests from AFS Client machines to which it could not
     respond. The requests locked up threads in the File Server and
     rendered the server unusable.

Basically, if the fileserver is unable to respond to a client, it
exponentially backs off the responses until it eventually marks the client
as down.  Once a client is marked down by the fileserver, that entry can be
cleared by restarting the fileserver.  If you suspect that a fileserver has
marked a client down, one of the initial symptoms is that "fs checks" (short
for fs checkservers) returns quickly and reports the fileserver as down when,
in fact, it is up.  You can verify further by running tcpdump and looking for
the following:

[jared] tcpdump -l -N -s 256 port 7000 and host cube and host afstest58
tcpdump: listening on hme0
10:15:41.531302 afstest58.cm.e9052e9c.0460 > cube.fs:  1.046e data C.L.   4
10:15:41.532957 afstest58.cm.e9052e9c.0460 < cube.fs:  0.0407 abort .... 4 -1 (DF)

Here, the fileserver has immediately returned an abort to the GetTime
request from the client.  This is a good indication the fileserver has this
client marked down.  Under normal circumstances, there are very few
scenarios where a "good" client will get marked down.  And, if this
happens, the fileserver can be restarted to clear the entry for that host.
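If you have a longer capture to sift through, the check described above can be roughly automated. The following is only a sketch: it assumes every packet line has the same layout as the two lines shown in the capture above (timestamp, client, direction, server, a field, then the packet type), and the function name and the 10 ms threshold are made up for illustration. An immediate abort is a hint, not proof, that the client is marked down.

```python
import re

# Assumed layout, taken only from the two sample lines in this message:
#   HH:MM:SS.usec <client> <dir> <server>: <field> <type> ...
LINE_RE = re.compile(
    r"^(?P<h>\d+):(?P<m>\d+):(?P<s>\d+\.\d+)\s+"
    r"(?P<client>\S+)\s+(?P<dir>[<>])\s+(?P<server>\S+?):\s+"
    r"\S+\s+(?P<kind>\w+)"
)


def find_immediate_aborts(lines, max_gap=0.01):
    """Return (client, server, gap_seconds) for each client request that the
    server answered with an abort within max_gap seconds.

    max_gap is a hypothetical threshold, not a value from AFS itself.
    """
    hits = []
    pending = {}  # (client, server) -> time of the last client request
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # continuation lines and tcpdump banners are skipped
        t = int(m["h"]) * 3600 + int(m["m"]) * 60 + float(m["s"])
        key = (m["client"], m["server"])
        if m["dir"] == ">" and m["kind"] == "data":
            pending[key] = t
        elif m["dir"] == "<" and m["kind"] == "abort" and key in pending:
            gap = t - pending.pop(key)
            if 0 <= gap <= max_gap:
                hits.append((key[0], key[1], gap))
    return hits
```

Feed it the lines of a saved capture (for example, the output of the tcpdump command above redirected to a file) and look at which client/server pairs show up.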

But enter the 3.5 GA release NT client.

In the 3.5 GA release NT client, there was a bug that would cause it to
fail to return an appropriate code to a fileserver's WhoAreYou requests.
At the time, this was basically a benign bug.  The client still functioned
normally, and the only indication that the call was failing was from the
WhoAreYou failed messages in the FileLog.  However, with the new 3.6 Patch
1 fileserver, we believe the fileserver now treats these failed RPCs as a
failure to respond and marks the client as down.  Although restarting the
fileserver clears the host entry, in this case the client will just quickly
get marked down again.  This NT client bug was fixed in 3.5 Patch 1.
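To make that interaction concrete, here is a toy model of the behavior as described in this message. It is not the fileserver code: the class, the function, the base delay, and the failure limit are all hypothetical, and the real thresholds are not stated here. It only illustrates why a restart helps a healthy client but not one that always fails the probe.

```python
class HostEntry:
    """Toy per-client state; names and numbers are illustrative assumptions."""

    def __init__(self, base_delay=1.0, max_failures=5):
        self.base_delay = base_delay      # hypothetical first back-off delay
        self.max_failures = max_failures  # hypothetical mark-down limit
        self.failures = 0
        self.down = False

    def record_probe(self, ok):
        """Record one server-to-client probe (such as WhoAreYou).

        Returns the delay before the next response, or None once the
        client is marked down.
        """
        if self.down:
            return None
        if ok:
            self.failures = 0
            return 0.0
        self.failures += 1
        if self.failures >= self.max_failures:
            self.down = True
            return None
        # Exponential back-off: the delay doubles with each failure.
        return self.base_delay * 2 ** (self.failures - 1)


def simulate(client_answers_probe, probes=10):
    """Start from a fresh entry (as after a fileserver restart) and probe."""
    entry = HostEntry()
    delays = []
    for _ in range(probes):
        delay = entry.record_probe(client_answers_probe)
        if delay is None:
            break
        delays.append(delay)
    return entry.down, delays
```

In this model, simulate(True) never marks the client down, while simulate(False) ends with the client marked down every time it is run, which mirrors what happens to a client with the WhoAreYou bug after each restart.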

This is a preliminary analysis, and we're continuing to investigate this
here.  If you have any further questions, please contact AFS support.

Also, there was a response to this thread identifying the -dontdelay switch
that has been added.  This is actually unrelated to the DoS delta.  This
switch was to address a situation where clients would spend a long time
trying to stat a directory in which they didn't have appropriate
permissions, and this fix is not yet in the released code for either 3.5 or



Jared Spencer
AFS Technical Lead
Staff Software Engineer
IBM Transarc Labs

Nathan Neulinger <nneul@umr.edu>@openafs.org on 12/18/2000 07:26:27 PM

Sent by:  openafs-info-admin@openafs.org

To:   openafs-info@openafs.org, info-afs@transarc.com
Subject:  [OpenAFS] Transarc AFS 3.6 2.5 server release AAARRRGGGHHH!!!!

Today we did a major AFS server upgrade that has been delayed for far
too long. Unfortunately thanks to an (apparently, I might have missed
it, but it sucks regardless) undocumented new feature in the 2.5 patch
release - we wound up having problems for about 8 hours with NT clients
unable to talk to the server.

Apparently the 2.5 code has a new delta in it that causes servers to
stop talking to clients completely when it thinks that they are flooding
the server with requests. There is no way to tell that this is
happening, no way to shut it off, and no way to get a list of affected

In our case, almost 1500 NT stations with the AFS client had extremely
sporadic and unstable access to afs. What would happen is - the fs
checks output would include all of the servers running 2.5, or some
selection of them.

So - if you're thinking of upgrading to 3.6 2.5, think twice, or at
least be very cautious about it. Shutting down all your clients ahead of
time, and SLOWLY bringing them back up might help, but that's hardly an
option with 1500 stations.

The end result - we wound up backing down to 3.6 2.3 (a huge upgrade
from the 3.4a 5.53 we were running), which fixed the problem.

(To openafs gatekeepers - if you get a delta from transarc/ibm [yeah
right!] that includes this, I _STRONGLY_ suggest that you say 'no
thanks!' or at the very least, make the feature optional.)

BTW - this doesn't seem to affect unix clients. It only affected the NT
clients, running 3.5 or 3.6, didn't seem to matter which, although the
particular behavior differed with 3.5 and 3.6.

-- Nathan

Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-4841
CIS - Systems Programming                Fax: (573) 341-4216