[OpenAFS-devel] Changing timeouts... very slow client recovery to server outage for RO volume

Neulinger, Nathan nneul@umr.edu
Thu, 31 Jan 2002 11:24:37 -0600


If I have a file server/vol server that I have signaled with -STOP (i.e.
suspended them so that they are still listening on the port, but not
doing anything with received packets), and then issue an AFS call on a
client against a volume on that server, it hangs.

That's expected.
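
A minimal sketch of why the hang is expected, using a stand-in rather than
AFS itself: a small UDP echo process on loopback plays the part of the
suspended server. The echo server, `probe`, and `demo_sigstop` are
hypothetical names invented for this illustration. A stopped process keeps
its socket bound, so packets are queued instead of rejected, and the sender
only learns something is wrong when its own timeout fires.

```python
import os
import signal
import socket
import subprocess
import sys

# Hypothetical stand-in for a server: a UDP echo loop on an ephemeral port.
ECHO_SERVER = r"""
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(("127.0.0.1", 0))
print(s.getsockname()[1], flush=True)   # tell the parent which port we got
while True:
    data, addr = s.recvfrom(1024)
    s.sendto(data, addr)
"""


def probe(port, timeout=1.0):
    """Send one datagram; return True if a reply arrives within `timeout`."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as c:
        c.settimeout(timeout)
        c.sendto(b"ping", ("127.0.0.1", port))
        try:
            c.recvfrom(1024)
            return True
        except socket.timeout:
            return False


def demo_sigstop():
    """Return (reply_before_stop, reply_while_stopped, reply_after_cont)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", ECHO_SERVER],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        port = int(proc.stdout.readline())
        before = probe(port)
        os.kill(proc.pid, signal.SIGSTOP)   # same effect as `kill -STOP <pid>`
        stopped = probe(port)               # port still bound, nobody answers
        os.kill(proc.pid, signal.SIGCONT)
        after = probe(port)
        return before, stopped, after
    finally:
        proc.kill()
        proc.wait()
        proc.stdout.close()
```

Calling `demo_sigstop()` on a Unix-like system should show a reply before
the stop, a timeout while the process is stopped, and a reply again after
SIGCONT. The one-second timeout is arbitrary and says nothing about the
timeouts the AFS client actually uses.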

What controls how long the client sits there before generating the
"Connection timed out" message?

Is that mechanism different for a read-only volume located on that
server?

-------

The reason I ask: I was finally able to reproduce a symptom we have seen
MANY times when problems with a server have left the clients unusable.
Note that this isn't a reproduction of the underlying problem with the
server, just something I can do on the server that reproduces the
symptom we've seen.

	Add vol1 on srv1
	Create some large files on the volume
	Replicate to srv1 and srv2
	Set server prefs on client to prefer srv1 (not sure this is necessary)
	Access the volume, but leave at least one of the files untouched (cd into it and ls, or something similar)
	kill -STOP both the fileserver and volserver on srv1
	Now attempt to access one of the files you haven't accessed yet. (I did a tail on a 500MB file.)

It was almost 10 minutes before the client station gave up and switched
to srv2. Needless to say, in an environment that makes heavy use of AFS,
if that volume had been root.cell, it would have caused a significant
problem for (in some cases, would have brought to its knees) any client
that was pointed at that server.

In the case of a read-write volume doing the same set of steps, it took
about 30-45 seconds before it switched.
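
To put numbers on delays like these, something along the following lines
could be run on the client right after stopping the server. This is only a
sketch: `time_first_read` is a hypothetical helper, not part of any AFS
tool, and it simply times one read near the end of a file (roughly what
tail does), returning the elapsed wall-clock time together with any error
the read finally produced.

```python
import time


def time_first_read(path, nbytes=4096):
    """Time one read of the last `nbytes` of `path`.

    Returns (elapsed_seconds, error), where error is None on success or the
    OSError raised by open/seek/read (for example a connection timeout).
    """
    start = time.monotonic()
    error = None
    try:
        with open(path, "rb") as f:
            f.seek(0, 2)                      # jump to end of file
            size = f.tell()
            f.seek(max(0, size - nbytes))     # back up to the tail
            f.read(nbytes)
    except OSError as exc:
        error = exc
    return time.monotonic() - start, error
```

For example, `time_first_read("/afs/example.org/vol1/bigfile")`, where the
path is made up for illustration, would block for as long as the client
takes to fail over or give up, and the first element of the result is that
delay in seconds.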

If there is another read-only server, I would like it to switch after
10-15 seconds at most. Maybe there is some reason that's a bad idea; if
so, I'd like to discuss it.

BTW, this was on a Linux client running -current on 2.4.18pre7, file+vol
server srv1 with the same kernel and AFS, and file+vol server srv2
running an older (around June 2001) file+vol server on Solaris.

-- Nathan

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-4841
Computing Services                       Fax: (573) 341-4216