[OpenAFS] Regular service interruptions on a LAN

Marek Szuba scriptkiddie@wp.pl
Wed, 17 Nov 2004 00:16:50 +0100


Hello,

In our labs we have got a small AFS+Kerberos cell of twenty-odd
workstations and one server, all of which run Debian Linux (kernel
2.4.27) and its packaged OpenAFS (version 1.2.11 AFAIR). Basically, the
system works: users can log in from everywhere they are allowed to and
still gain access to their AFS home directories. However, all the AFS
clients, seemingly including the one running on the server itself, tend
to either temporarily lock up or lose connection to the server once in a
while.

The problem seems to be related to write operations on AFS, as the
apparent trigger of this behaviour is increased write activity such as
starting up a complex window manager (e.g. the one from KDE) or a large
application (e.g. OpenOffice), downloading a large file or, in the case
of the server (rarely, but nevertheless possibly), a large number of
users being active at the same time. When that happens, whatever app
caused the problem just sits there waiting for a while, then (usually,
depending on the app) times out the file operation. On workstations the
situation is usually accompanied by two messages, one after another,
from afsd on the console stating that both IPs of the server have gone
down; on the server there is no such message. After a couple of minutes
the connection is restored (if that indeed is the problem, but that's
what afsd says on workstations) and everything works as before (until
next time, that is), but by then the I/O time-outs have usually kicked
in and whatever app it was that triggered the problem has already died.

Other potentially useful bits of information (if you need to know
anything more, by all means ask):
 - neither low-intensity read/write activity (for sure) nor
high-intensity read activity (AFAIR) triggers the problem: it is for
instance possible to log in in text mode as many times as one wants. Of
course on workstations that doesn't apply to the period when the
connection has already been declared down, as during that time even
shell logon sets one's $HOME to / due to the real home dir being
inaccessible.
 - decreasing the cache size on clients has made the problem less
frequent, but it has not gone away.
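
For anyone who wants to try reproducing the stall without starting KDE
or OpenOffice, here is a minimal sketch in Python. It rests on the
assumption that sustained, synced writes alone are enough to trigger the
problem, which I have not verified; the function name timed_write, the
default sizes and the two-second threshold are made up for illustration.

```python
import os
import time


def timed_write(path, total_bytes=64 * 1024 * 1024, chunk_size=1024 * 1024,
                stall_threshold=2.0):
    """Write total_bytes of zeros to path in chunks.

    Returns a list of (offset, seconds) pairs for every chunk whose
    write + fsync took at least stall_threshold seconds.
    All defaults are illustrative assumptions, not tuned values.
    """
    chunk = b"\0" * chunk_size
    stalls = []
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_size, total_bytes - written)
            start = time.monotonic()
            f.write(chunk[:n])
            f.flush()
            # Ask the OS to push the data out rather than leave it buffered.
            os.fsync(f.fileno())
            elapsed = time.monotonic() - start
            if elapsed >= stall_threshold:
                stalls.append((written, elapsed))
            written += n
    return stalls
```

Calling timed_write() on a scratch file under an AFS home directory and
again on a local path such as one under /tmp should show whether the
slow chunks appear only on AFS. If the file operation times out
outright, the call raises OSError instead of returning.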

I have worked as a user in much larger AFS cells and have never
experienced such behaviour, so obviously something is wrong here. Like
I said, ask if you need any more information. Help will be appreciated!

Best regards,
-- 
Marek Szuba