[OpenAFS] Help: intermittent fileservice hangs

cball@bu.edu
Tue, 27 Jan 2004 12:44:20 -0500 (EST)


Over the past weekend we had numerous, intermittent AFS access problems.

The symptom on a directly connected login was typically an access delay
of up to 5 minutes, after which the pending access (read or write)
would complete successfully.  These hangs cleared at approximately the
same time across all AFS clients.  Initially this occurred a couple of
times an hour; at one point it was a near-continuous cycle of hang,
release, hang ...
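
We didn't capture cache manager state while a client was hung; in
hindsight, a couple of standard checks might have localized it (the
client hostname below is made up, and output formats vary by release):

	fs checkservers                       # on the hung client
	cmdebug hung-client.bu.edu -port 7001 # locked cache entries on that client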

The problems occurred for volumes on several fileservers.  When they
occurred, access to volumes on different servers was often affected as
well.  For example, given:

	/afs/bu/data	    readonly on afs-fs1 & afs-fsro
	/afs/bu/user/user1  on afs-fs2
	/afs/bu/user/user2  on afs-fs3

There were times when /afs/bu/data and /afs/bu/user/user1 were not
accessible while /afs/bu/user/user2 was accessible on a client system;
at the same time, on a different client, .../user1 and .../user2 were
accessible while .../data was not.
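
(For anyone mapping paths to servers in a similar layout, fs whereis
reports the host actually serving each file:)

	fs whereis /afs/bu/data /afs/bu/user/user1 /afs/bu/user/user2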

There were rare times when /afs wasn't accessible.  Putting a root.cell
clone on all fileservers appeared to alleviate the /afs access problem.
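
For reference, replicating root.cell to an additional fileserver is
just the following (the partition name here is illustrative):

	vos addsite -server afs-fs2 -partition /vicepa -id root.cell
	vos release -id root.cell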

Read-only and read-write volumes were affected.



	Background and (temporary?) resolution

BU's web space is primarily served via AFS volumes which are mounted
read-write below the WWW root directory.  Switching the WWW root into
"maintenance mode" (an alternate root directory volume with read-only
mounts) solved the problem.  On Monday the original configuration was
restored; we've now gone 30 hours without a relapse.
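
For anyone unfamiliar with the two mount styles: as I understand it, a
regular mount point lets the cache manager follow read-only replicas
where they exist along a read-only path, while -rw pins the mount to
the read-write volume.  Directory and volume names below are made up:

	fs mkmount -dir www-data -vol www.data        # regular mount
	fs mkmount -dir www-data-rw -vol www.data -rw # forced read-write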

System loads were not high on fileservers or clients at any time; if
anything, loads were unusually low.  We've had much higher web service
activity at other times, and we've found no client-side evidence of an
activity spike triggering the problem.

If it's relevant, BU's cell is configured with 3 database servers
running openafs 1.2.6 with the 1.2.11 (ubik) patch, though the buserver
processes were not patched until after the problem(s) subsided.  We're
running a mix of fileservers (mostly openafs-1.2.9, with a couple
running afs3.6 2.45 and afs3.6 2.38).  Hardware is Sun, running
sun4x_57 and sun4x_58 (all with 2+ GB of memory).
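
Since ubik is in the picture: udebug against each database server shows
the current sync site and whether the quorum is stable (the hostname
below is illustrative; 7003 is the vlserver port, 7002 the ptserver):

	udebug db1.bu.edu 7003
	udebug db1.bu.edu 7002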

The fileservers run with default settings.  We have three web servers
which are responsible for most AFS activity.  Two of them start afsd
using the standard "MEDIUM" values and one uses the standard "LARGE"
values (from afs.rc).
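
From memory (double-check your own afs.rc; these vary by release), the
stock definitions are roughly:

	MEDIUM: -stat 2000 -dcache 800  -daemons 3 -volumes 70
	LARGE:  -stat 2800 -dcache 2400 -daemons 5 -volumes 128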

Fileserver and database server logs showed nothing out of the ordinary.
This is a fundamental concern: while different options may be
appropriate, it is quite disturbing to transition into a non-functional
state with nothing in /usr/afs/logs [that I understand] indicating a
problem.
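
(The logs can be pulled remotely for anyone who wants to double-check
us, e.g.:)

	bos getlog -server afs-fs1.bu.edu -file FileLog
	bos getlog -server afs-fs1.bu.edu -file VolserLog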


	Questions

The fileserver has "-d <debuglevel>", "-busyat <n>", "-k <stacksize>",
and "-p <#processes>" options which look relevant here.  Is there a way
to query or log utilization levels, or to get an indication when limits
are exceeded?
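
One thing we do know how to check: rxdebug pointed at the fileserver
port reports, among other things, how many calls are waiting for a
server thread, which (if I read it right) should show when the -p
limit is being hit:

	rxdebug afs-fs1.bu.edu 7000 -noconns  # "N calls waiting for a thread"
	rxdebug afs-fs1.bu.edu 7000 -rxstats  # aggregate rx statistics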

What can or should be monitored to expose (and log) activity levels,
timeouts, etc.?
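
Would tools like scout or xstat_fs_test capture the right counters
here?  E.g. (collection 2 being the full performance data, if I have
that right):

	scout -server afs-fs1 afs-fs2 afs-fs3
	xstat_fs_test -fsname afs-fs1.bu.edu -collID 2 -onceonly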


Charles Ball
Information Technology
Boston University