[OpenAFS] fileserver crashes

John W. Sopko Jr. sopko@cs.unc.edu
Wed, 13 Oct 2004 14:07:48 -0400


Our Linux/AFS 1.2.11 file server has been hanging for the last few weeks.
We have been upgrading machines to Windows XP SP2 and OpenAFS 1.3.x
over the last month or so. Here is one issue I found that was causing
the problem:

We have a user who uses a Windows application called Matlab for
generating and processing hundreds of files in AFS space from a Windows
XP machine. He was running the OpenAFS 1.2.x client. His machine was
upgraded to Service Pack 2 and OpenAFS 1.3.71. Since then, his Matlab
application hangs in Windows and our file server eventually melts down.

I am not an expert at debugging AFS, so let me know if you want me to try
something. I raised the debug level on the FileLog to 25 and could see
that his machine was constantly logging messages like these (the user
name really is debug):

Wed Oct 13 12:15:47 2004 FindClient: authenticating connection: authClass=2
Wed Oct 13 12:15:47 2004 FindClient: rxkad conn:
name=debug,inst=,cell=,exp=1097688735,kvno=8
Wed Oct 13 12:15:47 2004 FindClient: authenticating connection: authClass=2
Wed Oct 13 12:15:47 2004 FindClient: rxkad conn:
name=debug,inst=,cell=,exp=1097690546,kvno=8
Wed Oct 13 12:15:47 2004 SAFS_FetchStatus,  Fid = 1769554818.9542.9001, Host
152.2.128.179, Id 5269
Wed Oct 13 12:15:47 2004 SAFS_FetchStatus returns 0
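
To see which client is generating the load, entries like the ones above
can be tallied per host. The following is only a minimal sketch: it
assumes the call lines look exactly like the excerpt above (an RPC name
starting with SAFS_, then "Fid = ...", then "Host <IP address>"), and the
function name count_calls_by_host is made up for illustration. Other RPCs
may log in a different layout and would not be counted.

```python
import re
from collections import Counter

# Pattern assumed from the FileLog excerpt above; \s+ tolerates the line
# wrap that mail clients insert between "Host" and the IP address.
CALL_RE = re.compile(
    r"(SAFS_\w+),\s+Fid = [\d.]+,\s+Host\s+(\d+\.\d+\.\d+\.\d+)"
)


def count_calls_by_host(log_text):
    """Count logged RPC calls per (host, rpc_name) pair (hypothetical helper)."""
    counts = Counter()
    for rpc_name, host in CALL_RE.findall(log_text):
        counts[(host, rpc_name)] += 1
    return counts


# Sample taken from the excerpt above, including the mail line wrap.
sample = (
    "Wed Oct 13 12:15:47 2004 SAFS_FetchStatus,  Fid = 1769554818.9542.9001, Host\n"
    "152.2.128.179, Id 5269\n"
    "Wed Oct 13 12:15:47 2004 SAFS_FetchStatus returns 0\n"
)

for (host, rpc_name), n in count_calls_by_host(sample).most_common():
    print(host, rpc_name, n)
```

Feeding it the contents of the real FileLog instead of the sample should
show whether a single host accounts for most of the calls.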

I also ran scout. (I ran rxdebug as well, but I do not know how to
interpret the results; they did not look suspicious.) Within scout the
leftmost column shows the number of RPC calls to the server. I restarted
the file server and the number of RPC calls went up dramatically. It hit
9999 in about 2 minutes; after that it just shows *xxx, since the field
is limited to 4 columns, but you can see it constantly counting upward.

This user says he has been running his experiments for the last several
months and did not have a problem until his system was upgraded to
OpenAFS 1.3.71, about the same time our AFS file server problems started
to happen. He left the Matlab application in the hung state; when we
killed it, the number of RPCs in scout went below 100.

By the way, when the file server hangs, our web server, which accesses
data on the file server, also hangs because it cannot access AFS on this
server. A restart of the file server clears the problem, but it comes
back as the Windows client starts hammering the server again.

For those of you having problems, try running scout and check your
RPC count:

scout -server hostname -freq 5

You can specify multiple servers and compare stats.


The following was posted when the 1.3.70 client came out. Is the official
word that the Windows OpenAFS client should not be used with Windows
applications due to this issue?

---------------------------------------------------
OpenAFS installed on the machine.  No gateway mode.
Try opening a 100MB file in Microsoft Word from AFS
and then perform a "Save as ..." to another filename
within AFS.

You will receive a "Delay Writes warning" and then if you are
using Word XP, Word will crash.

Jeffrey Altman

John W. Sopko Jr. wrote:
 > Can you elaborate on "This causes all Microsoft Office
 > applications to have failures when writing to AFS."
 >
 > Do you mean when using the standard AFS Windows client to AFS, or using the
 > AFS Light Gateway to write to AFS on a Windows system? Thanks.
 >
 > Jeffrey Altman wrote:
 >
 >> OpenAFS 1.3.70 has now shipped.
 >> Examples of some of the hard work which is ahead of us:
 >>
 >
 >> * The architecture of the SMB/CIFS server does not allow for sequential
 >>   processing of SMB/CIFS requests.  This prevents us from implementing
 >>   support for digital signing but more importantly breaks applications
 >>   which use overlapped writes.  This causes all Microsoft Office
 >>   applications to have failures when writing to AFS.  I can't think of
 >>   a more important suite of applications which must simply work if AFS
 >>   is truly to be used in a transparent manner from the end user
 >>   experience.
 >

-- 
John W. Sopko Jr.               University of North Carolina
email: sopko AT cs.unc.edu      Computer Science Dept., CB 3175
Phone: 919-962-1844             Sitterson Hall; Room 044
Fax:   919-962-1799             Chapel Hill, NC 27599-3175