[OpenAFS-devel] client stability

Kuba Ober kuba@mareimbrium.org
Fri, 25 May 2001 17:32:53 +0200


Hi,

can some of the developers provide hints about the stability of AFS clients 
on Linux?

I'm somewhat concerned about it, as it is not always possible to cleanly 
unmount /afs. Sometimes it succeeds, but sometimes umount hangs and that's it.

Under 2.4 kernels the failure seems to come very quickly: the client spirals 
down in about 20 minutes.

Under 2.2 kernels it works w/o problems for quite some time. Actually the 
only problem I had with 2.2 kernel client was when the server was rebooted 
(due to reasons not connected with AFS). Any access to /afs returned 
`operation has timed out' error or somesuch, but unmounting just hung 
forever. After killing the session and starting new shell the umount finally 
worked, but then the kernel module was stuck - it had reference count of 2, 
even though there were no afsd daemons alive at that moment. After some more 
time (a minute or two) the module removal finally succeeded, but the machine 
hung without a kernel panic or any other indication of problems. The IP stack 
went dead, and so did the keyboard handler: no `lock' lights were operative, 
and pinging the machine from the net got no return packets. Magic SysRq was 
dead as well (yep, I have it compiled into the kernel).

I don't think there are any problems with the server in my test setup :-)

Any hints?

Right now I'm trying to decide whether to look for something other than AFS 
(being reluctant to go with Coda as its win9x client is ripply and on the 
whole it's AFS-derived), or to get involved in the development and try to 
rectify those problems. I don't have huge kernel hacking experience, but I 
think I'd like to at least try documenting what happens upon unmounting of 
/afs and where the process stalls...

Do any of you have success stories with AFS on RH Linux systems? I'm not 
talking about `ideal' setup where you just /etc/init.d/afs start and never 
touch it. I'm trying to obtain stable operation in a test case where the 
client is brought up via initscript, some file accesses are made, client is 
stopped via the initscript, and again (shell script).
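
The test loop described above can be sketched roughly as follows. This is 
only an illustration, not the exact script used: the function name 
run_cycles, its arguments, and the use of a plain `ls' as the "file access" 
step are assumptions. It prints a line before each step, so if a step hangs, 
the last line printed shows which one it was.

```shell
#!/bin/sh
# Sketch of a start/access/stop stress loop for an AFS client.
# Assumptions (not from the original message): the initscript path, the
# mount point and the cycle count are passed as arguments, and a simple
# directory listing stands in for "some file accesses".

# run_cycles INITSCRIPT MOUNTPOINT COUNT
# Returns non-zero at the first step that fails.
run_cycles() {
    initscript=$1
    mountpoint=$2
    count=$3
    i=1
    while [ "$i" -le "$count" ]; do
        echo "cycle $i: start"
        "$initscript" start || { echo "cycle $i: start FAILED"; return 1; }

        echo "cycle $i: access"
        ls "$mountpoint" >/dev/null || { echo "cycle $i: access FAILED"; return 1; }

        echo "cycle $i: stop"
        "$initscript" stop || { echo "cycle $i: stop FAILED"; return 1; }

        i=$((i + 1))
    done
    echo "completed $count cycles"
}
```

With the setup from this message it would be called as 
`run_cycles /etc/init.d/afs /afs 20'. Note that a hung umount will simply 
block the loop at "stop"; the sketch does not try to time it out.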

On all the systems I've checked (one RH 7.0 client, three RH 7.1 clients), 
the client never survived more than 20 start/stop cycles. All problems 
occurred on client stop, though. If the stop went OK, then the next client 
start was always flawless. All processors on those machines were either PIII 
or Celeron, with 128+ MB of RAM and ample free disk space. Kernels were 
2.2.17 on RH 7.0 and 2.4.2 on RH 7.1.

Cheerz,
Kuba