[OpenAFS-devel] client stability
Kuba Ober
kuba@mareimbrium.org
Fri, 25 May 2001 17:32:53 +0200
Hi,
Can any of the developers comment on the stability of AFS clients on
Linux?
I'm somewhat concerned about it, as it is not always possible to cleanly
unmount /afs.
Sometimes it succeeds, but sometimes umount hangs and that's it.
It looks like under 2.4 kernels this happens very quickly: the client
spirals down in about 20 minutes.
Under 2.2 kernels it works without problems for quite some time. Actually
the only problem I had with the 2.2 kernel client was when the server was
rebooted (for reasons not connected with AFS). Any access to /afs returned
an `operation has timed out' error or some such, but unmounting just hung
forever. After killing the session and starting a new shell the umount
finally worked, but then the kernel module was stuck: it had a reference
count of 2, even though no afsd daemons were alive at that moment. After
some more time (a minute or two) the module removal finally succeeded, but
then the machine hung without a kernel panic or any other indication of
problems. The IP stack went dead, as did the keyboard handler: none of the
`lock' lights responded, and pinging the machine from the net got no return
packets. Magic SysRq was dead as well (yep, I have it compiled into the
kernel).
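For anyone trying to reproduce the stuck module, here is a minimal sketch of
how the reference count could be read back while waiting for the unload. It
assumes the use count is the third whitespace-separated field of a line in
/proc/modules, and the module name `libafs' is my assumption; substitute
whatever lsmod shows on your system.

```python
def module_refcount(name, modules_text):
    """Return the use count of module `name`, or None if it is not listed.

    `modules_text` is the content of /proc/modules. Assumption: each line
    starts with "<name> <size> <use count>" and the count is numeric.
    """
    for line in modules_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == name:
            return int(fields[2])
    return None


# Example call (the module name is hypothetical):
#   with open("/proc/modules") as f:
#       print(module_refcount("libafs", f.read()))
```

Polling this after umount returns would show how long the count stays above
zero before rmmod can succeed.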
I don't think there are any problems with the server in my test setup :-)
Any hints?
Right now I'm trying to decide whether to look for something other than AFS
(I'm reluctant to go with Coda, as its win9x client is ripply and on the
whole it's AFS-derived), or to get involved in the development and try to
rectify those problems. I don't have much kernel hacking experience, but I
think I'd like to at least try documenting what happens when /afs is
unmounted and where the process stalls...
Do any of you have success stories with AFS on RH Linux systems? I'm not
talking about an `ideal' setup where you just run /etc/init.d/afs start and
never touch it. I'm trying to obtain stable operation in a test case where
the client is brought up via the initscript, some file accesses are made, the
client is stopped via the initscript, and the cycle repeats (driven by a
shell script).
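The test cycle described above can be sketched roughly as follows. This is
an illustration in Python rather than the actual shell script; the function
name, the timeout value and the `ls /afs' access step are assumptions chosen
for the example.

```python
import subprocess


def run_cycles(start_cmd, stop_cmd, access_cmd, cycles=20, timeout=120):
    """Repeat start -> access -> stop; return (completed_cycles, failure).

    `failure` is None if every cycle finished, otherwise a short message
    naming the step that hung or exited with a non-zero status.
    """
    steps = (("start", start_cmd), ("access", access_cmd), ("stop", stop_cmd))
    for i in range(1, cycles + 1):
        for label, cmd in steps:
            try:
                result = subprocess.run(cmd, timeout=timeout)
            except subprocess.TimeoutExpired:
                return i - 1, f"cycle {i}: {label} hung for over {timeout}s"
            if result.returncode != 0:
                return i - 1, (
                    f"cycle {i}: {label} exited with status {result.returncode}"
                )
    return cycles, None


# Example call (paths as on my RH systems, run as root):
#   run_cycles(["/etc/init.d/afs", "start"],
#              ["/etc/init.d/afs", "stop"],
#              ["ls", "/afs"])
```

One caveat: if the stop step is stuck in the kernel and cannot be killed, the
timeout itself may never return, so the last line printed before the stall is
still the best indication of where things went wrong.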
On all the systems I've checked (one RH 7.0 client, three RH 7.1 clients),
it never survived more than 20 start/stop cycles. All problems occurred on
client stop, though. If the stop went OK, then the next client start was
always flawless. All processors on those machines were either PIII or
Celeron, with 128+ MB of RAM and ample free disk space. Kernels were 2.2.17
on RH 7.0 and 2.4.2 on RH 7.1.
Cheerz,
Kuba