[OpenAFS] afsd having problems on Linux machines

Garance A Drosihn drosih@rpi.edu
Mon, 4 Nov 2002 15:48:20 -0500


At 1:00 PM -0500 11/4/02, Derrick J Brashear wrote:
>On Mon, 4 Nov 2002, Dr A V Le Blanc wrote:
>
>  > Recently I've seen a couple of machines with problems like
>  > this.  Here are log files from two AFS clients.  They both
>  > had difficulties at the same time on Sunday, when there is
>  > a cron job running.
>  >
>  > What is the '-stat' parameter?
>
>I wouldn't expect this unless you have a lot of files legitimately
>open and you use the default setting (which is I think 300)
>
>In this case it defines the number of vcaches you have; in the
>Linux case that maps directly to a private inode pool.

I hit something similar last week.  We have OpenAFS running on a
machine which is also running Samba.  We see as many as 300 PCs
connected to that machine, with as many as 400 shares mounted.

Last week we had openafs die with:
     Increase -stat parameter of afsd(VLRU cycle?)<1>Unable to
        handle kernel paging request at virtual address ffffffff

At that point we had to reboot the machine to get AFS back.  We
had another similar crash about 24 hours later (and again had to
reboot the machine to get OpenAFS back).  We had OpenAFS configured
to start up with:

XLARGE="-stat 3600 -dcache 3600 -daemons 5 -volumes 196 -files 50000"

but I increased the stat parameter to:

XLARGE="-stat 4200 -dcache 3600 -daemons 5 -volumes 196 -files 50000"

and we haven't seen any problems since then.  I do know that our
network was having problems of its own at about the same time as
the second crash.
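
For anyone comparing their own settings, here is a minimal sketch of
how the two option strings above differ.  It only parses the strings
as text and does not talk to afsd; the helper name parse_afsd_options
is hypothetical, and it assumes every option is a "-flag value" pair
with an integer value, as in the two XLARGE lines.

```python
import shlex


def parse_afsd_options(option_string):
    """Hypothetical helper: turn "-flag value" pairs into a {flag: int} dict."""
    tokens = shlex.split(option_string)
    options = {}
    # Assumption: options come strictly in "-flag value" pairs with integer values.
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        options[flag.lstrip("-")] = int(value)
    return options


old = parse_afsd_options("-stat 3600 -dcache 3600 -daemons 5 -volumes 196 -files 50000")
new = parse_afsd_options("-stat 4200 -dcache 3600 -daemons 5 -volumes 196 -files 50000")

# Keep only the options whose values differ between the two configurations.
changed = {name: (old[name], new[name]) for name in old if old[name] != new[name]}
print(changed)  # {'stat': (3600, 4200)}
```

Only -stat changed, by 600 entries; the other four options are the
same in both configurations.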

This is with OpenAFS 1.2.7 and kernel 2.4.18-10smp from Red Hat.
Does it seem likely that we'd really need -stat bumped up over 3600?
Is there some way I could track how close we're getting to the
number we have AFS start up with?  When we do hit the limit, does it
make sense that we would have to reboot the machine to get things
working again?

It would not surprise me if we're getting hit by some PC users who
are "helpfully" scanning our local cell for viruses, but I'm not
sure of a good way to automatically notice when that is happening.
I have managed to catch a few computer-center staff members who
had done this by mistake.

-- 
Garance Alistair Drosehn            =   gad@gilead.netel.rpi.edu
Senior Systems Programmer           or  gad@freebsd.org
Rensselaer Polytechnic Institute    or  drosih@rpi.edu