[OpenAFS] weird memory problem on i386_linux26 with 1.4.2

Christopher D. Clausen cclausen@acm.org
Mon, 13 Nov 2006 14:25:02 -0600


Okay, so sleepless.acm.uiuc.edu hosts all websites on www.acm.uiuc.edu.
It's Debian sarge on x86, with apache2, mod_php5 (from backports.org),
and Trac running under mod_fastcgi or mod_fcgid depending on whether
it's SSL or not.  It's a dual Xeon 2.0 GHz (hyperthreaded, and
hyperthreading is turned on, which might actually be the problem; I
don't have another HT box to test).  The machine has 1GB of RAM and two
SCSI HDDs, one of them dedicated to the AFS cache.

Every 3 weeks or so, the machine ends up using so much non-pageable
memory that the OOM killer starts whacking processes and, in general,
bad things happen.  Very little if any swap is in use (on the order of
a few MBs).  This can be solved by stopping everything that is
accessing AFS and restarting the AFS client.  It's then fine for
another 3 weeks, and the problem repeats.

We were running 1.4.1 and I just upgraded to 1.4.2 (about three weeks 
ago) and it still has this problem.

I'm currently running the Debian 1.4.2-2 package (backported to sarge)
with the default afsd options for a 14GB cache, and have tried using a
smaller 5GB cache and reducing the afsd parameters with no effect.  The
standard Debian 2.6.8-3-686-smp kernel is in use.  The cache partition
is ext3.  I believe that is safe, right?

This same machine was working fine for over a year as a workstation with 
a much smaller AFS cache (although an admittedly much smaller load as
well), so something about the current setup has broken things.

I'm mostly a Windows guy, so I'm not really sure how to debug this
further or otherwise figure out what is using RAM (although I'm pretty
sure it's somehow afsd).  Running vmstat -m reports some rather large
allocations of certain block sizes, but that's about all I know.  Well,
that and the fact that restarting the AFS client fixes the problem for
another 3 weeks.
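
For reference, the figures vmstat -m prints come from /proc/slabinfo, so
one way to see which kernel caches hold the memory is to rank them by
approximate size.  Below is a minimal sketch in Python 3, assuming the
2.x slabinfo layout (name, active_objs, num_objs, objsize, ...); the
function names are made up for illustration, and num_objs * objsize is
only an estimate that ignores per-slab overhead.  Reading
/proc/slabinfo may require root.

```python
def parse_slabinfo(text):
    """Return (cache_name, approx_bytes) pairs, largest first.

    Assumes the slabinfo 2.x column order:
    name active_objs num_objs objsize ...
    """
    usage = []
    for line in text.splitlines():
        # Skip blank lines, the version banner, and the column header.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            num_objs = int(fields[2])
            objsize = int(fields[3])
        except ValueError:
            continue  # Not a data row in the expected layout.
        usage.append((fields[0], num_objs * objsize))
    usage.sort(key=lambda item: item[1], reverse=True)
    return usage


def top_slab_caches(path="/proc/slabinfo", limit=10):
    """Read the slab table from `path` and return the largest caches."""
    with open(path) as handle:
        return parse_slabinfo(handle.read())[:limit]


if __name__ == "__main__":
    try:
        for name, nbytes in top_slab_caches():
            print(f"{name:<24} {nbytes / 1024:>12.0f} KiB")
    except OSError as err:
        print(f"could not read slab info: {err}")
```

Running this once right after restarting the AFS client and again a week
or two later, then comparing which caches grew, should narrow down where
the memory is going.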

Anyone have any tips on tracking this down?  Or think it might be the 
hyperthreading?

<<CDC
-- 
Christopher D. Clausen
ACM@UIUC SysAdmin