[OpenAFS] afs.GCPAGs in current releases under Linux (RHEL4/5)

Simon Wilkinson sxw@inf.ed.ac.uk
Fri, 5 Mar 2010 09:22:04 +0000


On 5 Mar 2010, at 01:20, Eric.Hagberg@morganstanley.com wrote:

> I've found that if you run a program to generate tokens and pags
> frequently (about once per second), the system CPU time on the
> machine soon begins to swallow performance, though it takes a
> little while to observe. If you do that long enough, the machine
> will eventually grind to a halt. I found that
> this behavior started between openafs 1.4.1 and 1.4.2, where keyring  
> support got enabled. Some experimentation has shown that the problem  
> is related to the effective disabling of pag garbage collection when  
> keyring support is compiled in.

I've put this in RT as #126669

> Interestingly, just changing the bit of code to allow openafs w/  
> keyring support to do pag GC makes the problem go away, in that you  
> don't get system time spikes/growing forever while afs.GCPAGs=1, but  
> switching to afs.GCPAGs=0 makes the problem come back. So something  
> about keyrings isn't really doing everything it should be if pag GC  
> can make things better.

There's obviously something going awry here. In theory, you don't need  
to garbage collect keyring PAGs, because the keyrings are reference  
counted by the kernel, and our destructor is called when the keyring  
goes away. However, there are a number of known problems with this in
1.4, in particular races between setting up the group information and
establishing the keyring.
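
To illustrate the reasoning, here is a toy Python model of a reference-counted keyring whose destructor releases the PAG entry. It is not OpenAFS or kernel code: `Keyring`, `PagTable` and `simulate` are hypothetical names, and the "leak" simply stands in for any path (such as a race) where the final reference is never dropped.

```python
class Keyring:
    """Toy stand-in for a reference-counted kernel keyring (hypothetical)."""

    def __init__(self, pag_id, on_destroy):
        self.pag_id = pag_id
        self.refs = 1  # the creator holds the first reference
        self._on_destroy = on_destroy

    def put(self):
        """Drop one reference; run the destructor when the last one goes."""
        self.refs -= 1
        if self.refs == 0:
            self._on_destroy(self.pag_id)


class PagTable:
    """Toy stand-in for the per-PAG state the cache manager keeps."""

    def __init__(self):
        self.entries = {}

    def new_pag(self, pag_id):
        self.entries[pag_id] = "tokens"
        return Keyring(pag_id, self._release)

    def _release(self, pag_id):
        # Destructor callback: the PAG's state is freed with its keyring.
        self.entries.pop(pag_id, None)


def simulate(n_pags, leak_every=0):
    """Create n_pags PAGs; every leak_every-th one never drops its reference.

    Returns the number of PAG entries still held at the end.
    """
    table = PagTable()
    for i in range(n_pags):
        keyring = table.new_pag(i)
        leaked = leak_every and (i + 1) % leak_every == 0
        if not leaked:
            keyring.put()
    return len(table.entries)
```

With no leaks, `simulate(10)` leaves nothing behind, so no separate garbage collector is needed. With `simulate(10, 5)`, two entries are never released, and only a separate sweep could reclaim them.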

Things are quite different in 1.5 - keyrings are the authoritative  
source of PAG information. If you have time, it would be great if you  
could do the same tests with 1.5, and see if you experience similar  
problems.
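
For anyone repeating the test, a rough Python harness along these lines might help compare releases. It is only a sketch: it assumes a Linux `/proc/stat`, and the command passed to `run` is whatever creates a new PAG and tokens on your system (the `pagsh -c aklog` form mentioned below is an assumption, not something taken from the original report). The function names are made up for illustration.

```python
import subprocess
import time


def parse_cpu_line(line):
    """Return (system_jiffies, total_jiffies) from the aggregate 'cpu' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    values = [int(v) for v in fields[1:]]
    # Fields are user, nice, system, idle, iowait, irq, softirq, steal.
    # Later fields (guest time) are already included in user, so skip them.
    return values[2], sum(values[:8])


def system_share(before, after):
    """Fraction of elapsed CPU time spent in system mode between two samples."""
    d_system = after[0] - before[0]
    d_total = after[1] - before[1]
    return d_system / d_total if d_total > 0 else 0.0


def read_cpu():
    """Sample the aggregate CPU counters from /proc/stat."""
    with open("/proc/stat") as stat:
        return parse_cpu_line(stat.readline())


def run(command, iterations, interval=1.0, report_every=60):
    """Run command once per interval and print the system-time share."""
    before = read_cpu()
    for i in range(1, iterations + 1):
        subprocess.call(command)  # assumed to create a new PAG and tokens
        time.sleep(interval)
        if i % report_every == 0:
            after = read_cpu()
            print("%d runs: system share %.1f%%"
                  % (i, 100 * system_share(before, after)))
            before = after
```

Something like `run(["pagsh", "-c", "aklog"], iterations=3600)` would then approximate the once-per-second case. A system share that keeps climbing between reports would match the behaviour described above.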

S.