[OpenAFS] Re: httpd blocked

Berthold Cogel cogel@uni-koeln.de
Tue, 28 Jun 2011 17:35:27 +0200


On 23.06.2011 17:44, Andrew Deason wrote:
> On Mon, 20 Jun 2011 16:19:01 -0700
> Jonathan Nilsson <jnilsson@uci.edu> wrote:
> 
>> i suspect that the system kept spawning httpd processes as old ones
>> got blocked and eventually it ran out of memory and became
>> unresponsive. after a reboot it works fine. so the question is, what
>> caused the afs cache manager to respond so slow?
>>
>> can anyone confirm if they have seen kernel messages like this? how
>> can i confirm if the problem is with the client or the server? i see
>> no error messages in BosLog, FileLog, or VolserLog on our servers...
> 
> If the processes were hanging forever or for a very long time, it's not
> likely to be the fault of any server, since the client doesn't wait
> around forever for a response. I assume there were no messages about
> losing contact with file or vl servers in the client logs around that
> time?
> 
> It's easier to see what's going on if we know what's going on with the
> rest of the system when that happens. If you ever catch it doing that,
> running 'echo t > /proc/sysrq-trigger' will generate a lot of info (some
> of it useful) in syslog. Or if you can get the machine to dump core,
> that's the most useful thing, but you don't want to just go giving that
> out to anybody.
> 

We see ghosts like this from time to time on our web servers (RHEL5 on
VMware), but we don't get the kernel messages, only the usual 'lost
contact with file server' messages. It is evidently not a problem with
the server, because we can cd into the same path on other client systems.
Today we had a load of about 250 on a 2-CPU VM. Running 'fs flushvolume'
from the root down the data tree for this web server fixed the problem.
What happens is that Apache tries to deliver a file and runs into AFS
timeouts. This hits one process after another, and new processes keep
being forked until the internal limit of 256 Apache processes is
reached.
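
The flush could be scripted roughly as follows. This is only a sketch
under assumptions: it reads "from the root along the tree" as flushing
every directory on the way from a starting directory down to the
document root, the helper names paths_along and flush_along are made up
for illustration, and the example paths are hypothetical. The only
external command used is 'fs flushvolume -path <dir>'.

```python
import posixpath
import subprocess


def paths_along(root, leaf):
    """Return every directory from root down to leaf, root first."""
    root = posixpath.normpath(root)
    leaf = posixpath.normpath(leaf)
    if leaf != root and not leaf.startswith(root.rstrip("/") + "/"):
        raise ValueError("leaf is not below root")
    paths = [root]
    current = root
    rest = leaf[len(root):].strip("/")
    for part in (rest.split("/") if rest else []):
        current = posixpath.join(current, part)
        paths.append(current)
    return paths


def flush_along(root, leaf, runner=subprocess.call):
    """Run 'fs flushvolume -path <dir>' for each directory on the way."""
    for path in paths_along(root, leaf):
        # runner is injectable so the command list can be inspected
        # without touching a real AFS client.
        runner(["fs", "flushvolume", "-path", path])
```

For example, flush_along("/afs/example.org", "/afs/example.org/www/site1")
would flush the volumes holding /afs/example.org, /afs/example.org/www
and /afs/example.org/www/site1 in that order (hypothetical paths).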

We don't know how to debug the problem yet.
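
One possible starting point, purely as a sketch: since the Linux load
average also counts tasks in uninterruptible sleep (state D), counting
httpd processes per state would at least show how many are stuck in the
kernel at a given moment. The function name count_states is made up for
illustration, and it assumes input in the form produced by
'ps -eo stat=,comm='.

```python
from collections import Counter


def count_states(ps_output, command="httpd"):
    """Count processes named `command` by the first letter of their state.

    Expects lines of the form "<stat> <comm>", as printed by
    'ps -eo stat=,comm='. The first character of <stat> is the main
    process state (R, S, D, Z, ...); any modifiers after it are ignored.
    """
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        stat, comm = fields
        if comm.strip() == command:
            counts[stat[0]] += 1
    return counts
```

Feeding it the output of subprocess.check_output(["ps", "-eo",
"stat=,comm="]).decode() every few seconds and logging the "D" count
would show whether the blocked processes pile up gradually or all at
once.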

On some machines we're running up to 30 Apache instances for smaller
websites. Each instance is wrapped with kauth and a separate srvtab to
isolate the Apaches from each other in AFS, and each Apache runs as a
different user.
I don't know whether this causes problems with the token handling in the
kernel. The kernel keyring size has not been a problem since RHEL 5.5.


Regards
Berthold Cogel

-- 
Dipl. Chem. Dr. Berthold Cogel           University of Cologne
E-Mail: cogel@uni-koeln.de               Regionales Rechenzentrum (RRZK)
Tel.:   +49(0)221/470-7873               Robert-Koch-Str. 10
FAX:    +49(0)221/478-86845              D-50931 Cologne - Germany