[OpenAFS-devel] System lockup with do_IRQ: stack overflow

Deon George deon@wurley.net
Sat, 24 Feb 2007 12:12:29 +1000


Hey Guys,

A couple of months ago, I did an upgrade from a Celeron processor to an AMD
Athlon(tm) 64 Processor 3000+, and at the same time went from OpenAFS 1.4.1 to
1.4.2 - as well as FC5 to FC6.

Since the upgrade, my server frequently locked up under heavy write I/O - and
if I was able to catch it before the screen blanked, I would see

do_IRQ: stack overflow <number>
I think the number was different each time (I didn't write it down).

I've spent quite a bit of time trying to see where the problem is and testing
the server with components added/removed - and I'm confident that it is AFS
(client) that is causing the problem.

My server had xen/vmware, zaptel, openswan and afs installed (custom kernel
modules on a base FC6), so I've removed openswan, and changed from xen to
vmware server. (With no relief from the problem.)

I even borrowed an AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ and tried on
it, but it showed the same problem. (So it isn't a hardware problem.)

Because of the frequent lockups on the host, and knowing that it locked up due
to high write activity, I moved my write activity to a VMware virtual machine
with only the OpenAFS client (i.e. no AFS server, no zaptel, no openswan and no
xen/vmware server components) and I can consistently make the guest lock up
under heavy I/O - and when it does it will show do_IRQ: stack overflow on the
console (and nothing else).

(I didn't have this problem when I was on FC5 and OpenAFS 1.4.1 - but I might
not have pushed it as hard either.)

I changed my server to RHEL5 (beta 2) to see if it also exhibited the problem
(to remove FC6 as a cause), and it still exhibits the problem (both on the
host and as a VMware guest).

My AFS server is the host, and without an AFS client on it, it runs happily
for days, even if the guests are doing hard write I/O (even to the AFS before
they lock up).

Has anybody else seen this problem?

I don't know how to fix it myself, but I'll happily help/test things if you
think you know what it is.

--