[OpenAFS] OpenAFS 1.3.87 and 1.4.0-rc6 stability issues on Solaris 10
chas williams - CONTRACTOR
chas@cmf.nrl.navy.mil
Tue, 11 Oct 2005 14:51:17 -0400
In message <20051011181108.A17698@ccdevli1.in2p3.fr>,Loic Tortay writes:
>Hello,
>Specifically, the problem happens when running the "svcs -p" command
>...
>About one time out of three, the system will panic immediately.
i seem to get it nearly every time. analysis follows. i had two traps
at adjacent instructions, contract_process_status+0x126 and
contract_process_status+0x129.
[kern.notice] BAD TRAP: type=e (#pf Page fault) rp=d8febde4 addr=fffff83c
[kern.notice]
[kern.notice] svcs:
[kern.notice] #pf Page fault
[kern.notice] Bad kernel fault at addr=0xfffff83c
[kern.notice] pid=734, pc=0xfe8e9e93, sp=0xd3d9e940, eflags=0x10286
[kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6d8<xmme,fxsr,pge,mce,pse,de>
[kern.notice] cr2: fffff83c cr3: 28a2e000
[kern.notice] gs: d90201b0 fs: fe8c0000 es: d9020160 ds: 160
[kern.notice] edi: 9 esi: d3d9e940 ebp: d8febe50 esp: d8febe14
[kern.notice] ebx: e edx: d3d9ea5c ecx: fffff7d8 eax: fffff7d8
[kern.notice] trp: e err: 0 eip: fe8e9e93 cs: 158
[kern.notice] efl: 10286 usp: d3d9e940 ss: d91128c8
[kern.notice]
[kern.notice] BAD TRAP: type=e (#pf Page fault) rp=d8ec8de4 addr=4 occurred in module "ge
[kern.notice]
[kern.notice] svcs:
[kern.notice] #pf Page fault
[kern.notice] Bad kernel fault at addr=0x4
[kern.notice] pid=756, pc=0xfe8e9e96, sp=0xd42ac300, eflags=0x10282
[kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6d8<xmme,fxsr,pge,mce,pse,de>
[kern.notice] cr2: 4 cr3: 3fe06000
[kern.notice] gs: d90201b0 fs: fe8c0000 es: d9020160 ds: 160
[kern.notice] edi: 9 esi: d42ac300 ebp: d8ec8e50 esp: d8ec8e14
[kern.notice] ebx: e edx: d42ac41c ecx: 0 eax: d8b9eaf4
[kern.notice] trp: e err: 0 eip: fe8e9e96 cs: 158
[kern.notice] efl: 10282 usp: d42ac300 ss: d91128c8
disassembly around this location:
contract_process_status+0x11d: addl $0x4,%esp
contract_process_status+0x120: testl %eax,%eax
contract_process_status+0x122: je +0x25 <contract_process_status+0x147>
contract_process_status+0x124: xorl %edi,%edi
contract_process_status+0x126: movl 0x64(%eax),%ecx
contract_process_status+0x129: movl 0x4(%ecx),%ecx
contract_process_status+0x12c: movl -0x18(%ebp),%edx
contract_process_status+0x12f: movl %ecx,(%edx,%edi,4)
contract_process_status+0x132: incl %edi
contract_process_status+0x133: pushl %eax
contract_process_status+0x134: leal 0x114(%esi),%eax
contract_process_status+0x13a: pushl %eax
contract_process_status+0x13b: call -0x26f12 <list_next>
this seems to roughly correspond to this location in common/contract/process.c:
    contract_status_common(ct, zone, status, model);
    for (loc = 0, cnext = list_head(&ctp->conp_inherited); cnext;
        cnext = list_next(&ctp->conp_inherited, cnext))    <<<<<
            ctids[loc++] = cnext->ct_id;
    ASSERT(loc == nctids);
    for (loc = 0, pnext = list_head(&ctp->conp_members); pnext;
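for reference, the loops above walk an intrusive list: list_head() and
list_next() return object pointers that are derived from an embedded link
node by subtracting a fixed offset. a minimal sketch of that idea follows.
it is a simplified model, not the Solaris implementation, and every name
in it (ilist_t, node_t, item_t, collect_ids) is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of an offset-based intrusive list (hypothetical names). */
typedef struct node { struct node *next, *prev; } node_t;
typedef struct { size_t offset; node_t head; } ilist_t;

/* offset = position of the embedded node_t inside each listed object. */
static void ilist_create(ilist_t *l, size_t offset)
{
    l->offset = offset;
    l->head.next = l->head.prev = &l->head;   /* empty list points at itself */
}

static void ilist_insert_tail(ilist_t *l, void *obj)
{
    assert(obj != NULL);
    node_t *n = (node_t *)((char *)obj + l->offset);
    n->prev = l->head.prev;
    n->next = &l->head;
    l->head.prev->next = n;
    l->head.prev = n;
}

/* Only the head sentinel maps to NULL; any other node becomes node - offset. */
static void *ilist_object(ilist_t *l, node_t *n)
{
    return (n == &l->head) ? NULL : (void *)((char *)n - l->offset);
}

static void *ilist_head(ilist_t *l)
{
    return ilist_object(l, l->head.next);
}

static void *ilist_next(ilist_t *l, void *obj)
{
    node_t *n = (node_t *)((char *)obj + l->offset);
    return ilist_object(l, n->next);
}

/* Hypothetical element type standing in for the objects on the list. */
typedef struct item { int id; node_t link; } item_t;

/* Same shape as the loop in the excerpt: copy each element's id into out[]. */
static int collect_ids(ilist_t *l, int *out, int max)
{
    int loc = 0;
    for (item_t *it = ilist_head(l); it != NULL && loc < max;
        it = ilist_next(l, it))
            out[loc++] = it->id;
    return loc;
}
```

in this model a link pointer that is zero or stale does not come back as
NULL. it comes back as (pointer - offset), and the loop then dereferences
it. for what it is worth, eax in the first trap, fffff7d8, is -0x828 as a
signed 32-bit value, which is the kind of result a zero link pointer would
give if the link sat 0x828 bytes into the object. that offset is a guess
and nothing in the dumps confirms it.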
in the second crash, %ecx is 0, but if it came from 0x64(%eax)
i would think it should have a value of fecaddba?
> 0xd8b9eaf0+0x64 ::dump
            0 1 2 3 \/ 5 6 7  8 9 a b  c d e f  0123v56789abcdef
d8b9eb50:  18000000 fecaddba 00000000 e0edc0d8  ................
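a quick check of the address arithmetic, as a sketch. it assumes the
register values in the two trap dumps are the ones the faulting loads used,
and that ::dump prints bytes in memory order on this little-endian x86 box.
the helper names (ea32, le32, word_at) are made up for illustration. note
that the second trap shows eax as d8b9eaf4, while the ::dump above started
from d8b9eaf0.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of disp(%reg) on a 32-bit machine (wraps mod 2^32). */
static uint32_t ea32(uint32_t base, uint32_t disp)
{
    return base + disp;
}

/* Assemble a 32-bit little-endian word from four bytes in memory order. */
static uint32_t le32(const uint8_t *b)
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
        ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* The 16 bytes shown for d8b9eb50 in the ::dump output, in memory order. */
#define ROW_BASE 0xd8b9eb50u
static const uint8_t row[16] = {
    0x18, 0x00, 0x00, 0x00,  0xfe, 0xca, 0xdd, 0xba,
    0x00, 0x00, 0x00, 0x00,  0xe0, 0xed, 0xc0, 0xd8
};

/* 32-bit word stored at addr; addr must be 4-aligned and inside the row. */
static uint32_t word_at(uint32_t addr)
{
    assert((addr & 3u) == 0 && addr >= ROW_BASE && addr - ROW_BASE < 16u);
    return le32(&row[addr - ROW_BASE]);
}
```

with these, ea32(0xfffff7d8, 0x64) is fffff83c, the fault address of the
first trap at +0x126. for the second trap, ea32(0xd8b9eaf4, 0x64) is
d8b9eb58, and word_at(0xd8b9eb58) is 0, which would match ecx being 0 and
the fault at addr=4 from the load at +0x129. starting from d8b9eaf0 instead
gives d8b9eb54, where the bytes fe ca dd ba read as the word baddcafe. that
looks like the kmem debugging fill pattern for uninitialized memory, though
nothing here confirms which structure this is.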
when i see strange behavior like this i usually think stack/heap
corruption. but i am not so sure this is the case. anyway, it has
me puzzled. i posted this in hopes that someone else might have an idea.