[OpenAFS] OpenAFS 1.3.87 and 1.4.0-rc6 stability issues on Solaris 10

Tue, 11 Oct 2005 14:51:17 -0400

In message <20051011181108.A17698@ccdevli1.in2p3.fr>,Loic Tortay writes:
>Hello,
>Specifically, the problem happens when running the "svcs -p" command
>...
>About one time out of three, the system will panic immediately.

i seem to get it nearly every time.  analysis follows.  i had two traps
at adjacent instructions, contract_process_status+0x126 and
contract_process_status+0x129.

	[kern.notice] BAD TRAP: type=e (#pf Page fault) rp=d8febde4 addr=fffff83c
	[kern.notice]
	[kern.notice] svcs:
	[kern.notice] #pf Page fault
	[kern.notice] Bad kernel fault at addr=0xfffff83c
	[kern.notice] pid=734, pc=0xfe8e9e93, sp=0xd3d9e940, eflags=0x10286
	[kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6d8<xmme,fxsr,pge,mce,pse,de>
	[kern.notice] cr2: fffff83c cr3: 28a2e000
	[kern.notice]    gs: d90201b0  fs: fe8c0000  es: d9020160  ds:      160
	[kern.notice]   edi:        9 esi: d3d9e940 ebp: d8febe50 esp: d8febe14
	[kern.notice]   ebx:        e edx: d3d9ea5c ecx: fffff7d8 eax: fffff7d8
	[kern.notice]   trp:        e err:        0 eip: fe8e9e93  cs:      158
	[kern.notice]   efl:    10286 usp: d3d9e940  ss: d91128c8
	[kern.notice]

	[kern.notice] BAD TRAP: type=e (#pf Page fault) rp=d8ec8de4 addr=4 occurred in module "ge
	[kern.notice]
	[kern.notice] svcs:
	[kern.notice] #pf Page fault
	[kern.notice] Bad kernel fault at addr=0x4
	[kern.notice] pid=756, pc=0xfe8e9e96, sp=0xd42ac300, eflags=0x10282
	[kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6d8<xmme,fxsr,pge,mce,pse,de>
	[kern.notice] cr2: 4 cr3: 3fe06000
	[kern.notice]    gs: d90201b0  fs: fe8c0000  es: d9020160  ds:      160
	[kern.notice]   edi:        9 esi: d42ac300 ebp: d8ec8e50 esp: d8ec8e14
	[kern.notice]   ebx:        e edx: d42ac41c ecx:        0 eax: d8b9eaf4
	[kern.notice]   trp:        e err:        0 eip: fe8e9e96  cs:      158
	[kern.notice]   efl:    10282 usp: d42ac300  ss: d91128c8

disassembly around this location:

	contract_process_status+0x11d:  addl   $0x4,%esp
	contract_process_status+0x120:  testl  %eax,%eax
	contract_process_status+0x122:  je     +0x25    <contract_process_status+0x147>
	contract_process_status+0x124:  xorl   %edi,%edi
	contract_process_status+0x126:  movl   0x64(%eax),%ecx
	contract_process_status+0x129:  movl   0x4(%ecx),%ecx
	contract_process_status+0x12c:  movl   -0x18(%ebp),%edx
	contract_process_status+0x12f:  movl   %ecx,(%edx,%edi,4)
	contract_process_status+0x132:  incl   %edi
	contract_process_status+0x133:  pushl  %eax
	contract_process_status+0x134:  leal   0x114(%esi),%eax
	contract_process_status+0x13a:  pushl  %eax
	contract_process_status+0x13b:  call   -0x26f12 <list_next>
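
as a sanity check on the disassembly above, here is a minimal sketch that recomputes the faulting address of each trap from the registers in the panic output.  it assumes pc=0xfe8e9e93 is the +0x126 instruction and pc=0xfe8e9e96 is the +0x129 instruction (the two pcs differ by 3, as do the offsets).  the helper names are made up for illustration.

```c
#include <stdint.h>

/*
 * effective address of a "movl disp(%base),%reg" load on 32-bit x86:
 * base register plus displacement, wrapping at 32 bits.
 */
static uint32_t
effective_addr(uint32_t base, uint32_t disp)
{
	return (base + disp);
}

/*
 * offset of a pc within the function, given another pc in the same
 * function whose offset is already known (hypothetical helper).
 */
static uint32_t
offset_in_func(uint32_t pc, uint32_t known_pc, uint32_t known_off)
{
	return (pc - known_pc + known_off);
}
```

with the values from the two traps, effective_addr(0xfffff7d8, 0x64) gives 0xfffff83c, which matches cr2 in the first trap (the load through %eax at +0x126), and effective_addr(0, 0x4) gives 0x4, which matches cr2 in the second trap (the load through %ecx at +0x129).  so in both cases the registers and the fault addresses are at least consistent with the disassembly.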

this seems to roughly correspond to this location in common/contract/process.c:

	contract_status_common(ct, zone, status, model);
	for (loc = 0, cnext = list_head(&ctp->conp_inherited); cnext;
	    cnext = list_next(&ctp->conp_inherited, cnext))			<<<<<
		ctids[loc++] = cnext->ct_id;
	ASSERT(loc == nctids);
	for (loc = 0, pnext = list_head(&ctp->conp_members); pnext;

in the second crash, %ecx is 0, but if it was loaded from 0x64(%eax)
i would expect it to hold fecaddba?

	> 0xd8b9eaf0+0x64 ::dump
		    0 1 2 3 \/ 5 6 7  8 9 a b  c d e f  0123v56789abcdef
	d8b9eb50:  18000000 fecaddba 00000000 e0edc0d8  ................
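
one thing that may be worth double-checking here is the address arithmetic.  the register dump for the second trap shows eax as d8b9eaf4, while the ::dump command above starts from d8b9eaf0.  the sketch below is only an illustration: it copies the 16 bytes from the dump line into an array and decodes the 32-bit little-endian word at each candidate address.  it assumes ::dump prints bytes in memory order and that %eax did not change between +0x126 and +0x129 (the disassembly shows no write to it in between).  the names le32, load32 and dump_line are made up.

```c
#include <stdint.h>

/* assemble a 32-bit little-endian value from four bytes in memory order */
static uint32_t
le32(const uint8_t b[4])
{
	return ((uint32_t)b[0] | (uint32_t)b[1] << 8 |
	    (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24);
}

/* the 16 bytes shown on the ::dump line that starts at 0xd8b9eb50 */
#define	DUMP_BASE	0xd8b9eb50u
static const uint8_t dump_line[16] = {
	0x18, 0x00, 0x00, 0x00, 0xfe, 0xca, 0xdd, 0xba,
	0x00, 0x00, 0x00, 0x00, 0xe0, 0xed, 0xc0, 0xd8
};

/* value a 32-bit load would see; addr must lie inside the dumped line */
static uint32_t
load32(uint32_t addr)
{
	return (le32(&dump_line[addr - DUMP_BASE]));
}
```

load32(0xd8b9eaf0 + 0x64) reads the word at d8b9eb54, which decodes to 0xbaddcafe (the bytes printed as fecaddba).  load32(0xd8b9eaf4 + 0x64) reads the word at d8b9eb58, which is 0.  so if the eax value from the trap is the right base, a zero in %ecx would be what the dump predicts.  as an aside, 0xbaddcafe is the pattern solaris kmem debugging writes into uninitialized buffers, which might be a hint, though the trace alone does not prove that is where it came from.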

when i see strange behavior like this i usually suspect stack or heap
corruption, but i am not so sure that is the case here.  anyway, it has
me puzzled.  i am posting this in the hope that someone else has an idea.