[OpenAFS-devel] trying to track down a cm hang/lockup...

Neulinger, Nathan nneul@umr.edu
Fri, 12 Jul 2002 15:36:25 -0500


Tracing it out by hand with symbol list gets me:

__read_lock_failed
cdput
__user_walk
getname
dput
vcache2inode    (libafs)
sock_recvmsg
follow_down
path_release
d_lookup

I don't have much more info though unfortunately.

(If one of the core developers is handy with kdb and would be willing to
look around at some point - I've got these machines on serial
consoles... Just got to rebuild kernel with kdb support first. Don't
know how much of an impact that has though.  We've got three in a
checked rotation, so I can leave one in the hung state for a while if
need be.)

Other two are still running, will perform same checks on them to see if
it traces to the same problem.

-- Nathan

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-4841
Computing Services                       Fax: (573) 341-4216


> -----Original Message-----
> From: Neulinger, Nathan
> Sent: Friday, July 12, 2002 3:19 PM
> To: OpenAFS-Devel Mailing List (E-mail)
> Subject: RE: [OpenAFS-devel] trying to track down a cm hang/lockup...
>
>
> Well, one of them just crashed again... Looks to me like whatever is
> crashing is enough to completely lock the machine, not just AFS. There
> was no oops. I've yet to be able to get a useful trace out of it...
> Still looking over it though... Based on the symbol offsets, it looks
> to me like it is somewhere in d_lookup.
>
> Interesting, repeatedly hitting Alt-SysRq-P has it bouncing around to
> different addresses, but all within d_lookup. Could there be something
> that the cache manager corrupted that would be causing the kernel to
> spin in d_lookup?
>
> I swear, even if it forces me to look at assembly, kdb is going in my
> next kernel build.
>
> It's this section of the disassembled d_lookup:
>
>      ad3:       8b 1c 24                mov    (%esp,1),%ebx
>      ad6:       83 eb 10                sub    $0x10,%ebx
>      ad9:       39 2c 24                cmp    %ebp,(%esp,1)
>      adc:       0f 84 ae 00 00 00       je     b90 <d_lookup+0x120>
>      ae2:       8b 04 24                mov    (%esp,1),%eax
>      ae5:       8b 54 24 08             mov    0x8(%esp,1),%edx
>      ae9:       8b 00                   mov    (%eax),%eax
>      aeb:       89 04 24                mov    %eax,(%esp,1)
>      aee:       39 53 44                cmp    %edx,0x44(%ebx)
>      af1:       75 e0                   jne    ad3 <d_lookup+0x63>
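
For readers following the disassembly: the loop above reads like a walk over a circular hash chain, where the pointer at (%esp) is the current list node, %ebp is the chain head, and 0x44(%ebx) is a field compared against the hash being looked up. That reading is an assumption drawn from the instruction pattern, not something confirmed in this thread. Under that assumption, a chain corrupted so that it no longer leads back to the head would keep the loop spinning with no oops, which would be consistent with the Alt-SysRq-P samples. A minimal user-space sketch in C (all names here, such as list_node, entry, and bounded_lookup, are hypothetical and are not kernel or OpenAFS code) shows the same loop shape with a step budget added so the corrupted case returns instead of hanging:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a circular doubly linked list node (hypothetical). */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* Hypothetical entry with the list node embedded inside it. */
struct entry {
    unsigned int hash;
    struct list_node link;
};

/* Recover the containing entry from a pointer to its embedded node. */
#define ENTRY_OF(node) \
    ((struct entry *)((char *)(node) - offsetof(struct entry, link)))

/* Make an empty chain: the head points at itself. */
static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert node n directly after head. */
static void list_add(struct list_node *head, struct list_node *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/*
 * Walk the chain looking for an entry with the given hash.
 * Returns 1 if found, 0 if the walk reached the head without a match,
 * and -1 if max_steps nodes were visited without returning to the head
 * (the case that would spin forever in an unbounded loop).
 */
static int bounded_lookup(struct list_node *head, unsigned int hash,
                          unsigned long max_steps)
{
    struct list_node *tmp;
    unsigned long steps = 0;

    assert(head != NULL);
    tmp = head->next;
    for (;;) {
        struct entry *e;

        if (tmp == head)            /* normal termination: back at head */
            return 0;
        if (steps++ >= max_steps)   /* budget exhausted: chain looks cyclic */
            return -1;
        e = ENTRY_OF(tmp);
        tmp = tmp->next;            /* advance before testing, as in the asm */
        if (e->hash == hash)
            return 1;
    }
}
```

With a three-entry chain whose last node is rewired to point at the first entry instead of the head, bounded_lookup returns -1 for a hash that is not present, where the unbounded loop would never exit.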
>
> -- Nathan
>
> ------------------------------------------------------------
> Nathan Neulinger                       EMail:  nneul@umr.edu
> University of Missouri - Rolla         Phone: (573) 341-4841
> Computing Services                       Fax: (573) 341-4216
>
>
> > -----Original Message-----
> > From: Neulinger, Nathan
> > Sent: Thursday, July 11, 2002 10:32 AM
> > To: 'Derrick J Brashear'
> > Subject: RE: [OpenAFS-devel] trying to track down a cm hang/lockup...
> >
> >
> > Have not tried the head yet.
> >
> > If I don't get anything useful out of the next failure,
> > trying head will likely be the next step.
> >
> > -- Nathan
> >
> > ------------------------------------------------------------
> > Nathan Neulinger                       EMail:  nneul@umr.edu
> > University of Missouri - Rolla         Phone: (573) 341-4841
> > Computing Services                       Fax: (573) 341-4216
> >
> >
> > > -----Original Message-----
> > > From: Derrick J Brashear [mailto:shadow@dementia.org]
> > > Sent: Thursday, July 11, 2002 10:28 AM
> > > To: Neulinger, Nathan
> > > Subject: RE: [OpenAFS-devel] trying to track down a cm hang/lockup...
> > >
> > >
> > > On Thu, 11 Jul 2002, Neulinger, Nathan wrote:
> > >
> > > > > > At the moment, I've got the watchdog turned off on the
> > > > > > machines, and am waiting for the next failure to see what I
> > > > > > can determine...
> > > > >
> > > > > ok. you're not running with the lock tracing patches to
> > > > > fstrace, are you?
> > > > > i never got those to work without problems
> > > >
> > > > Hmm... Would they be in the protos branch/head and enabled by
> > > > default? If so, yes. Otherwise no.
> > >
> > > If they are, they aren't enabled. Have you determined this is in
> > > the head and the protos branch?
> > >
> > >
> > >
> >
> _______________________________________________
> OpenAFS-devel mailing list
> OpenAFS-devel@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-devel
>