[OpenAFS-devel] RE: fyi - those samba servers have been stable for several days now...

Neulinger, Nathan nneul@umr.edu
Fri, 19 Jul 2002 08:46:28 -0500


FYI - I've been trying to diagnose a problem with a few of our samba
servers locking up randomly with builds from the cvs trunk and the
protos branch. After testing several different builds, including the
stable12 branch, it appears that whatever is causing the problem is in
the trunk. (And therefore also in protos13.)

I've attached some of the previous notes/conversation about this.

Unfortunately, I don't usually get much useful trace information, so I'm
not sure if the previous stuff will be useful or not.

If I put the trunk back on the servers, it will usually reproduce within
a day, sometimes within hours. If someone wants me to try a patch, I can
go ahead and put that on one or more of the servers to see if it fixes
this particular problem. Unfortunately, I do not have a method of
reproduction other than waiting and seeing.

-- Nathan

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-4841
Computing Services                       Fax: (573) 341-4216


> -----Original Message-----
> From: Derrick J Brashear [mailto:shadow@dementia.org]
> Sent: Friday, July 19, 2002 8:40 AM
> To: Neulinger, Nathan
> Subject: Re: fyi - those samba servers have been stable for several days now...
>
> On Fri, 19 Jul 2002, Neulinger, Nathan wrote:
>
> > Appears that there is definitely something in the trunk+protos that is
> > causing deadlock or other failure.
>
> well, the biggest "different" thing is the finegrained dcache locking, but
> that's not the only thing.
>
> post your findings to -devel, i guess, and mention the difference. it does
> mean we need to be careful about what we pull up

From: "Derrick J Brashear" <shadow@dementia.org>
To: "Neulinger, Nathan" <nneul@umr.edu>
Subject: RE: [OpenAFS-devel] trying to track down a cm hang/lockup...
Date: Tue, 16 Jul 2002 10:09:23 -0500

On Tue, 16 Jul 2002, Neulinger, Nathan wrote:

> Yes, I get a similar failure with head. Complete hang.
>
> Almost everything appears hung in filemap_nopage, and the system appears
> hung in swapper. Does not appear to be spinning. And, all of the smbd's
> next frame after filemap_nopage indicates afs_global_lock.
Sigh.

> Were you able to get into srvtst03?

Haven't tried yet.

> I'm not sure if this is the exact same failure as with the head, but it
> sure feels similar.
>
> Interesting. This one is in do_BUG... obviously panic'd, but no output:
>
> login         D C0311D00     0 20852    708                     (NOTLB)
> Call Trace: [<c01159df>] [<e0a2a280>] [<e0a2a280>] [<c0105e63>] [<c010600c>]
>    [<e0a2a280>] [<e0a1c3a2>] [<c013b2c4>] [<c01469bd>] [<c0107363>]
>
> not sure why it's got a double invocation of afs_global_lock there...
> Also, it's tracking back to:
>
> e0a1c0bc afs_icl_SetSetStat     [libafs-2.4.18.mp]

are you using fstrace?

> as well, called from filp_open.

so if you're bored, one more try, use the head of openafs-stable-1_2_x

-D





From: "Neulinger, Nathan" <nneul@umr.edu>
To: "OpenAFS-Devel Mailing List (E-mail)" <openafs-devel@openafs.org>
Subject: RE: [OpenAFS-devel] trying to track down a cm hang/lockup...
Date: Fri, 12 Jul 2002 15:36:25 -0500

Tracing it out by hand with symbol list gets me:

__read_lock_failed
cdput
__user_walk
getname
dput
vcache2inode    (libafs)
sock_recvmsg
follow_down
path_release
d_lookup

I don't have much more info though unfortunately.

(If one of the core developers is handy with kdb and would be willing to
look around at some point - I've got these machines on serial
consoles... Just got to rebuild kernel with kdb support first. Don't
know how much of an impact that has though.  We've got three in a
checked rotation, so I can leave one in the hung state for a while if
need be.)

Other two are still running, will perform same checks on them to see if
it traces to the same problem.

-- Nathan

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-4841
Computing Services                       Fax: (573) 341-4216


> -----Original Message-----
> From: Neulinger, Nathan
> Sent: Friday, July 12, 2002 3:19 PM
> To: OpenAFS-Devel Mailing List (E-mail)
> Subject: RE: [OpenAFS-devel] trying to track down a cm hang/lockup...
>
> Well, one of them just crashed again... Looks to me like whatever is
> crashing is enough to completely lock the machine, not just AFS. There
> was no oops. I've yet to be able to get a useful trace out of it...
> Still looking over it though... Based on the symbol offsets, it looks to
> me like it is somewhere in d_lookup.
>
> Interesting, repeatedly hitting Alt-SysRQ-P has it bouncing around to
> different addresses, but all within d_lookup. Could there be something
> that cache manager corrupted that would be causing the kernel to spin in
> d_lookup?
>
> I swear, even if it forces me to look at assembly, kdb is going in my
> next kernel build.
>
> It's this section of the disassembled d_lookup:
>
>      ad3:       8b 1c 24                mov    (%esp,1),%ebx
>      ad6:       83 eb 10                sub    $0x10,%ebx
>      ad9:       39 2c 24                cmp    %ebp,(%esp,1)
>      adc:       0f 84 ae 00 00 00       je     b90 <d_lookup+0x120>
>      ae2:       8b 04 24                mov    (%esp,1),%eax
>      ae5:       8b 54 24 08             mov    0x8(%esp,1),%edx
>      ae9:       8b 00                   mov    (%eax),%eax
>      aeb:       89 04 24                mov    %eax,(%esp,1)
>      aee:       39 53 44                cmp    %edx,0x44(%ebx)
>      af1:       75 e0                   jne    ad3 <d_lookup+0x63>
>
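
For anyone following the disassembly above, here is a rough user-space
model of what that loop appears to be doing. It is only a sketch, assuming
the quoted instructions are the hash-chain walk at the top of d_lookup:
follow next pointers, leave the loop when the walk gets back to the bucket
head, otherwise compare the stored name hash. Node, make_chain, lookup and
max_steps are hypothetical names made up for this illustration, not kernel
or OpenAFS identifiers, and the real code does further checks after a hash
match that are left out here. If the chain were ever corrupted so that it
loops without passing through the head, the walk could not terminate, which
would be consistent with the sampled addresses bouncing around inside
d_lookup.

```python
class Node:
    """Hypothetical stand-in for one entry on a hash chain."""

    def __init__(self, name_hash):
        self.name_hash = name_hash
        self.next = self  # a lone node points at itself, like an empty list


def make_chain(hashes):
    """Build a circular chain: head -> n1 -> n2 -> ... -> head."""
    head = Node(None)
    prev = head
    for h in hashes:
        node = Node(h)
        prev.next = node
        prev = node
    prev.next = head
    return head


def lookup(head, wanted, max_steps=1000):
    """Walk the chain in the same order as the quoted loop (my reading of it).

    Returns ("found", steps), ("miss", steps) or ("spin", steps).
    max_steps exists only so this model can report a spin; the loop in
    the disassembly has no such bound.
    """
    tmp = head.next
    steps = 0
    while steps < max_steps:
        steps += 1
        if tmp is head:                # cmp %ebp,(%esp,1) ; je ... (leave loop)
            return ("miss", steps)
        node = tmp
        tmp = tmp.next                 # mov (%eax),%eax ; mov %eax,(%esp,1)
        if node.name_hash == wanted:   # cmp %edx,0x44(%ebx) ; jne ad3
            return ("found", steps)
    return ("spin", steps)


if __name__ == "__main__":
    chain = make_chain([1, 2, 3])
    print(lookup(chain, 2))   # entry present on a healthy chain
    print(lookup(chain, 9))   # entry absent, walk returns to the head

    # Simulated corruption: the last entry points back at the first entry
    # instead of the head, so the head is never reached again.
    chain.next.next.next.next = chain.next
    print(lookup(chain, 9))   # the model gives up only because of max_steps
```

On a healthy chain a missing entry costs one pass around the bucket. On the
corrupted chain the same lookup only stops because the model imposes a step
limit.
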
> -- Nathan
>
> ------------------------------------------------------------
> Nathan Neulinger                       EMail:  nneul@umr.edu
> University of Missouri - Rolla         Phone: (573) 341-4841
> Computing Services                       Fax: (573) 341-4216
>
> > -----Original Message-----
> > From: Neulinger, Nathan
> > Sent: Thursday, July 11, 2002 10:32 AM
> > To: 'Derrick J Brashear'
> > Subject: RE: [OpenAFS-devel] trying to track down a cm hang/lockup...
> >
> > Have not tried the head yet.
> >
> > If I don't get anything useful out of the next failure,
> > trying head will likely be the next step.
> >
> > -- Nathan
> >
> > ------------------------------------------------------------
> > Nathan Neulinger                       EMail:  nneul@umr.edu
> > University of Missouri - Rolla         Phone: (573) 341-4841
> > Computing Services                       Fax: (573) 341-4216
> >
> > > -----Original Message-----
> > > From: Derrick J Brashear [mailto:shadow@dementia.org]
> > > Sent: Thursday, July 11, 2002 10:28 AM
> > > To: Neulinger, Nathan
> > > Subject: RE: [OpenAFS-devel] trying to track down a cm hang/lockup...
> > >
> > > On Thu, 11 Jul 2002, Neulinger, Nathan wrote:
> > >
> > > > > > At the moment, I've got the watchdog turned off on the machines, and am
> > > > > > waiting for the next failure to see what I can determine...
> > > > >
> > > > > ok. you're not running with the lock tracing patches to fstrace, are you?
> > > > > i never got those to work without problems
> > > >
> > > > Hmm... Would they be in the protos branch/head and enabled by default?
> > > > If so, yes. Otherwise no.
> > >
> > > If they are, they aren't enabled. Have you determined this is in the head
> > > and the protos branch?
> _______________________________________________
> OpenAFS-devel mailing list
> OpenAFS-devel@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-devel
_______________________________________________
OpenAFS-devel mailing list
OpenAFS-devel@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-devel
