[OpenAFS] Fileserver process hung on startup

Todd DeSantis atd@us.ibm.com
Mon, 12 Apr 2004 15:18:48 -0400


Hi -

About once every two years we used to get reports of a
fileserver that would not start properly. The cause was
the fileserver failing to contact the vlservers.

We could get around it by doing a

      bos restart <db server> vlserver

I don't know if this is the same problem you are seeing,
but you could try this.

It may also be worth capturing the rxdebug output from the vlserver
machine:

# rxdebug <db server> -port 7003 -allconn -rxstat > <db-server>.7003

Find the connections for this fileserver in the rxdebug output
and check whether any of them show problems.
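
To pick the fileserver's connections out of the saved dump, something
like the following might help. It is only a rough sketch: the function
name show_fileserver_conns is made up for illustration, it assumes the
dump was saved to a file as in the rxdebug command above, and it assumes
a grep that supports -A (GNU and BSD grep do). It does not parse the
rxdebug format at all; it just prints every line that mentions the
fileserver's address plus a few lines of trailing context.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenAFS): show the lines of a saved
# rxdebug dump that mention one address, with trailing context lines.
# usage: show_fileserver_conns <dump-file> <fileserver-ip> [context-lines]
show_fileserver_conns() {
    dump=$1
    addr=$2
    ctx=${3:-6}   # number of context lines is an arbitrary default
    # -F: treat the address as a literal string, so dots are not wildcards
    # -w: match whole words only, so 10.0.0.5 does not also hit 10.0.0.50
    # -A: print the following lines, where the per-connection details are
    grep -F -w -A "$ctx" -- "$addr" "$dump"
}
```

For example, show_fileserver_conns <db-server>.7003 <fileserver IP>
prints only the parts of the dump that refer to that fileserver, which
is easier to read than the full -allconn output.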

Just some thoughts.

Thanks

Todd



John Morris <openafs@butchwax.com>
Sent by: openafs-info-admin@openafs.org
04/11/2004 04:50 PM

To:      openafs-info@openafs.org
cc:
Subject: Re: [OpenAFS] Fileserver process hung on startup




Hey!  (Sorry for sending previous mails to you as well as the list,
Derrick.)

Well, after clearing up for almost two weeks, this problem is occurring
again.  Fileserver isn't listening on 2040; probably in the middle of a
threads syscall again; the LD_ASSUME_KERNEL trick isn't working.  Again,
I'm running openafs 1.2.11, smp kernel 2.4.23, RH8.

Where should I continue looking to solve this?  Just point me in a
direction, I'll take it from there.  Thanks!

             John


On Tue, 2004-03-30 at 00:38, Derrick J Brashear wrote:
> On Tue, 30 Mar 2004, John Morris wrote:
>
> > Cool, got the strace.  After the expected loading of .so files and
> > config files and such, we see it contact the vldbs of the other two
> > servers and do a 'rt_sigsuspend'; the second from which it never
> > returns.  Is this a thread locking issue?
>
> maybe, but if so the LD_ASSUME_KERNEL should have fixed it, unless someone
> broke that, and one would hope if they did that they'd also have fixed
> pthreads.
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info

