[OpenAFS-devel] 50 second fetch-data?

Sven Oehme oehmes@de.ibm.com
Fri, 7 Oct 2005 15:25:29 +0200


This is a multipart message in MIME format.
--=_alternative 0049BADFC1257093_=
Content-Type: text/plain; charset="US-ASCII"

Oh,

I am only now reading this thread from the beginning :-)
Maybe we have two bugs here. My bug also reduces performance, but I
have no 50 second delay. I do see the same messages (hundreds of them).
I start multiple batch jobs from 1 client (different processes) against
1 server and 1 volume.

Here is the rxdebug output from the client, in case it helps somebody:


testblade11:~ # rxdebug localhost 7001 -rxstats -noconns -long
Trying 127.0.0.1 (port 7001):
Free packets: 129, packet reclaims: 0, calls: 55, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
1 threads are idle
rx stats: free packets 129, allocs 452504, alloc-failures(rcv 0/0,send 575/0,ack 0)
   greedy 0, bogusReads 0 (last from host 0), noPackets 0, noBuffers 0, selects 0, sendSelects 0
   packets read: data 8585 ack 124336 busy 0 abort 0 ackall 0 challenge 53 response 0 debug 1420 params 0 unused 0 unused 0 unused 0 version 0
   other read counters: data 8585, ack 124002, dup 0 spurious 333 dally 1
   packets sent: data 114805 ack 8529 busy 0 abort 0 ackall 0 challenge 0 response 53 debug 0 params 0 unused 0 unused 0 unused 0 version 0
   other send counters: ack 8529, data 870762 (not resends), resends 0, pushed 0, acked&ignored 340943
        (these should be small) sendFailed 0, fatalErrors 0
   Average rtt is 0.001, with 26815 samples
   Minimum rtt is 0.000, maximum is 0.095
   1 server connections, 29 client connections, 2 peer structs, 47 call structs, 0 free call structs
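
For anyone who wants to compare several rxdebug runs, the counter lines above can be pulled into dictionaries. This is only a rough sketch: the function name parse_counters and the RXDEBUG_EXCERPT variable are made up for illustration, the excerpt is copied from the output above with the wrapped lines rejoined, and the regular expression assumes the "name value" layout seen in this one sample.

```python
import re

# Excerpt of the rxdebug output shown above (wrapped lines rejoined).
RXDEBUG_EXCERPT = """\
   packets read: data 8585 ack 124336 busy 0 abort 0 ackall 0 challenge 53 response 0 debug 1420 params 0 unused 0 unused 0 unused 0 version 0
   other read counters: data 8585, ack 124002, dup 0 spurious 333 dally 1
   packets sent: data 114805 ack 8529 busy 0 abort 0 ackall 0 challenge 0 response 53 debug 0 params 0 unused 0 unused 0 unused 0 version 0
   other send counters: ack 8529, data 870762 (not resends), resends 0, pushed 0, acked&ignored 340943
"""


def parse_counters(text, label):
    """Return {name: value} for the 'name value' pairs on the line that
    starts with '<label>:'. Repeated names (e.g. 'unused') keep the last value."""
    prefix = label + ":"
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            body = stripped[len(prefix):]
            pairs = re.findall(r"([A-Za-z&]+) (\d+)", body)
            return {name: int(value) for name, value in pairs}
    raise KeyError(label)


read = parse_counters(RXDEBUG_EXCERPT, "packets read")
other_read = parse_counters(RXDEBUG_EXCERPT, "other read counters")
sent = parse_counters(RXDEBUG_EXCERPT, "packets sent")
other_send = parse_counters(RXDEBUG_EXCERPT, "other send counters")

# A few ratios that make two runs easier to compare side by side.
print("acks read per data packet sent:", read["ack"] / sent["data"])
print("spurious / acks read:", other_read["spurious"] / read["ack"])
print("acked&ignored:", other_send["acked&ignored"])
```

Saving the output of each run and feeding it through the same function would show which counters grow while the slowdown is happening.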


Sven




Sven Oehme/Germany/IBM@IBMDE
Sent by: openafs-devel-admin@openafs.org
10/07/05 03:04 PM

To: Jeffrey Altman <jaltman@secure-endpoints.com>
cc: Harald Barth <haba@pdc.kth.se>, openafs-devel@openafs.org, rees@umich.edu, psomogyi@gamax.hu
Subject: Re: [OpenAFS-devel] 50 second fetch-data?

Hi Jeffrey,

Peter and I are working on that bug. I have a test environment where I
can reproduce it within 2 seconds.
If anybody would like to assist us, I can provide a tcpdump taken while
it happens.

Sven



Jeffrey Altman <jaltman@secure-endpoints.com>
Sent by: openafs-devel-admin@openafs.org
10/07/05 02:22 PM

To: Harald Barth <haba@pdc.kth.se>
cc: rees@umich.edu, openafs-devel@openafs.org
Subject: Re: [OpenAFS-devel] 50 second fetch-data?

Harald Barth wrote:

> You probably mean stuff like this:
> 
> Wed Oct  5 17:31:21 2005 FindClient: client 8320a78(6d5cb8f8) already had conn a7071568 (host 3fdded82), stolen by client 8320a78(6d5cb8f8)


> I have only ONE such log line and not for the time frame in question.
> 3fdded82 is my laptop 130.237.221.63 when at work. But I have no such
> message for any of its other IPs which would be *eded82 (130.237.237.*)
> - my laptop when at home.
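
The mapping Harald describes between the host field and the dotted address can be checked with a few lines of Python. This is a sketch based only on his one example: it assumes the hex string lists the address bytes in reverse order (3f dd ed 82 read backwards gives 130.237.221.63), and host_hex_to_ip is a made-up helper name, not anything from the fileserver.

```python
import socket


def host_hex_to_ip(host_hex):
    """Decode a hex host field such as '3fdded82' into a dotted quad.

    Assumption taken from the example in this thread: the hex digits list
    the four address bytes in reverse order.
    """
    raw = bytes.fromhex(host_hex.zfill(8))  # pad in case leading zeros were dropped
    return socket.inet_ntoa(raw[::-1])      # reverse the bytes, then format


print(host_hex_to_ip("3fdded82"))
```

For a host field ending in eded82, the same decoding yields an address in 130.237.237.*, matching what Harald says about his home addresses.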

This log message is not a symptom of the bug that was fixed related to
UUID collision.   This problem you are seeing may or may not be related
and it may or may not be an actual bug.

> I moved my H.haba.mail volume to another server which allows me to gdb
> and stop the fileserver without being lynched, but of course the
> problems disappeared when I did that. Probably I need to use up some
> kind of resource in the fileserver/rx first. I don't know how without
> letting loose real users. I know I have many connections from many
> clients. But a lot of free threads and no CPU or I/O load to speak of.
> Feel free to run rxdebug against houting.pdc.kth.se if you think you
> see something that I don't. Any tips how to collect statistics?
> 
> Harald.

I doubt moving your volume is going to help track down the problem.
You are not going to have lots of other users connecting to the new 
server.

I don't think we need to be able to stop the service.  However, it would
be useful to see what the server is doing in Ethereal.

Jeffrey Altman




--=_alternative 0049BADFC1257093_=--