[OpenAFS-devel] write errors, servers going 'down'

Andrei Maslennikov andrei.maslennikov@gmail.com
Wed, 23 May 2007 09:39:44 +0200


  We see exactly the same error that Kirby Bakken reported in November 2006
  (the client loses contact with the fileserver under heavy load). On fast
  multicore Opterons we can reproduce it in 100% of cases. The error occurs
  under heavy load, with an application that calculates the checksums of
  3000+ files. Some details:

  - It happens with OpenAFS 1.4.1, 1.4.2 and 1.4.4
  - It happens on different Red Hat kernels: 2.6.9-42.0.10.ELsmp and
    2.6.9-55.ELsmp
  - It is reproducible on several identical machines
  - It happens with the most abundant afsd parameters, with the cache on disk
    or on a ramdisk
  - It does not happen with small memcaches (65536, 131072)
  - It reappears with a memcache of 256MB

  - On the fileserver side, with verbose debugging in FileLog, everything is
    clean
  - On the client side, we have captured a pair of fstrace outputs; they can
    be found under http://afs.caspur.it/rtb (these are large files of 100+MB,
    in zipped form; Rainer Toebbicke pointed out to us the place where the
    error occurred, and he was going to comment on it in a separate mail).

  If somebody from the group wishes to debug it, we can provide him/her with
  access to one of the machines in question, show how the error can be
  reproduced, and give any support needed during the debugging.
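
  For readers without access to those machines, here is a minimal sketch of
  the kind of workload described above: walk a directory tree and checksum
  every file in it. It is an illustration only, not the application used in
  the tests. The choice of SHA-256, the chunk size, and the names file_sha256
  and checksum_tree are all assumptions, and nothing here is known to trigger
  the error on its own.

```python
import hashlib
import sys
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Checksum every regular file under root.

    I/O errors are collected per file (as errno values) instead of
    aborting the run, so a timeout on one file does not hide the rest.
    """
    sums, errors = {}, {}
    for path in sorted(Path(root).rglob("*")):
        try:
            if path.is_file():
                sums[str(path)] = file_sha256(path)
        except OSError as exc:
            errors[str(path)] = exc.errno
    return sums, errors


if __name__ == "__main__":
    # Hypothetical usage: pass a directory in /afs as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    sums, errors = checksum_tree(root)
    print(f"{len(sums)} files checksummed, {len(errors)} errors")
    for name, code in errors.items():
        print(f"  errno {code}: {name}")
```

  Pointing it at a directory holding a few thousand files gives a read-heavy
  load of roughly the shape described; the per-file errno list makes it easy
  to see whether any failure coincides with the "Lost contact" messages.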

  Andrei.

On 11/14/06, Kirby Bakken <kirbyb@us.ibm.com> wrote:
>
>
> More information....  Here's the 'format' of the write error messages:
>
> afs: Lost contact with file server 9.41.253.103 in cell austin.ibm.com (all multi-homed ip addresses down for the server)
> afs: Lost contact with file server 9.41.253.103 in cell austin.ibm.com (all multi-homed ip addresses down for the server)
> afs: failed to store file (110)
> afs: failed to store file (110)
> afs: failed to store file (110)
> afs: failed to store file (110)
> afs: failed to store file (110)
> afs: failed to store file (110)
> afs: failed to store file (110)
> afs: failed to store file (110)
> afs: file server 9.41.253.103 in cell austin.ibm.com is back up (multi-homed address; other same-host interfaces may still be down)
> afs: file server 9.41.253.103 in cell austin.ibm.com is back up (multi-homed address; other same-host interfaces may still be down)
> .......
>
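
A note on the "(110)" in the quoted messages: it looks like an errno value.
On Linux, 110 is ETIMEDOUT, which would fit a client that has marked the
server down, but errno numbering differs between operating systems and the
quoted report does not say which platform the client ran on. The small sketch
below (the helper name describe_errno is an assumption) shows how to look the
number up on whatever machine prints the message.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as "ETIMEDOUT";
    # the mapping is platform-specific, so run this on the client itself.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    name, message = describe_errno(110)
    print(f"110 -> {name}: {message}")
```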
