[OpenAFS-devel] write errors, servers going 'down'
Andrei Maslennikov
andrei.maslennikov@gmail.com
Wed, 23 May 2007 09:39:44 +0200
We see exactly the same error as reported by Kirby Bakken in November 2006
(the client loses contact with the fileserver under heavy load). On fast
multicore Opterons we can reproduce it in 100% of cases. The error occurs
under heavy load with an application that calculates the checksums of
3000+ files. Some details:
- It happens with OpenAFS 1.4.1, 1.4.2 and 1.4.4
- It happens on different Red Hat kernels: 2.6.9-42.0.10.ELsmp,
  2.6.9-55.ELsmp
- It is reproducible on several identical machines
- It happens even with the most generous afsd parameters, with the cache
  on disk or on a ramdisk
- It does not happen with small memcaches (65536, 131072)
- It reappears with a memcache of 256MB
- On the fileserver side, with verbose debugging in FileLog, everything is
  clean
- On the client side, we have captured a pair of fstrace outputs; they can
  be found at http://afs.caspur.it/rtb (these are large files of 100+MB,
  zipped; Rainer Toebbicke has pointed us to the place where the error
  occurred, and he was going to comment on it in a separate mail).
If somebody from the group wishes to debug it, we can provide access to
one of the machines in question, show how the error can be reproduced,
and give any support needed during the debugging.
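
For anyone who wants to approximate this kind of load on their own client,
here is a minimal sketch of a parallel checksum run. It is only an
illustration and not the application mentioned above: the directory
/afs/example.org/testdata, the choice of SHA-1, and the thread count are
all assumptions made for the example.

```python
import hashlib
import os
import sys
from concurrent.futures import ThreadPoolExecutor


def file_checksum(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root, workers=8):
    """Checksum every file under root using a pool of worker threads."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    paths.sort()
    # Several files are read concurrently to keep the cache manager busy.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(file_checksum, paths))
    return dict(zip(paths, digests))


if __name__ == "__main__":
    # Hypothetical default path; replace it with a real directory in AFS.
    root = sys.argv[1] if len(sys.argv) > 1 else "/afs/example.org/testdata"
    for path, digest in checksum_tree(root).items():
        print(digest, path)
```

Pointing it at a tree of a few thousand files and raising the worker count
should produce a sustained read load; whether that is enough to trigger
the error on a given machine is not something this sketch can promise.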
Andrei.
On 11/14/06, Kirby Bakken <kirbyb@us.ibm.com> wrote:
>
>
> More information.... Here's the 'format' of the write error messages:
>
> afs: Lost contact with file server 9.41.253.103 in cell austin.ibm.com(all multi-homed ip addresses down for the server)
> afs: Lost contact with file server 9.41.253.103 in cell austin.ibm.com(all multi-homed ip addresses down for the server)
> afs: failed to store file (110)
> afs: failed to store file (110)
> afs: failed to store file (110)
> afs: failed to store file (110)
> afs: failed to store file (110)
> afs: failed to store file (110)
> afs: failed to store file (110)
> afs: failed to store file (110)
> afs: file server 9.41.253.103 in cell austin.ibm.com is back up
> (multi-homed address; other same-host interfaces may still be down)
> afs: file server 9.41.253.103 in cell austin.ibm.com is back up
> (multi-homed address; other same-host interfaces may still be down)
> .......
>