[OpenAFS] Strange regular afs failure

Derrick Brashear shadow@gmail.com
Mon, 24 Sep 2007 08:54:58 -0400



The AFS server threads servicing that client blocked, and a race was caught
when the client tried to make more of the same RPC while the previous one was
still being serviced.
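
To see whether these messages cluster on one client from week to week, the
FileLog entries can be tallied per host. The following is only a rough sketch:
the regular expression is modeled on the three log lines quoted below and may
not match other OpenAFS versions, and the names STILLBORN_RE, SAMPLE and
stillborn_by_host are made up for illustration.

```python
import re
from collections import Counter

# Pattern modeled on the three FileLog lines quoted in this thread
# (assumption: other fileserver versions may word the message differently).
STILLBORN_RE = re.compile(
    r"FindClient: stillborn client (?P<new>[0-9a-f]+)\((?P<id>[0-9a-f]+)\); "
    r"conn (?P<conn>[0-9a-f]+) \(host (?P<host>[\d.]+):(?P<port>\d+)\) "
    r"had client (?P<old>[0-9a-f]+)\([0-9a-f]+\)"
)

# Sample input copied from the quoted message, including its line wrapping.
SAMPLE = """\
Mon Sep 24 07:33:30 2007 FindClient: stillborn client 8221900(1ef6f034);
conn 823f0d0 (host 10.0.54.228:7001) had client 8221c48(1ef6f034)
Mon Sep 24 07:33:30 2007 FindClient: stillborn client 82215b8(1ef6f03c);
conn 823fd80 (host 10.0.54.228:7001) had client 8221900(1ef6f03c)
Mon Sep 24 07:33:30 2007 FindClient: stillborn client 8220fd0(1ef6f028);
conn 823d0f0 (host 10.0.54.228:7001) had client 82215b8(1ef6f028)
"""


def stillborn_by_host(log_text):
    """Count 'stillborn client' messages per client host address."""
    # Collapse all whitespace (including wrapped lines) to single spaces so
    # that an entry split across two lines still matches the pattern.
    flat = " ".join(log_text.split())
    return Counter(m.group("host") for m in STILLBORN_RE.finditer(flat))


if __name__ == "__main__":
    for host, count in stillborn_by_host(SAMPLE).items():
        print(host, count)
```

Running it over a whole FileLog instead of SAMPLE would show whether
10.0.54.228 is the only host affected.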


On 9/24/07, Frank Burkhardt <fbo2@gmx.net> wrote:
>
> Hi,
>
> an AFS client of mine runs a cron job every 5 minutes that reads from and
> writes to a single AFS volume.
>
> Every Monday morning at about 07:30 the job fails with I/O errors. The
> client log shows several "kernel: afs: failed to store file (5)" messages,
> and the FileLog on the volume's fileserver shows this:
>
> Mon Sep 24 07:33:30 2007 FindClient: stillborn client 8221900(1ef6f034); conn 823f0d0 (host 10.0.54.228:7001) had client 8221c48(1ef6f034)
> Mon Sep 24 07:33:30 2007 FindClient: stillborn client 82215b8(1ef6f03c); conn 823fd80 (host 10.0.54.228:7001) had client 8221900(1ef6f03c)
> Mon Sep 24 07:33:30 2007 FindClient: stillborn client 8220fd0(1ef6f028); conn 823d0f0 (host 10.0.54.228:7001) had client 82215b8(1ef6f028)
>
> The fileserver is set to restart automatically at 01:45 the same day, which
> means the job ran successfully several times after the restart before it
> failed. Restart times of my DB servers are set to Sunday morning.
>
> I checked the network: client and server are connected via a single managed
> switch, which shows no log entries for at least 1 hour around the event. I
> can also rule out other cron jobs on client and server; none of them runs
> near 07:30.
>
> The only event close in time is the restart of one of our NFS servers,
> which is done on a regular basis. The NFS server came back seconds before
> the AFS failure:
>
> Sep 24 06:09:06 hagen kernel: nfs: server helena not responding, still trying
> [...]
> Sep 24 07:33:27 hagen kernel: nfs: server helena OK
> Sep 24 07:33:33 hagen kernel: afs: failed to store file (5)
>
> What do the log entries on the AFS server mean? Does anyone have an idea
> where to look for the cause of the problem?
>
> Regards,
>
> Frank
