[OpenAFS-devel] write errors, servers going 'down'
Kirby Bakken
kirbyb@us.ibm.com
Tue, 14 Nov 2006 11:08:27 -0600
More information.... Here's the 'format' of the write error messages:
afs: Lost contact with file server 9.41.253.103 in cell austin.ibm.com
(all multi-homed ip addresses down for the server)
afs: Lost contact with file server 9.41.253.103 in cell austin.ibm.com
(all multi-homed ip addresses down for the server)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: file server 9.41.253.103 in cell austin.ibm.com is back up
(multi-homed address; other same-host interfaces may still be down)
afs: file server 9.41.253.103 in cell austin.ibm.com is back up
(multi-homed address; other same-host interfaces may still be down)
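To make bursts like this easier to count over a long dmesg capture, here is a minimal sketch in Python that sorts the three message types shown above into events. The function name classify and the sample text are illustrative only and not part of OpenAFS. The 110 in parentheses looks like a Linux errno value, which would be ETIMEDOUT (connection timed out); that reading is an assumption, not something the log itself states.

```python
import errno
import re
from collections import Counter

# Patterns copied from the messages quoted above; other lines are ignored.
PATTERNS = {
    "lost_contact": re.compile(r"afs: Lost contact with file server (\S+) in cell (\S+)"),
    "back_up": re.compile(r"afs: file server (\S+) in cell (\S+) is back up"),
    "store_failed": re.compile(r"afs: failed to store file \((\d+)\)"),
}


def classify(lines):
    """Return a list of (event_name, captured_groups) for recognised lines."""
    events = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                events.append((name, match.groups()))
                break
    return events


sample = """\
afs: Lost contact with file server 9.41.253.103 in cell austin.ibm.com
(all multi-homed ip addresses down for the server)
afs: failed to store file (110)
afs: failed to store file (110)
afs: file server 9.41.253.103 in cell austin.ibm.com is back up
(multi-homed address; other same-host interfaces may still be down)
"""

events = classify(sample.splitlines())
counts = Counter(name for name, _ in events)
print(dict(counts))

# Assumption: the number is an errno. The lookup uses the errno table of the
# machine running this script, so the name is only meaningful on Linux.
for name, groups in events:
    if name == "store_failed":
        print(groups[0], errno.errorcode.get(int(groups[0]), "unknown"))
```

Feeding it the saved output of dmesg instead of the sample string would give per-type totals for a whole build run.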
At that point, here's a partial ps -ef list:
root 12749 1 0 10:18 ? 00:00:00 [afsd]
root 12751 1 0 10:18 ? 00:00:00 [afs_checkserver]
root 12753 1 0 10:18 ? 00:00:00 [afs_background]
root 12755 1 0 10:18 ? 00:00:00 [afs_background]
root 12757 1 0 10:18 ? 00:00:00 [afs_background]
root 12759 1 0 10:18 ? 00:00:00 [afs_background]
root 12761 1 0 10:18 ? 00:00:00 [afs_background]
root 12763 1 0 10:18 ? 00:00:00 [afs_background]
root 12765 1 0 10:18 ? 00:00:00 [afs_background]
root 12767 1 0 10:18 ? 00:00:00 [afs_background]
root 12769 1 0 10:18 ? 00:00:00 [afs_background]
root 12771 1 0 10:18 ? 00:00:00 [afs_background]
root 12773 1 0 10:18 ? 00:00:00 [afs_background]
root 12775 1 0 10:18 ? 00:00:00 [afs_background]
root 12777 1 0 10:18 ? 00:00:00 [afs_background]
root 12779 1 0 10:18 ? 00:00:00 [afs_background]
root 12781 1 0 10:18 ? 00:00:00 [afs_background]
root 12783 1 0 10:18 ? 00:00:00 [afs_background]
root 12821 1 0 10:18 ? 00:00:05 [afs_cachetrim]
Note that there are 16 'zombie' afs_background processes... the same
number of 'daemon' processes I specified in OPTIONS.
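For anyone who wants to repeat the count without doing it by eye, a small Python sketch follows. The function name count_bracketed and the three sample lines are hypothetical. One caveat, stated as an assumption: ps normally shows a name in square brackets when no command line is available, which is usual for kernel threads, so these entries may not be zombies in the 'defunct' sense.

```python
from collections import Counter


def count_bracketed(ps_output):
    """Count entries whose command column is shown as [name] in ps -ef output."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        if fields and fields[-1].startswith("[") and fields[-1].endswith("]"):
            counts[fields[-1][1:-1]] += 1
    return counts


# Three lines in the same layout as the listing above.
sample = """\
root 12751 1 0 10:18 ? 00:00:00 [afs_checkserver]
root 12753 1 0 10:18 ? 00:00:00 [afs_background]
root 12755 1 0 10:18 ? 00:00:00 [afs_background]
"""

ps_counts = count_bracketed(sample)
print(ps_counts["afs_background"])
```

Run against the full listing, the afs_background figure can be compared directly with the -daemons value in use.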
=======================
Kirby Bakken
ESW Build Architect
Rochester, MN
email: kirbyb@us.ibm.com
ezpage:kirbyb
507-253-4549 / Tie: 553-4549
Fax: 507-253-3495
......one more straw can't possibly matter....
Kirby Bakken/Rochester/IBM@IBMUS
Sent by: openafs-devel-admin@openafs.org
11/13/2006 12:54 PM
To: openafs-devel@openafs.org
cc:
Subject: [OpenAFS-devel] write errors, servers going 'down'
Help!
I'm running RHEL4 U4 (uname -r => 2.6.9-42.0.3.ELsmp) x86_64 on one of
many dual Opteron 'linux' servers. Our servers are running 'some' level
of afs... that may be important, but for now I'm trying to figure out
where to start debugging....
I get these messages in 'dmesg':
afs: Lost contact with file server 9.10.228.186 in cell rchland.ibm.com
(all multi-homed ip addresses down for the server)
afs: Lost contact with file server 9.10.228.186 in cell rchland.ibm.com
(all multi-homed ip addresses down for the server)
afs: file server 9.10.228.186 in cell rchland.ibm.com is back up
(multi-homed address; other same-host interfaces may still be down)
afs: file server 9.10.228.186 in cell rchland.ibm.com is back up
(multi-homed address; other same-host interfaces may still be down)
I'm also seeing 'write' errors in the dmesg log, but don't currently have
an exact 'paste' of that info....
These errors only occur at 'high' load. Multiple processes
writing/reading to the same afs volume. I'm running these options and
cache settings:
LARGE="-stat 2800 -dcache 2400 -daemons 5 -volumes 128"
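When comparing option sets between machines it can help to turn a string like this into name/value pairs. A minimal sketch in Python, assuming every switch is followed by at most one value; parse_afsd_options is a hypothetical helper, not an OpenAFS tool:

```python
import shlex


def parse_afsd_options(options):
    """Split an afsd-style option string into a dict of switch -> value."""
    tokens = shlex.split(options)
    parsed = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if not token.startswith("-"):
            i += 1  # stray value with no switch in front of it; skip it
            continue
        name = token.lstrip("-")
        if i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
            value = tokens[i + 1]
            # Keep numbers as integers, anything else as a string.
            parsed[name] = int(value) if value.isdigit() else value
            i += 2
        else:
            parsed[name] = True  # switch given without a value
            i += 1
    return parsed


large = parse_afsd_options("-stat 2800 -dcache 2400 -daemons 5 -volumes 128")
print(large)
```

With the string above, the daemons entry comes out as 5, which is the number to check against the afs_background count in a ps listing.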
I had been running with 'medium' settings, and that's when I saw the write
errors... now I just see the 'Lost contact' errors, and 'failed to store
file' in the program writing files. (we're compiling/linking at the time
these errors occur).
I've got the cache size 'set':
CACHESIZE=600000
although when I cat out the cacheinfo file I get this:
cat /usr/vice/etc/cacheinfo
/afs:/usr/vice/cache:3628512
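The mismatch is easier to see with the units spelled out. The cacheinfo line has three colon-separated fields (mount point, cache directory, cache size in 1-kilobyte blocks). The sketch below, in Python, assumes CACHESIZE is in the same unit, which the post does not confirm; whether the init script lets CACHESIZE override cacheinfo is also not established here. The helper name parse_cacheinfo is hypothetical.

```python
def parse_cacheinfo(line):
    """Split a cacheinfo line into (mount_point, cache_dir, size_in_kb)."""
    mount_point, cache_dir, blocks = line.strip().split(":")
    return mount_point, cache_dir, int(blocks)


# Values taken from the message; CACHESIZE unit is assumed to be 1K blocks.
configured_kb = 600000
mount_point, cache_dir, cacheinfo_kb = parse_cacheinfo("/afs:/usr/vice/cache:3628512")

print(f"CACHESIZE: {configured_kb / 1024:.0f} MiB")
print(f"cacheinfo: {cacheinfo_kb / 1024:.0f} MiB")
print("values differ" if cacheinfo_kb != configured_kb else "values match")
```

If the two really are in the same unit, the cacheinfo value is roughly six times the configured one, so it would be worth confirming which of the two afsd is actually started with.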
I'm seeing these problems both with
'kernel-smp-module-openafs-1.4.0-2.6.9_42.0.3.EL_6_rhel4' and with
'openafs-kernel-smp-1.4.2-2.6.9_42.ELsmp_1.x86_64'
We had been seeing similar problems last March on RHEL4 U3, but an
'intermediate' AFS build of 'openafs-1.4.1rc2-rhel4.0.x86_64' seemed to
work 'most of the time'... (we get hangs about once every two weeks on
each of 6 of the dual Opteron servers, and can't even log in to a local
console to gather info... so we're not sure if this is AFS-related or
what).
What do I do to figure this out, or make it go away? Is there a 'Having
problems with openafs? Here's what to try...' set of instructions
somewhere that I've missed?
Thank you very much in advance for any help.