[OpenAFS-devel] write errors, servers going 'down'

Kirby Bakken kirbyb@us.ibm.com
Mon, 13 Nov 2006 12:54:48 -0600



Help!

I'm running RHEL4 U4 (uname -r => 2.6.9-42.0.3.ELsmp) x86_64 on one of 
many dual Opteron 'linux' servers.  Our servers are running 'some' level 
of afs...  that may be important, but for now I'm trying to figure out 
where to start debugging....

I get these messages in 'dmesg':

afs: Lost contact with file server 9.10.228.186 in cell rchland.ibm.com 
(all multi-homed ip addresses down for the server)
afs: Lost contact with file server 9.10.228.186 in cell rchland.ibm.com 
(all multi-homed ip addresses down for the server)
afs: file server 9.10.228.186 in cell rchland.ibm.com is back up 
(multi-homed address; other same-host interfaces may still be down)
afs: file server 9.10.228.186 in cell rchland.ibm.com is back up 
(multi-homed address; other same-host interfaces may still be down)
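
To see how often each server flaps, the two message shapes above can be tallied from saved dmesg output. This is only a minimal sketch: the function name count_flaps is made up for illustration, the patterns are copied from the messages quoted above, and it assumes each message sits on a single line in the real log (they are only wrapped here by the mail client).

```python
import re

# Patterns taken from the two message shapes quoted above.
DOWN = re.compile(r"afs: Lost contact with file server (\S+) in cell (\S+)")
UP = re.compile(r"afs: file server (\S+) in cell (\S+) is back up")


def count_flaps(log_text):
    """Return {(server, cell): {"down": n, "up": m}} for the given log text.

    Hypothetical helper: assumes one kernel message per line.
    """
    counts = {}
    for line in log_text.splitlines():
        for kind, pattern in (("down", DOWN), ("up", UP)):
            match = pattern.search(line)
            if match:
                key = (match.group(1), match.group(2))
                counts.setdefault(key, {"down": 0, "up": 0})[kind] += 1
    return counts


# The four messages from this post, unwrapped to one line each.
sample = "\n".join([
    "afs: Lost contact with file server 9.10.228.186 in cell rchland.ibm.com "
    "(all multi-homed ip addresses down for the server)",
    "afs: Lost contact with file server 9.10.228.186 in cell rchland.ibm.com "
    "(all multi-homed ip addresses down for the server)",
    "afs: file server 9.10.228.186 in cell rchland.ibm.com is back up "
    "(multi-homed address; other same-host interfaces may still be down)",
    "afs: file server 9.10.228.186 in cell rchland.ibm.com is back up "
    "(multi-homed address; other same-host interfaces may still be down)",
])

for (server, cell), tally in count_flaps(sample).items():
    print(server, cell, tally)
```

Feeding it the full output of dmesg instead of the sample would show whether only 9.10.228.186 is affected or several file servers are.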

I'm also seeing 'write' errors in the dmesg log, but don't currently have 
an exact 'paste' of that info....

These errors only occur at 'high' load.  Multiple processes 
writing/reading to the same afs volume.  I'm running these options and 
cache settings:

LARGE="-stat 2800 -dcache 2400 -daemons 5 -volumes 128"

I had been running with 'medium' settings, and that's when I saw the write 
errors...  now I just see the 'Lost contact' errors, and 'failed to store 
file' in the program writing files. (we're compiling/linking at the time 
these errors occur).

I've got the cache size 'set':

CACHESIZE=600000

although when I cat out the cacheinfo file I get this:

cat /usr/vice/etc/cacheinfo
/afs:/usr/vice/cache:3628512
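
The mismatch is easier to see side by side. The cacheinfo line has three colon-separated fields: the AFS mount point, the cache directory, and the cache size in 1 KB blocks. The sketch below parses it and compares it with the configured value. It assumes CACHESIZE is in the same 1 KB units, and the helper name parse_cacheinfo is made up for illustration.

```python
def parse_cacheinfo(line):
    """Split a cacheinfo line into (mount point, cache dir, size in 1 KB blocks)."""
    mount, cache_dir, blocks = line.strip().split(":")
    return mount, cache_dir, int(blocks)


# Values copied from this post; CACHESIZE is assumed to be in 1 KB blocks too.
configured_kb = 600000
mount, cache_dir, actual_kb = parse_cacheinfo("/afs:/usr/vice/cache:3628512")

print("mount point:   ", mount)
print("cache dir:     ", cache_dir)
print("cacheinfo size: %.2f GiB" % (actual_kb / 1024 / 1024))
print("CACHESIZE:      %.2f GiB" % (configured_kb / 1024 / 1024))
if actual_kb != configured_kb:
    print("mismatch: cacheinfo is %.1fx the configured CACHESIZE"
          % (actual_kb / configured_kb))
```

If the two numbers differ like this, the CACHESIZE setting is apparently not what ends up in cacheinfo, so the 'medium'/'large' tuning may be sized against the wrong cache.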

I'm seeing these problems both with 
'kernel-smp-module-openafs-1.4.0-2.6.9_42.0.3.EL_6_rhel4' and with 
'openafs-kernel-smp-1.4.2-2.6.9_42.ELsmp_1.x86_64'

We had been seeing similar problems last March on RHEL4 U3, but an 
'intermediate' AFS build of 'openafs-1.4.1rc2-rhel4.0.x86_64' seemed to 
work 'most of the time'...  (we get hangs about once every two weeks on 
each of 6 of the dual Opteron servers, and can't even log in to a local 
console to gather info..  so we're not sure if this is afs related or 
what).

What do I do to figure this out, or make it go away?  Is there a "Having 
problems with openafs?  Here's what to try...." set of instructions 
somewhere that I've missed?

Thank you very much in advance for any help.

=======================
Kirby Bakken
ESW Build Architect
Rochester, MN
email: kirbyb@us.ibm.com
ezpage:kirbyb
507-253-4549 / Tie:  553-4549
Fax:  507-253-3495

......one more straw can't possibly matter....
