[OpenAFS-devel] strange cache corruption on Intel Linux with 1.3.84
Sven Oehme
oehmes@de.ibm.com
Wed, 20 Jul 2005 00:08:50 +0200
Hi,
today we noticed a strange, reproducible cache corruption bug on a Linux
host running a SUSE Linux kernel and OpenAFS 1.3.84.
If I generate a high load on an SMP client (30-40 files written
simultaneously into a single volume), then after a few minutes under a
CPU load of 10-12 on a 2-CPU (4 with Hyperthreading) system, the client
loses contact with the AFS server holding the volume I am writing into.
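For illustration, a minimal sketch of that kind of load generator (the
volume path, file count, and file sizes here are placeholders, not our
actual test):

    # write 40 files in parallel into a single volume (path is made up)
    cd /afs/mycell/testvolume
    for i in $(seq 1 40); do
        dd if=/dev/zero of=stress.$i bs=1M count=100 &
    done
    wait    # keep all writers running simultaneously until they finish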
In a tcpdump I see that, shortly before contact is lost, the client sends
a fetch-status call several times and the server answers with an ABORT
about half a second later. At that point I am still able to browse
volumes located on a different server, but no matter what I try (fs
checkvol, fs checkserver, ...), I cannot chdir into any volume that
resides on that server. Even an AFS client restart doesn't fix the
problem. The only solution is to stop the client, rm -Rf /afs_cache, and
start the client again.
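Spelled out as commands (the init script name is my assumption for a SLES
box; use whatever starts the client on your system):

    /etc/init.d/afs stop      # stop the cache manager
    rm -Rf /afs_cache         # throw away the entire disk cache
    /etc/init.d/afs start     # bring the client back with a fresh cache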
I tried to reproduce the same error on a 2.4 kernel system, without any
luck. Even after a 2-hour stress test (CPU load of 10-20) I only get some
short (2-5 second) hangs. At the same time I also see in tcpdump the
fetch-status request from the client and the ABORT response from the
server, but there the client recovers (in dmesg I see that the connection
to the server was lost and, a few seconds later, that file server xyz is
back up).
The workaround for us is now to use the 2.4 kernel, but I assume this
should be fixed before 1.4 ...
If somebody is interested, I can provide a compressed tcpdump (5 MB from
the 2.4 kernel test and 9 MB from the SLES9 kernel test).
We will try to debug the problem more deeply tomorrow. Right now I need
some sleep, and any good idea about where to look would be appreciated
... :-)
Btw., client and server are connected via gigabit Ethernet and the server
has fibre drives. During the test the client generates a load of 10-20
MB/sec. In case it makes a difference: the server is started with
-p 32 -L, and the client with -stat 4000 -dcache 4000 -daemons 6
-volumes 256 -chunksize 17 -nosettime.
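For reference, the invocations look roughly like this (the binary paths
are guesses, and the fileserver would normally be started via bosserver
rather than by hand; the flags are exactly the ones above):

    # file server: 32 threads, large-configuration defaults
    /usr/afs/bin/fileserver -p 32 -L

    # cache manager: 4000 stat entries, 4000 dcache entries, 6 background
    # daemons, 256 volume entries, 2^17 = 128 KB chunks, no clock syncing
    /usr/vice/etc/afsd -stat 4000 -dcache 4000 -daemons 6 \
        -volumes 256 -chunksize 17 -nosettime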
The client has 1 GB of RAM; the server has 4 GB.
Sven