[OpenAFS-devel] strange cache corruption on Intel Linux with 1.3.84

Sven Oehme oehmes@de.ibm.com
Wed, 20 Jul 2005 00:08:50 +0200


Hi,

today we noticed a strange, reproducible cache corruption bug on a Linux
host running a SUSE Linux kernel and OpenAFS 1.3.84.

If I generate high load on an SMP client (30-40 files written
simultaneously into a single volume; the sketch below shows the kind of
load generator I mean), then after a few minutes at a CPU load of 10-12
on a 2-CPU system (4 with Hyperthreading) I lose contact with the AFS
server that holds the volume I am writing into. In a tcpdump I see that,
shortly before contact is lost, the client issues a fetch-status call
several times and the server sends an ABORT about half a second later. I
am still able to browse volumes located on a different server, but no
matter what I try (fs checkvolumes, fs checkservers, ...) I am not able
to chdir into any volume that resides on that server. Even an afs-client
restart doesn't fix the problem; the only solution is to stop the client,
rm -Rf /afs_cache, and start the client again.

I tried to reproduce the same error on a 2.4 kernel system, without any
luck. Even after a 2-hour stress test (CPU load of 10-20) I only see some
short (2-5 second) hangs. In tcpdump I see the same fetch-status requests
from the client and the ABORT response from the server at that moment,
but the client recovers (in dmesg I see that the connection to the server
was lost and, a few seconds later, that fileserver xyz is back up).
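
For reference, here is a minimal sketch of the kind of load generator I
mean; the target path, the writer count, and the per-file size are
placeholders, not our exact test setup:

    #!/bin/sh
    # spawn 30-40 simultaneous writers into a single AFS volume
    TARGET=/afs/mycell/testvol    # placeholder path, not our real volume
    WRITERS=35

    i=1
    while [ $i -le $WRITERS ]; do
        # each writer streams 100 MB of zeroes into its own file
        dd if=/dev/zero of=$TARGET/stress.$i bs=1M count=100 &
        i=$((i+1))
    done
    wait    # block until all writers have finished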

The workaround for us is now to use the 2.4 kernel, but I assume this
should be fixed before 1.4 ...
If somebody is interested, I can provide a compressed tcpdump (5 MB from
the 2.4 kernel test and 9 MB from the SLES9 kernel test); the sketch
below shows roughly how such a trace can be captured.
We will try to dig deeper into the problem tomorrow. Now I need some
sleep, and any good idea where to look would be appreciated ... :-)
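
For anyone who wants to capture a similar trace, something along these
lines should work (AFS fileserver traffic runs over UDP port 7000; the
interface name and the output file name are placeholders):

    # capture full packets of fileserver traffic for later analysis
    tcpdump -i eth0 -s 0 -w afs-abort.pcap udp port 7000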

Btw, client and server are connected via gigabit Ethernet and the server
has fibre drives. During the test the client generates a load of 10-20
MB/sec. The server is started with -p 32 -L, if that makes any
difference, and the client is started with -stat 4000 -dcache 4000
-daemons 6 -volumes 256 -chunksize 17 -nosettime (spelled out below).
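
Spelled out, and assuming standard install paths (the paths are my
assumption; the flags are exactly the ones from our setup, and the
fileserver would normally get its flags via the bosserver's BosConfig
rather than a direct invocation):

    # fileserver: 32 server threads (-p 32), large-server tuning preset (-L)
    /usr/afs/bin/fileserver -p 32 -L

    # cache manager: 4000 stat cache entries, 4000 dcache entries,
    # 6 background daemons, 256 volume cache entries,
    # 128 KB chunks (2^17 bytes), and no clock synchronization
    /usr/vice/etc/afsd -stat 4000 -dcache 4000 -daemons 6 \
        -volumes 256 -chunksize 17 -nosettime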

The client has 1 GB of RAM, the server has 4 GB.

Sven