[OpenAFS-devel] delays and lost contact with fileserver with 1.3.84 and higher

Alexander Bergolth leo@strike.wu-wien.ac.at
Fri, 28 Oct 2005 21:24:38 +0200


Hi!

I am experiencing frequent hangs and annoying delays with recent
openafs-clients. The delays appear both when reading or writing files.
Sometimes transfer contunies after some delay (about 8 to 10 seconds),
sometimes afs reports "Lost contact to fileserver".

I've seen those hangs with 1.3.84, 1.3.85, 1.4.0rc1 and rc5 clients on
Linux (Kernel 2.6). 1.3.80 and 1.3.82 work fine, so I expect that some
change between 1.3.82 and 1.3.84 causes the problems. The fileserver is
from transarc:
# rxdebug c-hoernchen -version
Trying 137.208.3.48 (port 7000):
AFS version: Base configuration afs3.4 5.77

To track down the problem, I've captured the network traffic between
client and server while creating 10 files with 100k each. The dumps are
available in pcap format (suitable for viewing with ethereal) and as
tcpdump output at

  http://leo.kloburg.at/tmp/openafs/

Maybe the files for 1.3.82 (ok) and 1.3.84 (hangs) will be relevant.

The command that produced the files was:
perl -le 'for $i ((1..20)) {  print "$i"; local *F; \
  open(F, ">f_$i") or die "open: $!"; print F "x"x10240; }'

The delays look like that:

17:25:33.967776 IP 137.208.3.48.afs3-fileserver >
137.208.89.100.afs3-callback:  rx ack first 1 serial 3151 reason ack
requested acked 16777216 (62)
17:25:41.242157 IP 137.208.89.100.afs3-callback >
137.208.3.48.afs3-fileserver:  rx data (1472)

17:25:41.305587 IP 137.208.3.48.afs3-fileserver >
137.208.89.100.afs3-callback:  rx ack first 1 serial 3162 reason ack
requested acked 16777216 (62)
17:25:51.319114 IP 137.208.3.48.afs3-fileserver >
137.208.89.100.afs3-callback:  rx ack first 2 serial 3162 reason ping (61)
17:25:51.319224 IP 137.208.89.100.afs3-callback >
137.208.3.48.afs3-fileserver:  rx ack first 1 serial 1761 reason ping
response (65)

17:25:51.377321 IP 137.208.3.48.afs3-fileserver >
137.208.89.100.afs3-callback:  rx ack first 1 serial 3168 reason ack
requested acked 16777216 (62)
17:26:01.375107 IP 137.208.3.48.afs3-fileserver >
137.208.89.100.afs3-callback:  rx ack first 2 serial 3168 reason ping (61)
17:26:01.375215 IP 137.208.89.100.afs3-callback >
137.208.3.48.afs3-fileserver:  rx ack first 1 serial 1765 reason ping
response (65)

I've also noticed that in versions 1.3.80 and 1.3.82 (those that do not
show the delays) each store-data UDP-packet is 5700 bytes and is
splitted in four UDP fragments. However, this is also true for 1.3.84,
which already shows the problems. In 1.4.0rcX, the store-data packets
seem to be smaller, the UDP packet is only 2896 bytes and comes in two
fragments. Is there any specific reason why all those packets are larger
than the MTU?

Any help would be greatly appreciated!

Cheers,
--leo
-- 
-----------------------------------------------------------------------
Alexander.Bergolth@wu-wien.ac.at                Fax: +43-1-31336-906050
Zentrum fuer Informatikdienste - Wirtschaftsuniversitaet Wien - Austria