[OpenAFS] Heavy performance loss on gigabit ethernet
Systems Administration
sysadmin@contrailservices.com
Thu, 12 Aug 2004 20:08:56 -0600
> This may be related to my problem with clients hanging - I had
> contemplated this previously but discarded it since the AFS
> protocol should recover from a bad or missing UDP packet, but Enrico's
> question raises the point: how does the AFS protocol recover when the
> pipe from server to client is lossy? Is the client responsible for
> recovering - and could a maladjusted network segment that drops a high
> percentage of packets be responsible?
I forced the affected network segments to a uniform 100 Mb full-duplex
setting, which greatly improved stability, but there is still a problem
with any client connected to the servers through the 100 Mb copper
segment. I have isolated the client hangs: by replacing the fileserver
binary with the LWP version I can get all 3 fileserver machines
attached to the fiber switch to perform apparently flawless
transactions between each other as clients or AFS volume servers.
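For anyone wanting to reproduce this, forcing an interface down to
100 Mb full duplex looks roughly like this on Linux with ethtool
(eth0 is just a placeholder for the real interface name; adjust for
your driver):

    # disable autonegotiation and force 100 Mb full duplex
    ethtool -s eth0 speed 100 duplex full autoneg off
    # confirm the link came back up with the forced settings
    ethtool eth0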
From this gigabit switch I have an unmanaged 100 Mb copper switch
leading to the individual workstations. Any client attached to this
unmanaged switch will hang while performing transactions against any of
the 3 fileservers. Simultaneous access to the same volume by a machine
on the fiber backbone and a client on the 100 Mb copper segment shows
that the fileserver binary itself is not hung, and other 100 Mb clients
can access the same volume at the same time, so this is not an
isolation/segmentation issue. There are no circular paths or
spanning-tree issues.
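For what it's worth, you can also confirm the fileserver process is
still answering while a copper client is wedged by poking it with
rxdebug from another machine (substitute the real server name; 7000 is
the standard fileserver port):

    # ask the fileserver for its version string - a reply means the
    # Rx listener is alive
    rxdebug <fileserver> 7000 -version
    # dump Rx statistics; a climbing resend count points at packet loss
    rxdebug <fileserver> 7000 -rxstats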
I ran ping floods and iperf traces across the network: there is no
packet loss between the fiber-attached machines, and once I forced the
fiber-attached interfaces to operate at 100 Mb the packet loss seen by
the copper-connected clients dropped from 20% to 0.5%. This correlates
strongly with the improved performance, since it now takes larger
transactions to hang the client - small files will pass, and small
batches of write/touch/delete actions will usually complete without a
hang, but large batches such as a recursive delete of a directory with
more than 20 files will hang without fail.
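The loss numbers above came from plain flood pings and UDP iperf runs
between the segments; roughly the following, with hostnames filled in
(the count and bandwidth values here are only illustrative):

    # flood ping from a copper client to a fileserver and read the
    # loss percentage from the summary (needs root)
    ping -f -c 10000 <fileserver>
    # UDP loss under load: start an iperf server on the fileserver...
    iperf -s -u
    # ...and push ~90 Mb/s of UDP at it from the copper client
    iperf -c <fileserver> -u -b 90M -t 30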
Tracing the system calls of a '/bin/rm -rvf <dirname>' run shows that
the client hangs while attempting to open the directory entry for
reading.
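The trace itself was nothing fancy - something along the lines of:

    # follow the rm, timestamp every syscall, log to a file
    strace -f -tt -o /tmp/rm_trace.out /bin/rm -rvf <dirname>

The hang shows up as the open() of the directory that never returns.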
As far as I can tell this looks like a foobared state machine that
misses some kind of synchronization between the client and the server -
most likely caused by the network issue trashing some of the packets
from the server as they pass from the gigabit segment to the slower
copper segment, which lacks 802.3x flow control.
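If missing pause frames really are the culprit it should be visible in
the pause settings on the gigabit-attached NICs; on Linux that is
something like (eth0 again a placeholder):

    # show whether 802.3x pause-frame flow control was negotiated
    ethtool -a eth0
    # try forcing rx/tx pause on, if the NIC and driver support it
    ethtool -A eth0 rx on tx on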
Logs available at this URL:
http://cumulonimbus.contrailservices.com/ted/hanglog_081204.txt