[OpenAFS] Heavy performance loss on gigabit ethernet
Systems Administration
sysadmin@contrailservices.com
Thu, 12 Aug 2004 20:08:56 -0600
> This may be related to my problem with clients hanging - I had
> contemplated this previously but discarded it since the AFS
> protocol should recover from a bad or missing UDP packet, but Enrico's
> question raises the point: how does the AFS protocol recover when the
> pipe from server to client is lossy? Is the client responsible for
> recovering - and could a maladjusted network segment that drops a high
> percentage of packets be responsible?
I forced the affected network segments to a uniform 100 Mb full-duplex
setting, which greatly improved stability, but there is still a problem
with any client connected to the servers through the 100 Mb copper
segment. I have isolated the client hangs: by replacing the fileserver
binary with the LWP version I can get all 3 fileserver machines
attached to the fiber switch to perform apparently flawless
transactions between each other as clients or AFS volume servers.
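For anyone wanting to reproduce this, forcing an interface down to
100 Mb full duplex looks roughly like this on Linux with ethtool
(eth0 is just a placeholder for the real interface name; adjust for
your driver):

    # disable autonegotiation and force 100 Mb full duplex
    ethtool -s eth0 speed 100 duplex full autoneg off
    # confirm the link came back up with the forced settings
    ethtool eth0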
From this gigabit switch I have an unmanaged 100 Mb copper switch
leading to the individual workstations. Any client attached to this
unmanaged switch will hang while performing transactions against any of
the 3 fileservers. Simultaneous access to the same volume by a machine
on the fiber backbone and a client on the 100 Mb copper segment shows
that the fileserver binary itself is not hung, and other 100 Mb clients
can access the same volume at the same time, so this is not an
isolation/segmentation issue. There are no circular paths or
spanning-tree issues.
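For what it's worth, you can also confirm the fileserver process is
still answering while a copper client is wedged by poking it with
rxdebug from another machine (substitute the real server name; 7000 is
the standard fileserver port):

    # ask the fileserver for its version string - a reply means the
    # Rx listener is alive
    rxdebug <fileserver> 7000 -version
    # dump Rx statistics; a climbing resend count points at packet loss
    rxdebug <fileserver> 7000 -rxstats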
I ran ping floods and iperf traces across the network: there is no
packet loss between the fiber-attached machines, and once I forced the
fiber-attached interfaces to operate at 100 Mb the packet loss seen by
the copper-connected clients dropped from 20% to 0.5%. This correlates
strongly with the improved performance, since it now takes larger
transactions to hang the client - small files will pass, and small
batches of write/touch/delete actions will usually complete without a
hang, but large batches such as a recursive delete of a directory with
more than 20 files will hang without fail.
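The loss numbers above came from plain flood pings and UDP iperf runs
between the segments; roughly the following, with hostnames filled in
(the count and bandwidth values here are only illustrative):

    # flood ping from a copper client to a fileserver and read the
    # loss percentage from the summary (needs root)
    ping -f -c 10000 <fileserver>
    # UDP loss under load: start an iperf server on the fileserver...
    iperf -s -u
    # ...and push ~90 Mb/s of UDP at it from the copper client
    iperf -c <fileserver> -u -b 90M -t 30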
Tracing the system calls of a '/bin/rm -rvf <dirname>' run shows that
the client hangs while attempting to open the directory entry for
reading.
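The trace itself was nothing fancy - something along the lines of:

    # follow the rm, timestamp every syscall, log to a file
    strace -f -tt -o /tmp/rm_trace.out /bin/rm -rvf <dirname>

The hang shows up as the open() of the directory that never returns.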
As far as I can tell this looks like a foobared state machine that
misses some kind of synchronization between the client and the server -
most likely caused by the network issue trashing some of the packets
from the server as they pass from the gigabit segment to the slower
copper segment, which lacks 802.3x flow control.
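If missing pause frames really are the culprit it should be visible in
the pause settings on the gigabit-attached NICs; on Linux that is
something like (eth0 again a placeholder):

    # show whether 802.3x pause-frame flow control was negotiated
    ethtool -a eth0
    # try forcing rx/tx pause on, if the NIC and driver support it
    ethtool -A eth0 rx on tx on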
Logs available at this URL:
http://cumulonimbus.contrailservices.com/ted/hanglog_081204.txt