[OpenAFS] Heavy performance loss on gigabit ethernet
Derrick J Brashear
shadow@dementia.org
Mon, 16 Aug 2004 15:45:27 -0400 (EDT)
On Thu, 12 Aug 2004, Theodore F Vaida wrote:
> I took the affected network segments and forced them to a uniform 100M fdx
> setting and have greatly improved the stability but there is still a problem
> with any client connected to the servers through the copper 100M segment. I
> have isolated the client hangs - by replacing the fileserver binary with the
> LWP version I can get all 3 of the fileserver machines which are attached to
> the fiber switch to perform apparently flawless transactions between each
> other as clients or AFS volume servers.
Well, that does imply that you have two sets of problems: one
thread-related and one network-related.
> As far as I can tell this seems to be a sign of a foobared state machine that
> misses some kind of synchronization between the client and server - most
> likely caused by the network segment issue trashing some of the packets from
> the server on the gigabit segment as they pass to the slower copper segment
> that lacks 802.3 flow control.
>
> Logs available at this URL:
> http://cumulonimbus.contrailservices.com/ted/hanglog_081204.txt
The fileserver tcpdump would be more useful at a higher verbosity; in
particular, tcpdump -s (whatever the MTU is) -vv will tell you which
RPCs are happening.
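For example, something along these lines (just a sketch: 1500 is a
guess at your MTU, and udp port 7000 assumes the standard fileserver
port; adjust both to your setup, and add a host filter for the client
in question if the trace gets noisy):

    tcpdump -s 1500 -vv udp port 7000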
The fstrace data gives no indication that anything is hung. There's no
large time gap between any of the events we see.
About the only relevant thing I can think of is that, some time ago,
Chaskiel observed a problem here where, if the latency was too *low*,
some calculation in Rx seemed not to work correctly. Sadly, I don't
remember and can't find the details now, and I don't know that we ever
addressed it. But I'd expect that's not your problem anyway, since it
works from other machines on a similar-speed network.