[OpenAFS] Heavy performance loss on gigabit ethernet
Derrick J Brashear
shadow@dementia.org
Mon, 16 Aug 2004 15:45:27 -0400 (EDT)
On Thu, 12 Aug 2004, Theodore F Vaida wrote:
> I took the affected network segments and forced them to a uniform 100M fdx
> setting and have greatly improved the stability but there is still a problem
> with any client connected to the servers through the copper 100M segment. I
> have isolated the client hangs - by replacing the fileserver binary with the
> LWP version I can get all 3 of the fileserver machines which are attached to
> the fiber switch to perform apparently flawless transactions between each
> other as clients or AFS volume servers.
Well, that does imply that you have two sets of problems: one
thread-related and one network-related.
> As far as I can tell this seems to be a sign of a foobared state machine that
> misses some kind of synchronization between the client and server - most
> likely caused by the network segment issue trashing some of the packets from
> the server on the gigabit segment as they pass to the slower copper segment
> that lacks 802.3 flow control.
>
> Logs available at this URL:
> http://cumulonimbus.contrailservices.com/ted/hanglog_081204.txt
The fileserver tcpdump would be more useful at a higher verbosity; in
particular, tcpdump -s (whatever the MTU is) -vv will tell you which
RPCs are happening.
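For example, something along these lines (just a sketch: 1500 is a
guess at your MTU, and udp port 7000 assumes the standard fileserver
port; adjust both to your setup, and add a host filter for the client
in question if the trace gets noisy):

    tcpdump -s 1500 -vv udp port 7000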
The fstrace data gives no indication that anything is hung. There's no
large time gap between any of the events we see.
About the only relevant thing I can think of is that, some time ago,
Chaskiel observed a problem here where, if the latency was too *low*,
some calculation in Rx seemed not to work correctly. Sadly, I don't
remember and can't find the details now, and I don't know that we ever
addressed it. But I'd expect that's not your problem anyway, since it
works from other machines on a similar-speed network.