[OpenAFS] Heavy performance loss on gigabit ethernet

Systems Administration sysadmin@contrailservices.com
Thu, 12 Aug 2004 17:08:32 -0600


> Now, the test between 1000Mb/s and 100Mb/s machines:
> | [ensc@A]$ iperf -c C -b 1000M -ud
> | [ ID] Interval       Transfer     Bandwidth
> | [  4]  0.0-10.0 sec   622 MBytes   522 Mbits/sec
> | [  3]  0.0-10.0 sec   114 MBytes  95.6 Mbits/sec  0.297 ms    0/81287 (0%)
> | [  4] Server Report:
> | [  4]  0.0-10.2 sec   114 MBytes  93.5 Mbits/sec  15.142 ms 362232/443628 (82%)
> | [  4] Sent 443628 datagrams
>
> This is expected: server A sends with full gigabit-speed and lots of
> UDP packages will be dropped as client is 100Mb/s only. Therefore, the
> network itself seems to be ok.

How much do you lose when you test at the 100Mb/s speed of the client? 
If you can't get 100% through at the maximum speed of the client, then 
there might be an issue there.
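
As a rough sanity check on the quoted server report, the loss figure can 
be recomputed from the datagram counts and compared with what the two 
bandwidth figures alone would predict. This is only a sketch in Python; 
the function names are my own and are not part of iperf.

```python
def loss_percent(lost, sent):
    """Datagram loss as a percentage of datagrams sent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * lost / sent


def expected_loss_percent(offered_mbps, delivered_mbps):
    """Loss implied when a sender offers more than the path delivers.

    Assumes all datagrams are the same size, so the loss fraction
    equals the fraction of offered bandwidth that did not arrive.
    """
    if offered_mbps <= 0:
        raise ValueError("offered_mbps must be positive")
    return max(0.0, 100.0 * (1.0 - delivered_mbps / offered_mbps))


# Figures taken from the quoted A -> C server report.
print(round(loss_percent(362232, 443628)))      # counted loss
print(round(expected_loss_percent(522, 93.5)))  # loss implied by bandwidth
```

If both numbers agree, the drops are explained by the speed mismatch 
alone. A rerun with the offered rate capped near 100Mb/s should then 
show loss close to zero.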

> As this test corresponds to the slow AFS performance (fileserver A sends
> large file to client C), something must be wrong with AFS.

This may be related to my problem with clients hanging. I had 
contemplated this previously but discarded it, since the AFS 
protocol should recover from a bad or missing UDP packet. Enrico's 
question raises the issue, though: how does the AFS protocol recover 
when the pipe from server to client is lossy? Is the client responsible 
for recovering, and could a maladjusted network segment that drops a 
high percentage of packets be responsible for the hangs?
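
To get a feel for what heavy loss costs any protocol that resends lost 
packets, here is a back-of-the-envelope sketch in Python. It assumes 
each packet is lost independently with a fixed probability and is resent 
until it gets through. That is a simplification of my own, not a 
description of how AFS actually recovers.

```python
def expected_transmissions(loss_rate):
    """Average number of sends needed per delivered packet.

    Assumes independent losses with probability loss_rate and
    retransmission until success (a geometric distribution), which
    is an idealized model and not taken from the AFS code.
    """
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")
    return 1.0 / (1.0 - loss_rate)


# Loss rates to compare; 0.82 is the figure from the quoted iperf run.
for rate in (0.0, 0.10, 0.50, 0.82):
    print(f"{rate:.0%} loss: {expected_transmissions(rate):.2f} sends per packet")
```

Under this model, the 82% loss from the quoted test means each packet 
has to be sent between five and six times on average, before counting 
any retransmit timeouts.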

I have been trying to figure out how to engage 802.3 flow control on 
the segment between my Gigabit backbone and the clients that are 
experiencing hangups, but I believe that one of my switches is not able 
to support back-pressure. As a result, the server seems to flood past 
the available bandwidth, causing a critical loss of synchronization 
between the endpoints. Similar fubars are occurring with other UDP 
protocols, which suggests a common cause.

I'll experiment with forcing the network to a uniform 100Mb/s speed and 
report back. In the meantime, can any of the wizards here comment on 
whether this is something that could be investigated? Suggestions 
on where to look in the debug logs and code would be helpful. This 
thread could also be moved over to the -devel list if you think that is 
appropriate and not a waste of time.

Ted