[OpenAFS] AFS performance problems at cell rl.ac.uk

Wheeler, JF (Jonathan) J.F.Wheeler@rl.ac.uk
Thu, 31 Jan 2008 11:01:15 -0000

We have recently been experiencing performance problems within our cell:
a user can wait several minutes to open or update a file. Checks (e.g.
uptime) on the server (we only have one; we know that it is old and slow,
and its network connection is only 100 Mbit/s) show no obvious problems,
but the AFS FileLog contains many messages of the form:

Thu Jan 31 08:53:59 2008 CB: Call back connect back failed (in break delayed) for 83eafe37.7001
Thu Jan 31 08:53:59 2008 BreakDelayedCallbacks FAILED for host 83eafe37 which IS UP.  Possible network or routing failure.
Thu Jan 31 08:53:59 2008 MultiProbe failed to find new address for

Having extracted today's records (31/1/2008) up to 08:53, I find that
there are 8897 of them. Converting the hexadecimal host identifiers in
these records to IP addresses and host names gives the following list:

83ea6a26 = barrow.math.uni-paderborn.de
83ea6ab6 = reynolds.math.uni-paderborn.de
83ea6ac6 = fitting.math.uni-paderborn.de
83ea6c23 = pizza.math.uni-paderborn.de
83ea70cb = yang.ifim.uni-paderborn.de
83eafe33 = albert.et.uni-paderborn.de
83eafe36 = ysabell.et.uni-paderborn.de
83eafe37 = ridcully.et.uni-paderborn.de
83eafe38 = esme.et.uni-paderborn.de
83eafe39 = gytha.et.uni-paderborn.de
83eafe3a = magrat.et.uni-paderborn.de
83eafe3b = teppic.et.uni-paderborn.de
83eafe3d = schelter.et.uni-paderborn.de
83eafe40 = detritus.et.uni-paderborn.de
83eafe41 = colon.et.uni-paderborn.de
83eafe42 = nobbs.et.uni-paderborn.de
83eafe43 = poons.et.uni-paderborn.de
83eafe45 = quirm.et.uni-paderborn.de
83eafe48 = champot.et.uni-paderborn.de
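The conversion from FileLog host identifiers to the addresses above can be sketched in a few lines of Python. This assumes that the eight hex digits are the client's IPv4 address written in network (big-endian) byte order; the function names afs_host_to_ip and reverse_lookup are hypothetical, and the reverse lookup depends on the PTR records published by the client's site.

```python
import ipaddress
import socket


def afs_host_to_ip(hex_host: str) -> str:
    """Convert a FileLog host id such as '83eafe37' to dotted-quad form.

    Assumption: the id is the IPv4 address as 8 hex digits in
    network (big-endian) byte order.
    """
    return str(ipaddress.IPv4Address(int(hex_host, 16)))


def reverse_lookup(ip: str) -> str:
    """Return the PTR host name for ip, or ip itself if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError
        return ip
```

For example, afs_host_to_ip("83eafe37") gives "131.234.254.55", which can then be passed to reverse_lookup to obtain the host name.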

Each of these addresses is referenced more than 100 times, and 3 of them
more than 1000 times (today alone). Is anyone from this site
(uni-paderborn.de) on the list? If so, would they please contact me so
that we can try to resolve this problem? In the past it has been traced
either to a site router problem or to a firewall that has closed access
to UDP port 7001 (the AFS client callback port). Any comments from other
people on the list would also be helpful.
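Per-host counts like those quoted above can be produced with a short script. This is a minimal sketch, assuming that each FileLog message occupies a single line (the wrapping in mail is not present in the log itself) and counting only the "BreakDelayedCallbacks FAILED" messages; count_failed_hosts and its threshold parameter are hypothetical.

```python
import re
from collections import Counter

# Matches the FileLog message quoted above and captures the hex host id.
FAILED_RE = re.compile(r"BreakDelayedCallbacks FAILED for host ([0-9a-f]{8})")


def count_failed_hosts(lines, threshold=100):
    """Return {host_id: count}, most frequent first, for hosts whose
    failure count exceeds threshold."""
    counts = Counter()
    for line in lines:
        m = FAILED_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return {host: n for host, n in counts.most_common() if n > threshold}
```

Passing an open FileLog (for example `count_failed_hosts(open(path))`, with `path` pointing at your server's log) returns the noisiest clients first, whose ids can then be converted to addresses as described above.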

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory