[OpenAFS] OpenAFS freeze problems

John Tang Boyland boyland@uwm.edu
Mon, 13 Feb 2012 12:31:53 -0600


Every few hours, AFS "freezes" on a write:
the write blocks for about 30 seconds.
The file server log has entries like:

Tue Dec 20 21:56:41 2011 FindClient: stillborn client 87649a0(1355d694); conn 87c0e18 (host 75.9.160.AAA:7001) had client 875b1a0(1355d694)
Tue Dec 20 21:56:41 2011 FindClient: stillborn client 875a438(1355d694); conn 87c0e18 (host 75.9.160.AAA:7001) had client 875b1a0(1355d694)
Tue Dec 20 21:57:34 2011 CB: RCallBackConnectBack failed for host 8635cc8 (76.199.152.BBB:57319)
Tue Dec 20 21:57:34 2011 CB: Call back connect back failed (in break delayed) for Host 76.199.152.BBB:57319
Tue Dec 20 21:57:34 2011 BreakDelayedCallbacks FAILED for host 76.199.152.BBB:57319 which IS UP.  Connection from 76.199.152.BBB:57319.  Possible network or routing failure.
Tue Dec 20 21:57:34 2011 MultiProbe failed to find new address for host 76.199.152.BBB:57319
Tue Dec 20 21:57:35 2011 CB: ProbeUuid for 184.158.83.CCC:12391 failed -01
Tue Dec 20 21:58:10 2011 CB: ProbeUuid for 129.89.10.DDD:44892 failed -01
Tue Dec 20 21:59:49 2011 CB: ProbeUuid for 75.35.48.EEE:63818 failed -01
Tue Dec 20 22:00:26 2011 CB: ProbeUuid for 72.135.206.FFF:5533 failed -01
Tue Dec 20 22:01:41 2011 CB: ProbeUuid for 184.158.83.CCC:12394 failed -01
Tue Dec 20 22:02:11 2011 CB: ProbeUuid for 129.89.10.DDD:46369 failed -01
Tue Dec 20 22:03:55 2011 CB: ProbeUuid for 75.35.48.EEE:58880 failed -01
Tue Dec 20 22:04:34 2011 CB: ProbeUuid for 72.135.206.FFF:5535 failed -01
Tue Dec 20 22:05:49 2011 CB: ProbeUuid for 184.158.83.CCC:12398 failed -01
Tue Dec 20 22:06:12 2011 CB: ProbeUuid for 129.89.10.DDD:30017 failed -01
Tue Dec 20 22:06:13 2011 FindClient: stillborn client 873cea0(e384ead8); conn 8812df0 (host 24.211.21.GGG:7001) had client 873f640(e384ead8)
Tue Dec 20 22:08:01 2011 CB: ProbeUuid for 75.35.48.EEE:62476 failed -01
Tue Dec 20 22:08:41 2011 CB: ProbeUuid for 72.135.206.FFF:5536 failed -01
Tue Dec 20 22:09:58 2011 CB: ProbeUuid for 184.158.83.CCC:12399 failed -01
Tue Dec 20 22:10:13 2011 CB: ProbeUuid for 129.89.10.DDD:19805 failed -01
Tue Dec 20 22:12:08 2011 CB: ProbeUuid for 75.35.48.EEE:55235 failed -01
Tue Dec 20 22:12:48 2011 CB: ProbeUuid for 72.135.206.FFF:5537 failed -01
Tue Dec 20 22:14:04 2011 CB: ProbeUuid for 184.158.83.CCC:12400 failed -01
Tue Dec 20 22:14:14 2011 CB: ProbeUuid for 129.89.10.DDD:32664 failed -01
Tue Dec 20 22:16:14 2011 CB: ProbeUuid for 75.35.48.EEE:58039 failed -01
Tue Dec 20 22:16:56 2011 CB: ProbeUuid for 72.135.206.FFF:5539 failed -01

(I've replaced the last byte of the IP addresses with AAA-GGG)
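
In the excerpt above, the same four addresses each fail a ProbeUuid roughly every four minutes. A minimal sketch of how that could be tallied from a log in this format follows. The function names are hypothetical and the regular expression is written only against the lines shown here, so other log variants may need adjusting.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches lines such as:
#   Tue Dec 20 21:57:35 2011 CB: ProbeUuid for 184.158.83.CCC:12391 failed -01
# The host part is kept as opaque text because the last byte is masked above.
PROBE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}) "
    r"CB: ProbeUuid for (?P<host>[^:\s]+):\d+ failed"
)


def probe_failures(lines):
    """Map each host to the list of times its ProbeUuid failed, in log order."""
    failures = defaultdict(list)
    for line in lines:
        m = PROBE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
            failures[m.group("host")].append(ts)
    return failures


def mean_gap_seconds(times):
    """Average spacing between consecutive failures, or None if fewer than two."""
    if len(times) < 2:
        return None
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)
```

Passing an open log file to probe_failures() and printing len(times) and mean_gap_seconds(times) per host gives a quick list of the clients that never answer, which can then be compared against the time of a freeze.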

(The freeze was at about 22:10.)
This is a Solaris 10 host running OpenAFS 1.4.12.

The same freeze behavior is also noticeable on a
Linux computer running 1.6.0.

This cell has about 40 students accessing files on three
servers from their laptops, which probably have firewalls that cause them
to ignore callback requests.  Unless the OpenAFS installation process
opens up port 7001 to outside access, there's basically nothing I can do about
this bad behavior.  And even if the firewall were configured correctly, the
laptop can be closed or taken offline at any time.

My guess is that the server's threads all get used up waiting for
callback breaks to be ack'ed and so the fileserver stops responding.
But is there something more I can do to find out why the freeze is
happening?  Is there some rxdebug command that I can run when a freeze
happens?
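
One way the thread-exhaustion guess might be checked during a freeze is to poll rxdebug against the fileserver port (7000) and look at its summary lines. The sketch below assumes the summary contains lines of the form "N calls waiting for a thread" and "N threads are idle", as in the rxdebug documentation. Some versions may omit one of them, so a missing line is reported as None. The function names are hypothetical.

```python
import re
import subprocess

# Assumed wording of the rxdebug summary lines; verify against real output.
WAITING_RE = re.compile(r"^(\d+) calls waiting for a thread", re.MULTILINE)
IDLE_RE = re.compile(r"^(\d+) threads are idle", re.MULTILINE)


def parse_thread_state(rxdebug_output):
    """Return (calls_waiting, idle_threads); a field is None if its line is absent."""
    waiting = WAITING_RE.search(rxdebug_output)
    idle = IDLE_RE.search(rxdebug_output)
    return (
        int(waiting.group(1)) if waiting else None,
        int(idle.group(1)) if idle else None,
    )


def sample_fileserver(host, port=7000):
    """Run rxdebug once against the fileserver and parse its summary."""
    result = subprocess.run(
        ["rxdebug", host, str(port), "-noconns"],  # -noconns: summary only
        capture_output=True,
        text=True,
        timeout=30,
        check=False,
    )
    return parse_thread_state(result.stdout)
```

If sample_fileserver() returned zero idle threads and a nonzero number of waiting calls while a write was blocked, that would be consistent with the guess above, though it would not by itself show that callback breaks are what the threads are waiting on.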

Is there a simple solution -- like tuning a parameter (more threads?)
that could make this behavior less common?

Thanks,
John