[OpenAFS-port-darwin] OpenAFS 1.6.1 on OS X 10.6/10.7/10.8: suspected deadlocks [resubmitted]

Duncan S Kincaid dsk@MIT.EDU
Sun, 15 Jul 2012 16:18:37 +0000


--Apple-Mail-17--1010302619
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=us-ascii

We have been unable to deploy OpenAFS Client for OS X v. 1.6.1
across OS X 10.6, 10.7, 10.8 owing to frequent spinning beach balls
soon after user's login (where user's home directory is in AFS cell.)

Observations:
1. spinning beach ball for *exactly* 20 minutes, then session returns to =
normal
2. can confirm user gets tokens and TGT at login (using LoginHook and =
LaunchAgent scripts)
3. ssh connections as root into the beach balling mac reveal following:
	a. no CPU load. All processes 'stuck' or 'sleeping' (apart from =
'top', of course)
	b. virtually no system.log file entires from time of beach =
balling. at 20 minute mark lots of 'time out' errors logged.
	c. attempts to traverse afs directory tree results in shell =
locking up completely
4. tcpdumps show very little afs traffic**. mostly pings (rx ack) to =
servers/databases at rate of 2-5/sec. at the 20 minute mark, a barrage =
of traffic (rx data). (500-4000/sec)****. [see samples below]
5. local users (those whose home directories reside on computer hard =
drive) have never seen this problem
6. this problem occurs irrespective of particular user with home =
directory in AFS, AFS server storing his home directory, AFS cell,  and =
client computer OS.
7. 'Action At A Distance': when the beach ball spins on Macintosh, =
mounting the same user's AFS home from another computer (Linux box) =
takes 2 minutes as opposed to customary 1-2 seconds. subsequent Linux =
logins are normal.
8. though we have seen this issue with client version 1.6.1pre2, it is =
not *nearly* as frequent as with 1.6.1 (we do see more kernel panics at =
shutdown and restart with 1.6.1pre2, however).

Frequency of Problems:
1. we find this occurs about 20% of the time a user logs in. much higher =
if shutdown or restart between login sessions.
2. we can increase the frequency of problem by having applications =
launch immediately at login (increase to 35-50% after reboot/shutdown).
3. we can increase the frequency by inducing reindexing of local drives =
(even with /afs on exclusion list). (increase to 35-50% after =
reboot/shutdown)

any help/insights would be much appreciated!
dk


=
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||=
|||||||
duncan kincaid
cron | mit school of architecture and planning

=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D

SAMPLE TCPDUMPS DURING BACHBALLING AND AFTER

**
30 sec. SAMPLE TCPDUMP DURING BEACHBALL
16:11:07.163852 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx ack first 2 serial 0 reason ping (65)
16:11:07.163933 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 115 serial 1516 reason ping =
response acked 116-122 nacked 115 (73)
16:11:08.295321 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx ack first 1 serial 0 reason ping (65)
16:11:08.295325 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx ack first 1 serial 0 reason ping (65)
16:11:08.295405 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 1517 reason ping =
response (65)
16:11:08.295436 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 1518 reason ping =
response (65)
16:11:08.843293 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 0 reason ping =
(1472)
16:11:08.843364 IP dsk.mit.edu.afs3-callback > =
lethe.mit.edu.afs3-fileserver:  rx ack first 1 serial 0 reason ping =
(620)
16:11:08.844388 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx ack first 1 serial 88 reason ping =
response (65)
16:11:08.844479 IP lethe.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx ack first 1 serial 52 reason ping =
response (65)
16:11:09.653625 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx ack first 2 serial 0 reason ping (65)
16:11:09.653714 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 33 serial 1519 reason ping =
response acked 34-42,44-61 nacked 33,43 (94)
16:11:12.935558 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 0 reason ping =
(1472)
16:11:12.936700 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx ack first 1 serial 89 reason ping =
response (65)
16:11:16.082830 IP dsk.mit.edu.afs3-callback > =
asclepius.mit.edu.afs3-fileserver:  rx version (29)
16:11:16.082848 IP dsk.mit.edu.afs3-callback > =
mnemosyne.mit.edu.afs3-fileserver:  rx version (29)
16:11:16.082862 IP dsk.mit.edu.afs3-callback > =
atreus.mit.edu.afs3-fileserver:  rx version (29)
16:11:16.082874 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx version (29)
16:11:16.582913 IP dsk.mit.edu.afs3-callback > =
lethe.mit.edu.afs3-fileserver:  rx version (29)
16:11:17.183666 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx ack first 2 serial 0 reason ping (65)
.....
etc. [183 lines omitted]
.....
16:11:28.324068 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 1526 reason ping =
response (65)
16:11:28.324099 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 1527 reason ping =
response (65)
16:11:28.874018 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx ack first 1 serial 0 reason ping (65)
16:11:28.874098 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 83 reason ping =
response (65)
16:11:28.874976 IP lethe.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx ack first 1 serial 0 reason ping (65)
16:11:28.875060 IP dsk.mit.edu.afs3-callback > =
lethe.mit.edu.afs3-fileserver:  rx ack first 1 serial 33 reason ping =
response (65)
16:11:29.144251 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 33 serial 0 reason ping =
acked 34-42,44-61 nacked 33,43 (94)
16:11:29.145283 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx ack first 2 serial 1949 reason ping =
response (65)
16:11:30.547991 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx ack first 1 serial 0 reason ping (65)
16:11:30.548040 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 84 reason ping =
response (65)
16:11:36.582295 IP dsk.mit.edu.afs3-callback > =
asclepius.mit.edu.afs3-fileserver:  rx version (29)
16:11:36.582313 IP dsk.mit.edu.afs3-callback > =
mnemosyne.mit.edu.afs3-fileserver:  rx version (29)
16:11:36.582327 IP dsk.mit.edu.afs3-callback > =
atreus.mit.edu.afs3-fileserver:  rx version (29)
16:11:36.582340 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx version (29)
16:11:37.082434 IP dsk.mit.edu.afs3-callback > =
lethe.mit.edu.afs3-fileserver:  rx version (29)

****
1 sec. SAMPLE TCPDUMP *AFTER* 20 MINUTE BEACHBALL
16:21:35.339781 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx data fs call create-file fid =
537854840/13077/877784 "WebpageIcons.db-journal" StoreStatus date =
2012/07/14 16:21:35 group 101 mode 644 (96)
16:21:35.340962 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx data fs reply create-file new fid =
537854840/45892/880977 (244)
16:21:35.341038 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 2 serial 0 reason delay =
(65)
16:21:35.341155 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx data fs call create-file fid =
537854840/13077/877784 "._WebpageIcons.db-journal" StoreStatus date =
2012/07/14 16:21:35 group 101 mode 666 (100)
16:21:35.342127 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx data fs reply create-file new fid =
537854840/45908/880978 (244)
16:21:35.342187 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 2 serial 0 reason delay =
(65)
16:21:35.342363 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx data fs call setlock fid =
537854840/45908/880978 (48)
16:21:35.343207 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx data fs reply setlock (52)
16:21:35.343261 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 2 serial 0 reason delay =
(65)
16:21:35.400989 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx data fs call store-data-64 fid =
537854840/45908/880978 StoreStatus date 2012/07/14 16:21:35 fsync offset =
0 length 4096 flen 4096 (1444)
16:21:35.401981 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx ack first 1 serial 10 reason ack =
requested acked 1 (66)
16:21:35.402039 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx data (1444)
16:21:35.402058 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx data (1356)
16:21:35.403055 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx ack first 3 serial 11 reason ack =
requested (65)
16:21:35.403060 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx data fs reply store-data-64 (136)
16:21:35.403125 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 2 serial 0 reason delay =
(65)
16:21:35.403160 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx data fs call rellock fid =
537854840/45908/880978 (44)
16:21:35.404035 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx data fs reply rellock (52)
16:21:35.404077 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 2 serial 0 reason delay =
(65)
.....
etc. [410 lines omitted]
.....
16:21:35.935591 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 4 serial 252 reason idle =
acked 4-25 (87)
16:21:35.935679 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx data (1444)
16:21:35.935683 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx data (1444)
16:21:35.935687 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx data (1444)
16:21:35.935690 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx data (1444)
16:21:35.935694 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx data (1444)
16:21:35.935697 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx data (1444)
16:21:35.935738 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 4 serial 254 reason idle =
acked 4-27 (89)
16:21:35.935769 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 4 serial 256 reason idle =
acked 4-29 (91)
16:21:35.935799 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 4 serial 258 reason idle =
acked 4-31 (93)
16:21:35.936190 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx data (1444)
16:21:35.936193 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx data (1444)
16:21:35.936199 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx data (1444)
16:21:35.936208 IP hades.mit.edu.afs3-fileserver > =
dsk.mit.edu.afs3-callback:  rx data (1444)
16:21:35.936249 IP dsk.mit.edu.afs3-callback > =
hades.mit.edu.afs3-fileserver:  rx ack first 4 serial 260 reason idle =
acked 4-33 (95)






--Apple-Mail-17--1010302619
Content-Disposition: attachment; filename="smime.p7s"
Content-Type: application/pkcs7-signature; name="smime.p7s"
Content-Transfer-Encoding: base64

MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIDOzCCAzcw
ggKgoAMCAQICEFjj2TdkiNlbe52rM4gAfTowDQYJKoZIhvcNAQEFBQAwbDELMAkGA1UEBhMCVVMx
FjAUBgNVBAgTDU1hc3NhY2h1c2V0dHMxLjAsBgNVBAoTJU1hc3NhY2h1c2V0dHMgSW5zdGl0dXRl
IG9mIFRlY2hub2xvZ3kxFTATBgNVBAsTDENsaWVudCBDQSB2MTAeFw0xMTA3MzExMzAxMTVaFw0x
MjA3MzExMzAxMTVaMIGjMQswCQYDVQQGEwJVUzEWMBQGA1UECBMNTWFzc2FjaHVzZXR0czEuMCwG
A1UEChMlTWFzc2FjaHVzZXR0cyBJbnN0aXR1dGUgb2YgVGVjaG5vbG9neTEVMBMGA1UECxMMQ2xp
ZW50IENBIHYxMRkwFwYDVQQDExBEdW5jYW4gUyBLaW5jYWlkMRowGAYJKoZIhvcNAQkBFgtkc2tA
TUlULkVEVTCBnzANBgkqhkiG9w0BAQEFAAOBjQAwgYkCgYEApo+T7gVsGtZunTev59LAKvGTLPr2
u/TEqJe8mgne+RnYeAKakwbLSxAScgy7s/COZ5hS4dMbABvNePBgHP6H89T8+AhXlpwzi973TyQZ
1r8neWR+apLPrJFdpbx6nbuBUd+hXaGu4OrkDLKPDS5YIXCZOa6u2SH7ACXe1gkfo7MCAwEAAaOB
oTCBnjAJBgNVHRMEAjAAMBEGCWCGSAGG+EIBAQQEAwIFoDAdBgNVHSUEFjAUBggrBgEFBQcDBAYI
KwYBBQUHAwIwCwYDVR0PBAQDAgXgMB0GA1UdDgQWBBQv3WEPxE4Pg7BO2VGm6o71Pf1xezAzBgNV
HR8ELDAqMCigJqAkhiJodHRwOi8vY2EubWl0LmVkdS9jYS9taXRjbGllbnQuY3JsMA0GCSqGSIb3
DQEBBQUAA4GBAKyaOu2fV0mDzhNg2e8zC19p/WaNypNngBd0uehFCXjiMievHoLFTrD3ArhLHvfS
N+o3ZWBQoOJ4L1tIgNFy2TQVafDvw/2JJPo4ox62hh4x3IBn56py1cBGCbHu7tTLxohW55zUvj+G
i8jx2x+D2IsS2xziSJiLw1087udW9P9vMYICsjCCAq4CAQEwgYAwbDELMAkGA1UEBhMCVVMxFjAU
BgNVBAgTDU1hc3NhY2h1c2V0dHMxLjAsBgNVBAoTJU1hc3NhY2h1c2V0dHMgSW5zdGl0dXRlIG9m
IFRlY2hub2xvZ3kxFTATBgNVBAsTDENsaWVudCBDQSB2MQIQWOPZN2SI2Vt7nasziAB9OjAJBgUr
DgMCGgUAoIIBhzAYBgkqhkiG9w0BCQMxCwYJKoZIhvcNAQcBMBwGCSqGSIb3DQEJBTEPFw0xMjA3
MTUxNjE4MzdaMCMGCSqGSIb3DQEJBDEWBBR9r1GufRVxbBjI/JCUBbDSFqvrbTCBkQYJKwYBBAGC
NxAEMYGDMIGAMGwxCzAJBgNVBAYTAlVTMRYwFAYDVQQIEw1NYXNzYWNodXNldHRzMS4wLAYDVQQK
EyVNYXNzYWNodXNldHRzIEluc3RpdHV0ZSBvZiBUZWNobm9sb2d5MRUwEwYDVQQLEwxDbGllbnQg
Q0EgdjECEFjj2TdkiNlbe52rM4gAfTowgZMGCyqGSIb3DQEJEAILMYGDoIGAMGwxCzAJBgNVBAYT
AlVTMRYwFAYDVQQIEw1NYXNzYWNodXNldHRzMS4wLAYDVQQKEyVNYXNzYWNodXNldHRzIEluc3Rp
dHV0ZSBvZiBUZWNobm9sb2d5MRUwEwYDVQQLEwxDbGllbnQgQ0EgdjECEFjj2TdkiNlbe52rM4gA
fTowDQYJKoZIhvcNAQEBBQAEgYCiSLe1Efgn2jCXYQVuhgmW5zv9Jbg4WUp8LtjuqwV3qAOxCIAG
56/AbB+5Z0eicq/RrTVejaZkSQ3tLGvU9otO9CQ0g5n6fviAyzkhy4uihBHEu5eTDFT9rAW7V5Rl
0eS9zc5jbWEpk8jAiidqfyPRmmfKlziEh/EHPhPIXbD9yQAAAAAAAA==

--Apple-Mail-17--1010302619--