[OpenAFS-port-darwin] OpenAFS 1.6.1 on OS X 10.6/10.7/10.8: suspected deadlocks [resubmitted]

Jeffrey Altman jaltman@secure-endpoints.com
Sun, 15 Jul 2012 12:34:59 -0400


20 minutes is the idle data timeout on a non-replicated volume.  This
is the time period a client will wait for a file server to respond to 
an RPC
when the file server is instructing the client to wait for a response.

Your logging does not show the request that was issued to the file 
server
that has not been responded to and any messages sent by the file server 
in
response to the request.

The fact that Linux clients see a delay of 2 minutes when accessing the
same volume might imply that the problem is related to a failure of the
file server to communicate with the the OSX client's callback port or
that callback RPCs are deadlocking internally with the outstanding RPC
to the file server.

It would be helpful if you could obtain stack information for the OSX
system and provide tcpdump information observing the issuance of the
RPC that is not being responded to.

Jeffrey Altman


On Sunday, July 15, 2012 12:18:37 PM, Duncan S Kincaid wrote:
> We have been unable to deploy OpenAFS Client for OS X v. 1.6.1
> across OS X 10.6, 10.7, 10.8 owing to frequent spinning beach balls
> soon after user's login (where user's home directory is in AFS cell.)
>
> Observations:
> 1. spinning beach ball for *exactly* 20 minutes, then session returns to normal
> 2. can confirm user gets tokens and TGT at login (using LoginHook and LaunchAgent scripts)
> 3. ssh connections as root into the beach balling mac reveal following:
> 	a. no CPU load. All processes 'stuck' or 'sleeping' (apart from 'top', of course)
> 	b. virtually no system.log file entires from time of beach balling. at 20 minute mark lots of 'time out' errors logged.
> 	c. attempts to traverse afs directory tree results in shell locking up completely
> 4. tcpdumps show very little afs traffic**. mostly pings (rx ack) to servers/databases at rate of 2-5/sec. at the 20 minute mark, a barrage of traffic (rx data). (500-4000/sec)****. [see samples below]
> 5. local users (those whose home directories reside on computer hard drive) have never seen this problem
> 6. this problem occurs irrespective of particular user with home directory in AFS, AFS server storing his home directory, AFS cell,  and client computer OS.
> 7. 'Action At A Distance': when the beach ball spins on Macintosh, mounting the same user's AFS home from another computer (Linux box) takes 2 minutes as opposed to customary 1-2 seconds. subsequent Linux logins are normal.
> 8. though we have seen this issue with client version 1.6.1pre2, it is not *nearly* as frequent as with 1.6.1 (we do see more kernel panics at shutdown and restart with 1.6.1pre2, however).
>
> Frequency of Problems:
> 1. we find this occurs about 20% of the time a user logs in. much higher if shutdown or restart between login sessions.
> 2. we can increase the frequency of problem by having applications launch immediately at login (increase to 35-50% after reboot/shutdown).
> 3. we can increase the frequency by inducing reindexing of local drives (even with /afs on exclusion list). (increase to 35-50% after reboot/shutdown)
>
> any help/insights would be much appreciated!
> dk
>
>
> |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> duncan kincaid
> cron | mit school of architecture and planning
>
> ===============
>
> SAMPLE TCPDUMPS DURING BACHBALLING AND AFTER
>
> **
> 30 sec. SAMPLE TCPDUMP DURING BEACHBALL
> 16:11:07.163852 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx ack first 2 serial 0 reason ping (65)
> 16:11:07.163933 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 115 serial 1516 reason ping response acked 116-122 nacked 115 (73)
> 16:11:08.295321 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx ack first 1 serial 0 reason ping (65)
> 16:11:08.295325 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx ack first 1 serial 0 reason ping (65)
> 16:11:08.295405 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 1517 reason ping response (65)
> 16:11:08.295436 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 1518 reason ping response (65)
> 16:11:08.843293 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 0 reason ping (1472)
> 16:11:08.843364 IP dsk.mit.edu.afs3-callback > lethe.mit.edu.afs3-fileserver:  rx ack first 1 serial 0 reason ping (620)
> 16:11:08.844388 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx ack first 1 serial 88 reason ping response (65)
> 16:11:08.844479 IP lethe.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx ack first 1 serial 52 reason ping response (65)
> 16:11:09.653625 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx ack first 2 serial 0 reason ping (65)
> 16:11:09.653714 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 33 serial 1519 reason ping response acked 34-42,44-61 nacked 33,43 (94)
> 16:11:12.935558 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 0 reason ping (1472)
> 16:11:12.936700 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx ack first 1 serial 89 reason ping response (65)
> 16:11:16.082830 IP dsk.mit.edu.afs3-callback > asclepius.mit.edu.afs3-fileserver:  rx version (29)
> 16:11:16.082848 IP dsk.mit.edu.afs3-callback > mnemosyne.mit.edu.afs3-fileserver:  rx version (29)
> 16:11:16.082862 IP dsk.mit.edu.afs3-callback > atreus.mit.edu.afs3-fileserver:  rx version (29)
> 16:11:16.082874 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx version (29)
> 16:11:16.582913 IP dsk.mit.edu.afs3-callback > lethe.mit.edu.afs3-fileserver:  rx version (29)
> 16:11:17.183666 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx ack first 2 serial 0 reason ping (65)
> .....
> etc. [183 lines omitted]
> .....
> 16:11:28.324068 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 1526 reason ping response (65)
> 16:11:28.324099 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 1527 reason ping response (65)
> 16:11:28.874018 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx ack first 1 serial 0 reason ping (65)
> 16:11:28.874098 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 83 reason ping response (65)
> 16:11:28.874976 IP lethe.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx ack first 1 serial 0 reason ping (65)
> 16:11:28.875060 IP dsk.mit.edu.afs3-callback > lethe.mit.edu.afs3-fileserver:  rx ack first 1 serial 33 reason ping response (65)
> 16:11:29.144251 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 33 serial 0 reason ping acked 34-42,44-61 nacked 33,43 (94)
> 16:11:29.145283 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx ack first 2 serial 1949 reason ping response (65)
> 16:11:30.547991 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx ack first 1 serial 0 reason ping (65)
> 16:11:30.548040 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 1 serial 84 reason ping response (65)
> 16:11:36.582295 IP dsk.mit.edu.afs3-callback > asclepius.mit.edu.afs3-fileserver:  rx version (29)
> 16:11:36.582313 IP dsk.mit.edu.afs3-callback > mnemosyne.mit.edu.afs3-fileserver:  rx version (29)
> 16:11:36.582327 IP dsk.mit.edu.afs3-callback > atreus.mit.edu.afs3-fileserver:  rx version (29)
> 16:11:36.582340 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx version (29)
> 16:11:37.082434 IP dsk.mit.edu.afs3-callback > lethe.mit.edu.afs3-fileserver:  rx version (29)
>
> ****
> 1 sec. SAMPLE TCPDUMP *AFTER* 20 MINUTE BEACHBALL
> 16:21:35.339781 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx data fs call create-file fid 537854840/13077/877784 "WebpageIcons.db-journal" StoreStatus date 2012/07/14 16:21:35 group 101 mode 644 (96)
> 16:21:35.340962 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx data fs reply create-file new fid 537854840/45892/880977 (244)
> 16:21:35.341038 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 2 serial 0 reason delay (65)
> 16:21:35.341155 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx data fs call create-file fid 537854840/13077/877784 "._WebpageIcons.db-journal" StoreStatus date 2012/07/14 16:21:35 group 101 mode 666 (100)
> 16:21:35.342127 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx data fs reply create-file new fid 537854840/45908/880978 (244)
> 16:21:35.342187 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 2 serial 0 reason delay (65)
> 16:21:35.342363 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx data fs call setlock fid 537854840/45908/880978 (48)
> 16:21:35.343207 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx data fs reply setlock (52)
> 16:21:35.343261 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 2 serial 0 reason delay (65)
> 16:21:35.400989 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx data fs call store-data-64 fid 537854840/45908/880978 StoreStatus date 2012/07/14 16:21:35 fsync offset 0 length 4096 flen 4096 (1444)
> 16:21:35.401981 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx ack first 1 serial 10 reason ack requested acked 1 (66)
> 16:21:35.402039 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx data (1444)
> 16:21:35.402058 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx data (1356)
> 16:21:35.403055 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx ack first 3 serial 11 reason ack requested (65)
> 16:21:35.403060 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx data fs reply store-data-64 (136)
> 16:21:35.403125 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 2 serial 0 reason delay (65)
> 16:21:35.403160 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx data fs call rellock fid 537854840/45908/880978 (44)
> 16:21:35.404035 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx data fs reply rellock (52)
> 16:21:35.404077 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 2 serial 0 reason delay (65)
> .....
> etc. [410 lines omitted]
> .....
> 16:21:35.935591 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 4 serial 252 reason idle acked 4-25 (87)
> 16:21:35.935679 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx data (1444)
> 16:21:35.935683 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx data (1444)
> 16:21:35.935687 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx data (1444)
> 16:21:35.935690 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx data (1444)
> 16:21:35.935694 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx data (1444)
> 16:21:35.935697 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx data (1444)
> 16:21:35.935738 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 4 serial 254 reason idle acked 4-27 (89)
> 16:21:35.935769 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 4 serial 256 reason idle acked 4-29 (91)
> 16:21:35.935799 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 4 serial 258 reason idle acked 4-31 (93)
> 16:21:35.936190 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx data (1444)
> 16:21:35.936193 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx data (1444)
> 16:21:35.936199 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx data (1444)
> 16:21:35.936208 IP hades.mit.edu.afs3-fileserver > dsk.mit.edu.afs3-callback:  rx data (1444)
> 16:21:35.936249 IP dsk.mit.edu.afs3-callback > hades.mit.edu.afs3-fileserver:  rx ack first 4 serial 260 reason idle acked 4-33 (95)
>
>
>
>
>