[OpenAFS] deadlock in OpenAFS 1.4.11 (Solaris 5.10)

Fri, 09 Apr 2010 16:57:37 -0500

Derrick Brashear writes:
] [...]
] John Boyland writes:
] <...>
] > ** Cache entry @ 0xa26da3f0 for 1.536875155.530.1418 [cs.uwm.edu]
] >    locks: (reader_waiting, write_locked(pid:17732 at:250), 2 waiters)
] >           26532 bytes  DV          197  refcnt     3
] >    callback 00000000   expires 1270781645
] >    1 opens     0 writers
] >    normal file
] >    states (0x0)
] 
] ok, that's the rdwr vnode op, so that makes some sense.
] 
] > ** Cache entry @ 0xa27651d0 for 1.536875155.608.1458 [cs.uwm.edu]
] >    locks: (upgrade_waiting, write_locked(pid:17679 at:66), 12 waiters)
] >        19094744 bytes  DV          109  refcnt    13
] >    callback 00000000   expires 1270782798
] >    0 opens     0 writers
] >    normal file
] >    states (0x0)
] 
] and 66 is GetDCache

] > The person who generated this process says they were running aprocess
] > that was writing debugging output redirected into a file and there was a
] > loop so the file got very large.
]
] hm. did it really hang, or just get very slow?

I was told that at first things got slow, but then it stopped
altogether.  They then moved to a different machine and read the file
from there.  Meanwhile, on the bad machine, ANY afs file action hangs.

] > ** Cache entry @ 0xa2710018 for 1.536875155.610.1499 [cs.uwm.edu]
] >    locks: (none_waiting, write_locked(pid:17889 at:250))
] >           23240 bytes  DV            1  refcnt     1
] >    callback 00000000   expires 1270782798
] >    1 opens     0 writers
] >    normal file
] >    states (0x0)
] > ** Cache entry @ 0xa26bf000 for 1.536873892.1.1 [cs.uwm.edu]
] >    locks: (writer_waiting, 7 read_locks(pid:18421), 41 waiters)
] >            2048 bytes  DV            8  refcnt    49
] >    callback 00000000   expires 1270774475
] >    0 opens     0 writers
] >    volume root
] >    states (0x4), read-only

This last one (with 41 waiters) is /afs:

good% /usr/afsws/bin/fs getfid /afs
File /afs (536873892.1.1) contained in volume 536873892

(on the bad machine, I can't touch /usr/afsws of course since it is a
link into /afs)

cmdebug now shows 71 waiters -- of course it increases whenever someone
touches afs and joins the tarbaby.

] Almost certainly a dcache lock (which cmdebug doesn't show, alas)
] involved here and causing this.
]
] I suppose the other thing which may help: can you collect fstrace for this?

bad# ./fstrace setset cm -active

                        [ during this time I touch /afs/not-here and freeze ]

bad# ./fstrace dump cm
AFS Trace Dump -

   Date: Fri Apr  9 16:50:14 2010

Found 1 logs.

Contents of log cmfx:

AFS Trace Dump - Completed
bad# ./fstrace setset cm -inactive

John Boyland