[OpenAFS] deadlock in OpenAFS 1.4.11 (Solaris 5.10)
John Tang Boyland
boyland@cs.uwm.edu
Fri, 09 Apr 2010 16:57:37 -0500
Derrick Brashear writes:
] [...]
] John Boyland writes:
] <...>
] > ** Cache entry @ 0xa26da3f0 for 1.536875155.530.1418 [cs.uwm.edu]
] > locks: (reader_waiting, write_locked(pid:17732 at:250), 2 waiters)
] > 26532 bytes DV 197 refcnt 3
] > callback 00000000 expires 1270781645
] > 1 opens 0 writers
] > normal file
] > states (0x0)
]
] ok, that's the rdwr vnode op, so that makes some sense.
]
] > ** Cache entry @ 0xa27651d0 for 1.536875155.608.1458 [cs.uwm.edu]
] > locks: (upgrade_waiting, write_locked(pid:17679 at:66), 12 waiters)
] > 19094744 bytes DV 109 refcnt 13
] > callback 00000000 expires 1270782798
] > 0 opens 0 writers
] > normal file
] > states (0x0)
]
] and 66 is GetDCache
] > The person who generated this process says they were running aprocess
] > that was writing debugging output redirected into a file and there was a
] > loop so the file got very large.
]
] hm. did it really hang, or just get very slow?
I was told that at first things got slow, but then it stopped
altogether. They then moved to a different machine and read the file
from there. Meanwhile, on the bad machine, ANY afs file action hangs.
] > ** Cache entry @ 0xa2710018 for 1.536875155.610.1499 [cs.uwm.edu]
] > locks: (none_waiting, write_locked(pid:17889 at:250))
] > 23240 bytes DV 1 refcnt 1
] > callback 00000000 expires 1270782798
] > 1 opens 0 writers
] > normal file
] > states (0x0)
] > ** Cache entry @ 0xa26bf000 for 1.536873892.1.1 [cs.uwm.edu]
] > locks: (writer_waiting, 7 read_locks(pid:18421), 41 waiters)
] > 2048 bytes DV 8 refcnt 49
] > callback 00000000 expires 1270774475
] > 0 opens 0 writers
] > volume root
] > states (0x4), read-only
This last one (with 41 waiters) is /afs:
good% /usr/afsws/bin/fs getfid /afs
File /afs (536873892.1.1) contained in volume 536873892
(on the bad machine, I can't touch /usr/afsws of course since it is a
link into /afs)
cmdebug now shows 71 waiters -- of course it increases whenever someone
touches afs and joins the tarbaby.
] Almost certainly a dcache lock (which cmdebug doesn't show, alas)
] involved here and causing this.
]
] I suppose the other thing which may help: can you collect fstrace for this?
bad# ./fstrace setset cm -active
[ during this time I touch /afs/not-here and freeze ]
bad# ./fstrace dump cm
AFS Trace Dump -
Date: Fri Apr 9 16:50:14 2010
Found 1 logs.
Contents of log cmfx:
AFS Trace Dump - Completed
bad# ./fstrace setset cm -inactive
John Boyland