[OpenAFS] nightly failure since upgrading to 1.6.5

Peter Grandi pg@afs.list.sabi.co.UK
Mon, 10 Feb 2014 13:00:48 +0000


> Every night at midnight, we run 'vos backupsys'. For three
> nights in a row, on one of the servers I've upgraded to 1.6.5
> and dafs, I've been getting the following errors, and it
> mostly stops being a fileserver.

[ ... ]
> Sun Feb  9 00:00:03 2014 SYNC_getCom:  error receiving command
> Sun Feb  9 00:00:03 2014 FSYNC_com:  read failed; dropping connection (cnt=493489)
> Sun Feb  9 00:00:03 2014 _VLockFd: conflicting lock held on fd 225, offset 538046785 by pid 4129 (locktype=1)
> Sun Feb  9 00:00:03 2014 VAttachVolume: another program has vol 538046785 locked
> Sun Feb  9 00:00:03 2014 VPreattachVolumeByVp_r: volume 538046785 not in quiescent state (state 2 flags 0x18)
[ ... ]
> Sun Feb  9 00:00:03 2014 1 Volser: Clone: Recloning volume 538046785 to volume 538046787
> Sun Feb  9 00:00:03 2014 SYNC_ask:  length field in response inconsistent on circuit 'FSSYNC'
> Sun Feb  9 00:00:03 2014 SYNC_ask: protocol communications failure on circuit 'FSSYNC'; attempting reconnect to server
[ ... ]

That " _VLockFd: conflicting lock held" and "VAttachVolume:
another program has vol NNNN locked" looks vaguely familiar, and
in a case that I have seen it was because a DB server was
offline, and 'vos' took a very very long time to switch to an
online one. But this was with 1.4 and supposedly 1.6 should have
a shorter timeout.

In another case that vaguely resembles this there was a race
between creating a clone and registering it in the VLDB:

  http://rt.central.org/rt/Ticket/Display.html?id=131797

It would be interesting to know what processes 21378, 4129, 7155
were doing and why they held a lock on the RW original.