[OpenAFS] RO replication hang on release

J Maynard Gelinas gelinas@lns.mit.edu
Thu, 21 Jul 2005 12:13:02 -0400 (EDT)


   Hello,

   I'm seeing a strange problem whereby a large RO volume replication
hangs at the same place during a vos release. I define "place" as the
"packetRead: 638361" from vos status on the receiving fileserver. The
hosting server with the RW volume has a local RO copy.  Replication to
another offsite fileserver succeeds properly.  The VolserLog on both the
sending and receiving fileservers reports only transaction ages, like so:

ctpraid1:/var/log/openafs# tail /var/log/openafs/VolserLog
Thu Jul 21 12:04:58 2005 trans 15 on volume 536870934 is older than 5340 seconds
Thu Jul 21 12:05:28 2005 trans 15 on volume 536870934 is older than 5370 seconds

[...]
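
For reference, the two log lines above are enough to work out when the
stalled transaction began: subtract the reported age from the timestamp
on each line. Below is a minimal Python sketch of that arithmetic. It
assumes the line format shown above; the function name and the regular
expression are illustrative only and are not part of OpenAFS.

```python
import re
from datetime import datetime, timedelta

# Assumed line format, taken from the VolserLog excerpt above.
LINE_RE = re.compile(
    r"^(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}) "
    r"trans (\d+) on volume (\d+) is older than (\d+)\s+seconds"
)

def transaction_start(line):
    """Return (trans_id, volume_id, approximate start time) for one line."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError("unrecognized VolserLog line: %r" % line)
    stamp = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
    age = timedelta(seconds=int(m.group(4)))
    return int(m.group(2)), int(m.group(3)), stamp - age

lines = [
    "Thu Jul 21 12:04:58 2005 trans 15 on volume 536870934 is older than 5340 seconds",
    "Thu Jul 21 12:05:28 2005 trans 15 on volume 536870934 is older than 5370 seconds",
]

for line in lines:
    trans, volume, start = transaction_start(line)
    print(trans, volume, start.isoformat())
```

Both lines work out to the same start time, 10:35:58, so this is one
transaction that has been open for roughly an hour and a half rather
than a series of retries.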

Note that small RO volumes like root.cell, root.afs, and root.user
replicate without problem. This is running OpenAFS 1.2.11 on Debian 3.0
(Woody). It has been running stably for a long time.

  Changes:

  This particular server stopped performing file services for about a
year, but remained online for db services during the interim. Then the
users for that host moved to a new building, which is now connected not
via the normal fiber connection but by microwave link. This microwave link
does drop packets and is slower than the old fiber link. After the move we
decided to turn on file services again and ran into this problem. Could
this be due to network dropouts? I find that unlikely given that it hangs
at the same packet count each time. On the other hand, the secondary RO
copy is hosted on an offsite fileserver connected via fiber and has no
trouble with replication.

Thanks for any suggestions or help, 
--Maynard