[OpenAFS] Re: vos release running unexpectedly long

Wed, 11 Sep 2019 17:07:40 +1000

On Wed, Sep 11, 2019 at 10:38:26AM +1000, Ian Wienand wrote:
> On Fri, Aug 30, 2019 at 12:35:09PM +1000, Ian Wienand wrote:
> > I'm struggling to find an angle to debug very long "vos release" times
> > with some of our volumes.

> So some more details ... I've managed to take out any write updates
> from the equation, but the volume with no updates still takes quite a
> long time to release.

> Ergo the "vos release" of the volume with no changes has resulted in
> about 50gb of data being sent to the R/O mirror, and consequently
> long release times.

To follow up on this: auristor was very helpful on IRC (thanks
again!), and indeed this "vos release" *was* transferring an
unexpectedly large amount of data.

The conclusion reached was that, when requesting incremental updates,
the R/O release "backtracks" to 15 minutes before the "Last Updated"
time of the R/W volume, to avoid issues with clock skew across hosts.

In our situation, the last volume update was a very large pull from
the upstream mirror (this happens with new distros, big rebuilds, etc.).
The *next* vos release (the one I documented trying in my prior mail)
does do an incremental update, but from 15 minutes before the last
update; in our case that is basically the whole mirror pull again.
This means our cron jobs pull lots of data, take a long time and hit
timeouts; the aborted releases leave volumes locked, which creates a
vicious cycle of even more data to pull next time.

Indeed, while successive "vos release" runs would each pull all 50GB,
after touching a file in the root directory and waiting, the next
"vos release" completed in seconds.
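
To make the window concrete, here is a simplified Python model of the
behaviour as described in this mail.  It is not OpenAFS code: the
function names are made up, the 15 minute interval is assumed to be
fixed, and the real volserver logic may differ in detail.

```python
from datetime import datetime, timedelta

# Backtrack interval as described above; assumed fixed for this model.
BACKTRACK = timedelta(minutes=15)


def window_start(last_updated):
    """Time from which an incremental release re-sends changes."""
    return last_updated - BACKTRACK


def resent(change_times, last_updated):
    """Changes at or after the window start get sent again."""
    start = window_start(last_updated)
    return [t for t in change_times if t >= start]


# A big sync finishing at 12:00, versus the same sync followed by a
# trivial touch at 12:20.
sync = [datetime(2019, 9, 11, 11, 50), datetime(2019, 9, 11, 12, 0)]
touch = datetime(2019, 9, 11, 12, 20)

print(len(resent(sync, last_updated=sync[-1])))         # both sync writes
print(len(resent(sync + [touch], last_updated=touch)))  # only the touch
```

In the first case the window reaches back into the sync; in the
second it starts at 12:05, so only the touch falls inside it.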

The solution suggested is a 15+ minute sleep and then a trivial update
to the volume.  This ensures that *next time* you release, you only
backtrack into one trivial update and don't risk pulling much more
data than required.  I implemented this in our scripts with [1].
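
A rough Python sketch of that idea follows.  This is not the change
in [1]; the marker file, the 20 minute margin and the function names
are assumptions for illustration, and the only vos invocation used is
"vos release -id <volume>".

```python
import subprocess
import time
from pathlib import Path

# Assumed margin: anything comfortably over the 15 minute backtrack.
SETTLE_SECONDS = 20 * 60


def settle_and_touch(marker, sleep=time.sleep, settle=SETTLE_SECONDS):
    """Wait out the backtrack window, then make one trivial update."""
    sleep(settle)
    # Writing a timestamp is enough to bump the volume's update time.
    Path(marker).write_text(f"{time.time()}\n")


def release(volume, run=subprocess.run):
    """Release the volume; check=True raises if vos exits non-zero."""
    run(["vos", "release", "-id", volume], check=True)
```

After the mirror sync finishes, a job would call settle_and_touch()
on a file inside the R/W volume and then release().  The sleep and
run parameters exist only so the functions can be exercised without
waiting or having vos installed.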

For completeness, I have captured a run of the mirror rsync and
extracted the file server audit logs for that run in [2].  However, I
think rsync touching too many files is a red herring.

The other thing suggested was that timeouts are best worked around by
using "-localauth" to do the vos release somewhere where it won't
time out.  remctl was suggested [3] and is apparently commonly used
for this purpose.
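
As a sketch of how the two halves might fit together: the client
asks a server over remctl to run the release, and the server runs vos
with "-localauth".  The remctl command and subcommand names here
("vos", "release") are hypothetical and depend entirely on how the
server's remctl configuration is written; only the general client
form "remctl host command subcommand args" and the vos "-localauth"
flag are taken as given.

```python
def server_release_cmd(volume):
    """Command the server side would run, using the local KeyFile."""
    return ["vos", "release", "-id", volume, "-localauth"]


def client_remctl_cmd(host, volume):
    """Client request; 'vos release' is an assumed remctl mapping."""
    return ["remctl", host, "vos", "release", volume]
```

Either list can be passed to subprocess.run() as-is.  Because the
release then runs on the server itself, the client's token lifetime
no longer matters for a long transfer.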

Thanks for the input,

-i

[1] https://review.opendev.org/#/c/681367
[2] http://people.redhat.com/~iwienand/fedora-mirror-11-09-2019.tar.gz
[3] https://www.eyrie.org/~eagle/software/remctl/