[OpenAFS] Re: vos release running unexpectedly long
Ian Wienand
iwienand@redhat.com
Wed, 11 Sep 2019 17:07:40 +1000
On Wed, Sep 11, 2019 at 10:38:26AM +1000, Ian Wienand wrote:
> On Fri, Aug 30, 2019 at 12:35:09PM +1000, Ian Wienand wrote:
> > I'm struggling to find an angle to debug very long "vos release" times
> > with some of our volumes.
> So some more details ... I've managed to take out any write updates
> from the equation, but the volume with no updates still takes quite a
> long time to release.
> Ergo the "vos release" of the volume with no changes has resulted in
> about 50gb of data being sent to the R/O mirror, and consequently
> long release times.
To follow up on this: auristor was very helpful on IRC (thanks
again!), and indeed this "vos release" *was* transferring an
unexpectedly large amount of data.
The conclusion reached was that, when requesting incremental updates,
the R/O release "backtracks" to 15 minutes before the "Last Updated"
time of the R/W volume, to avoid issues with clock skew across
hosts.
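A toy model of that backtracking may make the effect clearer. This is
only a sketch of the behaviour as described above, not OpenAFS code;
the names incremental_cutoff and changes_resent and the sample times
are made up for illustration.

```python
from datetime import datetime, timedelta

# Window taken from the 15 minutes described above; treat it as an assumption.
BACKTRACK = timedelta(minutes=15)


def incremental_cutoff(last_updated):
    """Return the time from which an incremental release resends changes."""
    return last_updated - BACKTRACK


def changes_resent(change_times, last_updated):
    """List the names whose change time falls at or after the cutoff."""
    cutoff = incremental_cutoff(last_updated)
    return sorted(name for name, when in change_times.items() if when >= cutoff)


# A big mirror pull finishes at 12:00 and is also the volume's last update.
pull = {"big-mirror-pull": datetime(2019, 9, 11, 12, 0)}
print(changes_resent(pull, last_updated=datetime(2019, 9, 11, 12, 0)))

# A trivial touch at 12:20 moves the cutoff to 12:05, past the big pull.
touched = dict(pull, **{"touched-file": datetime(2019, 9, 11, 12, 20)})
print(changes_resent(touched, last_updated=datetime(2019, 9, 11, 12, 20)))
```

In the first case the cutoff is 11:45, so the whole pull falls inside
the window; in the second only the touched file does.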
In our situation, the last volume update was a very large pull from
the upstream mirror (it happens with new distros, big rebuilds, etc).
Then the *next* vos release (the one I documented trying in prior
mail) does do incremental updates, but from 15 minutes before the
last update -- in our case this would be basically the whole mirror
pull, again. This means in our cron jobs we are pulling lots of data,
taking lots of time, and hitting timeouts, which then abort the
release and leave volumes locked, which in turn creates a vicious
cycle of even more data to pull next time.
Indeed, while successive "vos release" runs would each pull all 50GB,
after touching a file in the root directory and waiting, the next
"vos release" completed in seconds.
The solution suggested is a 15+ minute sleep and then a trivial update
to the volume. This ensures that *next time* you release, you only
backtrack into one trivial update and don't risk pulling much more
data than required. I implemented this in our scripts with [1].
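For anyone wanting the shape of that workaround without reading the
review, here is a minimal sketch. It is not the change in [1]; the
marker file name, the extra 5 minute margin and the function name are
assumptions, and the caller is assumed to hold tokens that allow
writing to the R/W volume.

```python
import time
from pathlib import Path

BACKTRACK_SECONDS = 15 * 60  # the window described above
MARGIN_SECONDS = 5 * 60      # assumed safety margin, giving the "15+" minutes


def settle_then_touch(volume_root, marker=".release-timestamp", sleep=time.sleep):
    """Wait out the backtrack window, then make one trivial update.

    Call this after the large sync has finished. The marker name is
    hypothetical; any small write to the volume should do.
    """
    sleep(BACKTRACK_SECONDS + MARGIN_SECONDS)
    path = Path(volume_root) / marker
    path.write_text(f"{time.time()}\n")
    return path
```

The sleep function is a parameter only so the wait can be stubbed out
when testing; a cron job would call it with the default.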
For completeness, I have captured a run of the mirror rsync and
extracted the file server audit logs for that run in [2]. However, I
think rsync touching too many files is a red herring.
The other thing suggested was that timeouts are best worked around by
using "-localauth" to do the vos release somewhere where it won't
time out. remctl was suggested [3] and is apparently commonly used for
this purpose.
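A rough sketch of the server-side half of that, assuming it runs as
root on a host holding the cell's server key (which "-localauth"
needs). The function names and the volume name in the test are
hypothetical, and how the call gets triggered remotely (remctl or
otherwise) is left out.

```python
import subprocess


def vos_release_command(volume):
    """Build the argument list for a server-side release of one volume."""
    return ["vos", "release", "-id", volume, "-localauth"]


def release(volume, run=subprocess.run):
    """Run the release and raise CalledProcessError if vos exits non-zero."""
    return run(vos_release_command(volume), check=True)
```

Since "-localauth" uses the server key rather than a token, there is
no token lifetime to outlast on a long release.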
Thanks for the input,
-i
[1] https://review.opendev.org/#/c/681367
[2] http://people.redhat.com/~iwienand/fedora-mirror-11-09-2019.tar.gz
[3] https://www.eyrie.org/~eagle/software/remctl/