[OpenAFS] Re: vos release running unexpectedly long

Ian Wienand iwienand@redhat.com
Wed, 11 Sep 2019 10:38:26 +1000


On Fri, Aug 30, 2019 at 12:35:09PM +1000, Ian Wienand wrote:
> I'm struggling to find an angle to debug very long "vos release" times
> with some of our volumes.

So some more details ... I've managed to take out any write updates
from the equation, but the volume with no updates still takes quite a
long time to release.

I first ran a release on the volume:

 ...
 Released volume mirror.fedora successfully
 root@afs01:~# date
 Tue Sep 10 23:19:37 UTC 2019

I then quickly started a new release; nothing had written to this
volume in the mean time (the only server that could is shutdown):

 root@afs01:~# date
 Tue Sep 10 23:24:45 UTC 2019
 root@afs01:~# vos release -localauth -v mirror.fedora

So this should be a zero-delta update.  However, this took ~ 30
minutes:

 Released volume mirror.fedora successfully
 root@afs01:~# date
 Tue Sep 10 23:55:27 UTC 2019

The place it spends time is

 Starting ForwardMulti from 536871007 to 536871007 on afs02.dfw.openstack.org (as of Mon Sep  9 06:23:11 2019).

I had rxdebug and netstat running in a loop on afs02 while this was
happening.  The numbers seem to add up that about 50gb of data is
transferred as part of the (theoretically zero-delta) release.

 start of vos release netstat
  1969282994 packets received
  InOctets: 2855172400594

 end of vos release netstat
  2003101829 packets received
  InOctets: 2904920938101

 final rxdebug report for thread before it disappeared:
  Received 157738240 bytes in 33476256 packets
 (it seems the received stats rolls over at 4gb?  but packet count is accurate)

 Total octets recevied over period:
  2904920938101 - 2855172400594 = 49748537507 (49.74gb)

 Total packets recevied over period:
  2003101829 - 1969282994 = 33818835

The approx packet size is sane for mtu
  49748537507 / 33818835 = 1471.03049253

Thus calculating the percentage of total packets _not_ accounted for
by rxdebug:

 (33818835 - 33476256) / 33818835 * 100 ~= 1%

So it lines up that 99% of all of the system traffic seen via netstat
over the release period is accounted for by the update thread reported
by rxdebug.

Ergo the "vos release" of the volume with no changes has resulted in
about 50gb of data being sent to the R/O mirror, and consequently
long release times.

I do have the file audit logs for this period, although I'm not sure
what to look for exactly.

-i