[OpenAFS] vos moves can be slow, and we have some data

Derrick Brashear shadow@gmail.com
Tue, 25 Aug 2009 07:16:17 -0400


On Mon, Aug 24, 2009 at 10:57 PM, Dan Hyde<drh@umich.edu> wrote:
> We're getting ready for Fall term. =A0Our oncall staff don't like getting
> paged in the middle of the night to handle fileserver capacity issues.
> So, we moved a bunch of volumes around this past week to even out the
> load. =A0We kept track of when the moves started, and looked at the ones
> that took the longest.
>
> Here's part of the move log for the volume that took the longest to
> move (starting at 11:25:26, with the previous 4 seconds earlier, and
> the next one 28:22 minutes later):
>
> 2009/08/14 11:25:20: Processing volume 296, 'user.xxxxxx'.
> [...]
> 2009/08/14 11:25:26: Processing volume 297, 'user.yyyyyy'.
> vos move user.yyyyyy afs58.ifs.umich.edu /vicepa localhost a -localauth
> Volume 1956456697 moved from afs58.ifs.umich.edu /vicepa to localhost /vi=
cepa
> vos backup user.yyyyyy -localauth
> Created backup volume for user.yyyyyy
> 2009/08/14 11:53:48: Processing volume 298, 'user.zzzzzz'.
>
> A vos move shouldn't take 28:22 minutes. =A0Here's part of a combined
> VolserLog, with fileserver afs58 moving to afs61, that shows what's
> going on.
>
> 2009/08/14 11:25:26 afs58 1 Volser: Clone: Cloning volume 1956456697 to n=
ew volume 1957504905
> 2009/08/14 11:25:34 afs61 VAttachVolume: Failed to open /vicepa/V19564566=
97.vol (errno 2)
> 2009/08/14 11:25:34 afs61 1 Volser: CreateVolume: volume 1956456697 (user=
.yyyyyy) created
> 2009/08/14 11:30:59 afs58 trans 1914273 on volume 1957504905 is older tha=
n 300 seconds
> 2009/08/14 11:31:01 afs61 trans 236047 on volume 1956456697 is older than=
 300 seconds
> 2009/08/14 11:31:29 afs58 trans 1914273 on volume 1957504905 is older tha=
n 330 seconds
> 2009/08/14 11:31:31 afs61 trans 236047 on volume 1956456697 is older than=
 330 seconds
> [...]
> 2009/08/14 11:48:30 afs58 trans 1914273 on volume 1957504905 is older tha=
n 1350 seconds
> 2009/08/14 11:48:32 afs61 trans 236047 on volume 1956456697 is older than=
 1350 seconds
> 2009/08/14 11:49:00 afs58 trans 1914273 on volume 1957504905 is older tha=
n 1380 seconds
> 2009/08/14 11:49:02 afs61 trans 236047 on volume 1956456697 is older than=
 1380 seconds
> 2009/08/14 11:49:22 afs58 1 Volser: Delete: volume 1956456697 deleted
> 2009/08/14 11:53:44 afs58 1 Volser: Delete: volume 1957504905 deleted
> 2009/08/14 11:53:44 afs61 1 Volser: Clone: Cloning volume 1956456697 to n=
ew volume 1956456699
>
> The transaction seems to get stuck.

If you have additional information, like packet captures, showing
that, fine. Otherwise, the information included doesn't demonstrate
it. I'm not saying you're wrong. I'm saying it's not included here.

> We're running OpenAFS-1.4.11 on linux 2.6.30.2. =A0The two machine rooms
> are on opposite sides of town, 6 or 7 hops away, on what I believe is a
> 10GB network.
>
> If anyone else is seeing the '... is older than ...' messages in their
> VolserLog's, we'd like to hear about it. =A0If anyone has any ideas what
> may be causing this, we'd really like to hear that, too.

Sure. Lots of them. They don't only print for something that "should
be done". Guess what happens whem (for example) you move a large
volume over long distances?

We could use enhanced Rx statistics (thanks, Jake!) to guess when it
should be done and only log starting then.