[OpenAFS] vos moves can be slow, and we have some data

Dan Hyde Dan Hyde <drh@umich.edu>
Mon, 24 Aug 2009 22:57:02 -0400


We're getting ready for Fall term.  Our oncall staff don't like getting
paged in the middle of the night to handle fileserver capacity issues.
So, we moved a bunch of volumes around this past week to even out the
load.  We kept track of when the moves started, and looked at the ones
that took the longest.

Here's part of the move log for the volume that took the longest to
move (starting at 11:25:26, with the previous 4 seconds earlier, and
the next one 28:22 minutes later):

2009/08/14 11:25:20: Processing volume 296, 'user.xxxxxx'.
[...]
2009/08/14 11:25:26: Processing volume 297, 'user.yyyyyy'.
vos move user.yyyyyy afs58.ifs.umich.edu /vicepa localhost a -localauth
Volume 1956456697 moved from afs58.ifs.umich.edu /vicepa to localhost /vicepa
vos backup user.yyyyyy -localauth
Created backup volume for user.yyyyyy
2009/08/14 11:53:48: Processing volume 298, 'user.zzzzzz'.

A vos move shouldn't take 28:22 minutes.  Here's part of a combined
VolserLog, with fileserver afs58 moving to afs61, that shows what's
going on.

2009/08/14 11:25:26 afs58 1 Volser: Clone: Cloning volume 1956456697 to new volume 1957504905
2009/08/14 11:25:34 afs61 VAttachVolume: Failed to open /vicepa/V1956456697.vol (errno 2)
2009/08/14 11:25:34 afs61 1 Volser: CreateVolume: volume 1956456697 (user.yyyyyy) created
2009/08/14 11:30:59 afs58 trans 1914273 on volume 1957504905 is older than 300 seconds
2009/08/14 11:31:01 afs61 trans 236047 on volume 1956456697 is older than 300 seconds
2009/08/14 11:31:29 afs58 trans 1914273 on volume 1957504905 is older than 330 seconds
2009/08/14 11:31:31 afs61 trans 236047 on volume 1956456697 is older than 330 seconds
[...]
2009/08/14 11:48:30 afs58 trans 1914273 on volume 1957504905 is older than 1350 seconds
2009/08/14 11:48:32 afs61 trans 236047 on volume 1956456697 is older than 1350 seconds
2009/08/14 11:49:00 afs58 trans 1914273 on volume 1957504905 is older than 1380 seconds
2009/08/14 11:49:02 afs61 trans 236047 on volume 1956456697 is older than 1380 seconds
2009/08/14 11:49:22 afs58 1 Volser: Delete: volume 1956456697 deleted 
2009/08/14 11:53:44 afs58 1 Volser: Delete: volume 1957504905 deleted 
2009/08/14 11:53:44 afs61 1 Volser: Clone: Cloning volume 1956456697 to new volume 1956456699

The transaction seems to get stuck.

We're running OpenAFS-1.4.11 on linux 2.6.30.2.  The two machine rooms
are on opposite sides of town, 6 or 7 hops away, on what I believe is a
10GB network.

If anyone else is seeing the '... is older than ...' messages in their
VolserLog's, we'd like to hear about it.  If anyone has any ideas what
may be causing this, we'd really like to hear that, too.

Thanks, all.