[OpenAFS] Odd error on 'vos move'

Mon, 7 Dec 2015 15:51:22 -0500 (EST)

On Mon, 7 Dec 2015, Garance A Drosehn wrote:

> Hi.
>
> I've been busy moving our AFS volumes from ancient file servers to
> up-to-date file servers.  So far this has been going along well,
> but last week I ran into an odd error moving one 10.79 GiB file.
>
> My main question is:  Could a problem like this be caused by my
> AFS token expiring in the middle of the transfer?  Here's the
> output from vos-move:
>
> /usr/sbin/vos move -id <_details_>  -verbose
>    Starting transaction on source volume <__old__> ... done
>    Allocating new volume id for clone of volume <__old__> ... done
>    Cloning source volume <__old__> ... done
>    Ending the transaction on the source volume <__old__> ... done
>    Starting transaction on the cloned volume <_clone_> ... done
>    Setting flags on cloned volume <_clone_> ... done
>    Getting status of cloned volume <_clone_> ... done
>    Deleting pre-existing destination volume <__old__> ...Creating the
> destination volume <__old__> ... done
>    Setting volume flags on destination volume <__old__> ... done
>    Dumping from clone <_clone_> on source to volume <__old__> on destination
> ...vos move: operation interrupted, cleanup in progress...
>    clear transaction contexts
>    Recovery: Releasing VLDB lock on volume <__old__> ... done
>    Recovery: Ending transaction on clone volume ... done
>    Recovery: Ending transaction on destination volume ... done
>    Recovery: Accessing VLDB.
>    FATAL: VLDB access error: abort cleanup
>    cleanup complete - user verify desired result
> #------>Error-> *** cs=256 ***
>
> The vos-move command took about 54 minutes.  It started after I
> had moved several other large volumes, and it happened that my
> AFS token expired in the middle of this vos-move.  I was doing
> some other things in AFS at the time, and the token could not
> have been expired longer than a minute or two before I noticed
> it.  I did a new 'klog', and it was at least five minutes later
> before the vos-move terminated.  I suspect it was more like
> 10-15 minutes, but I didn't really keep track of that.
>
> So, could the problem have been caused by the token expiring in
> the middle of the transfer?

Yes.  The client will not create a new connection to pick up the new
token; it keeps using the old token until the server notices that the
token has expired and sends a new challenge (usually around the expiry
time plus the clock-skew window).

> At this point, if I do a 'listvol' on both the source and
> destination servers, the volume exists on both of them.  On
> the destination server the volume is marked as 'Off-line'.
> If I do a 'vos examine', the volume is listed as being on
> the original (source) server, and is also marked as LOCKED.
>
> I assume that the thing to do right now would be to:
>   1. vos-remove the copy which exists on the destination
>      file server (and which is not shown in vos-examine).
>   2. vos-unlock the copy which exists on the original
>      file server.
>   3. Retry the vos-move, this time making sure my AFS token
>      won't expire in the middle of the transfer!
>
> Does this seem reasonable?  Are there any other checks I should
> do before trying those?  I was able to read all the data in the
> volume (using 'md5sum') without warnings or errors showing up
> in any log files on the server.

That sounds like a correct procedure.  Note that the credentials used by
-localauth do not expire; I suggest using that for a long-running move.
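
As a rough sketch, the three steps could look like the script below.  The
volume, server, and partition names are hypothetical placeholders (the
real ones are elided above), and it assumes the commands are run as root
on a server machine that holds the cell's key file, which -localauth
requires.  The script only prints the commands unless DRY_RUN=0 is set,
so they can be reviewed first.

```shell
#!/bin/sh
# Dry-run sketch of the recovery steps; every name below is a placeholder.
VOLUME="${VOLUME:-example.volume}"            # hypothetical volume name
OLD_SERVER="${OLD_SERVER:-oldfs.example.edu}" # hypothetical source server
OLD_PART="${OLD_PART:-a}"                     # hypothetical source partition
NEW_SERVER="${NEW_SERVER:-newfs.example.edu}" # hypothetical destination server
NEW_PART="${NEW_PART:-a}"                     # hypothetical destination partition
DRY_RUN="${DRY_RUN:-1}"                       # 1 = print only, 0 = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_and_retry() {
    # 1. Remove the off-line leftover copy on the destination server.
    run vos remove -server "$NEW_SERVER" -partition "$NEW_PART" \
        -id "$VOLUME" -localauth
    # 2. Release the VLDB lock left behind by the interrupted move.
    run vos unlock -id "$VOLUME" -localauth
    # 3. Retry the move with credentials that do not expire.
    run vos move -id "$VOLUME" \
        -fromserver "$OLD_SERVER" -frompartition "$OLD_PART" \
        -toserver "$NEW_SERVER" -topartition "$NEW_PART" \
        -localauth -verbose
}

recover_and_retry
```

Running 'vos listvol' on both servers and 'vos examine' on the volume
before and after each step is a cheap way to confirm the state is what
you expect.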

-Ben