[OpenAFS] vos move & vos rel failing on large volumes

Lee Damon nomad@ssli-mail.ee.washington.edu
Thu, 10 Apr 2003 09:52:25 -0700


I've cut and pasted a bunch of different suggestions into this email and am
replying to each one in turn.  I would like to thank everyone who has
replied with suggestions or timing information.

> been able to move them.  You may want to issue your vos move command 
> again with the -verbose option to see if there is more info on what 

[root@harrier root]# time vos move -verbose -fromserver kite -frompartition 
/vic epc -toserver potential -topartition /vicepf -id v.s8.cadence.2002a
Starting transaction on source volume 536872898 ... done
Cloning source volume 536872898 ... done
Ending the transaction on the source volume 536872898 ... done
Starting transaction on the cloned volume 1443364917 ... done
Deleting pre-existing volume 536872898 on destination ... done
Creating the destination volume 536872898 ... done
Dumping from clone 1443364917 on source to volume 536872898 on destination 
...Failed to move data for the volume 536872898
   rxk: sealed data inconsistent
vos move: operation interrupted, cleanup in progress...
clear transaction contexts
access VLDB
FATAL: VLDB access error: abort cleanup
cleanup complete - user verify desired result

real    648m58.951s
user    0m0.010s
sys     0m0.006s


>me either. i'd sort of like to see volserver strace/truss/ptrace/whatever
>output while such a slow thing is happening.

Derrick: I'll send this directly, it's *very* verbose, as one would expect 
(22MB of output from running strace for about 4 minutes.)

> Meltdown.pl -s <src volserver> -p 7005 -t 30  (seconds)
>
> and I want to pay attention to the "resend" column.

This looks telling...
/tmp/Meltdown.pl -s kite -p 7005 -t 30

Server: kite, Port: 7005, Interval 30 seconds
Thu Apr 10 09:23:30 PDT 2003

hh:mm:ss wproc nobufs   wpack  fpack    calls     delta  data      resends  
idle
09:23:30 0     0        0      464      8358      0      12387116  686973   11 

09:24:00 0     0        0      464      8358      0      12387116  686973   11 

09:24:30 0     0        0      464      8358      0      12387116  686973   11 

09:25:00 0     0        0      464      8369      11     12387140  686974   11 

09:25:30 0     0        0      432      8369      0      12389368  687103   10 

09:26:00 0     0        0      432      8369      0      12392574  687283   10 

09:26:30 0     0        0      432      8369      0      12395914  687468   10 

09:27:00 0     0        0      432      8369      0      12398900  687639   10 

09:27:30 0     0        0      432      8369      0      12401784  687806   10 

09:28:00 0     0        0      432      8369      0      12405380  687998   10 

09:28:30 0     0        0      432      8369      0      12408306  688166   10 

09:29:00 0     0        0      432      8369      0      12411452  688340   10 


And, for the fileserver we're trying to move to:

/tmp/Meltdown.pl -s potential -p 7005 -t 30

Server: potential, Port: 7005, Interval 30 seconds
Thu Apr 10 09:26:52 PDT 2003

hh:mm:ss wproc nobufs   wpack  fpack    calls     delta  data      resends  
idle
09:26:52 0     0        0      462      2310      0      3812638   253710   10 

09:27:22 0     0        0      462      2310      0      3812638   253710   10 

09:27:52 0     0        0      462      2310      0      3812638   253710   10 

09:28:22 0     0        0      462      2310      0      3812638   253710   10 

09:28:52 0     0        0      462      2310      0      3812638   253710   10 

09:29:22 0     0        0      462      2310      0      3812638   253710   10 

09:29:52 0     0        0      462      2310      0      3812638   253710   10 

09:30:22 0     0        0      462      2310      0      3812638   253710   10 

09:30:52 0     0        0      462      2310      0      3812638   253710   10 

09:31:22 0     0        0      462      2310      0      3812638   253710   10 

09:31:52 0     0        0      462      2310      0      3812638   253710   10 

09:32:22 0     0        0      462      2310      0      3812638   253710   10 

09:32:52 0     0        0      462      2310      0      3812638   253710   10 


> If this column is increasing while the release/move is in
> progress than I would want to look at the routes between
> these 2 volserver machines.

The routes look clean, but they are something I have very little control
over.  I'm trying to get networking involved, we'll see.  One thing I
discovered yesterday was that the new fileserver (potential) was ordered
with 100baseT cards instead of gig, but I don't think that is sufficient
to explain the problem.

The thing is, we were seeing slow moves before we brought potential in.
Our first two fileservers, kite & harrier, are connected directly to
the same gig switch, and large releases were taking hours there too.  I'm
going to do the same set of tests (with a much smaller volume :) between
them and see if resends climb while that is happening.    Since I
don't want to leave the move in an "indeterminate state", I'll have to wait
for that to time out and die of its own unnatural causes first.
 
> Is duplex settings set properly on the network card and
> switch port ?  "netstat -ni" can show collisions if this is
> a problem.

netstat -ni shows no collisions or other errors on any interface on any of
our fileservers.
 
> Can the network handle fragmented packets ?

yes, I believe so.
 
> How FULL is the destination vice partition ?  Does it have
> ample free inodes around ?

Darn near empty.  It's a new fileserver.

> I would also look at the the volserver options
>   - using         -udpsize 1048576

I'm a bit lost as to how to reset BosConfig to pass this option on to volserver
without trashing my entire cell.  Any hints for a lost soul?

> Are there any other vos tranactions happening on the src or
> destination server during these times

In the duration of an 10+hour move, there are a few other vos releases that
happen.

>       vos status <src volserver>
>       vos status <dst volserver>

Here's a snapshot from just now:

 [root@harrier root]# vos status kite
Total transactions: 1
--------------------------------------
transaction: 7376  created: Thu Apr 10 09:25:01 2003
attachFlags:  offline 
volumeStatus: 
volume: 1443364918  partition: /vicepc  procedure: Forward
packetRead: 1  lastReceiveTime: 96171503  packetSend: 41257  lastSendTime: 
96171503
--------------------------------------

[root@harrier root]# vos status potential
Total transactions: 1
--------------------------------------
transaction: 1815  created: Thu Apr 10 09:25:11 2003
volumeStatus: 
volume: 536872898  partition: /vicepf  procedure: Restore
packetRead: 41366  lastReceiveTime: 96171459  packetSend: 1  lastSendTime: 
96171459
--------------------------------------


Another interesting tidbit is in the VolserLog on kite. Volume 1443364918
is the one I'm trying to move to (on potential), 536872898 is the source (on
kite).

Thu Apr 10 09:38:01 2003 trans 7376 on volume 1443364918 is older than 780 
secon
ds
Thu Apr 10 09:38:31 2003 trans 7376 on volume 1443364918 is older than 810 
secon
ds
Thu Apr 10 09:39:01 2003 trans 7376 on volume 1443364918 is older than 840 
secon
ds
Thu Apr 10 09:39:31 2003 trans 7376 on volume 1443364918 is older than 870 
secon
ds
Thu Apr 10 09:40:01 2003 trans 7376 on volume 1443364918 is older than 900 
secon
ds
Thu Apr 10 09:40:31 2003 trans 7376 on volume 1443364918 is older than 930 
secon
ds
Thu Apr 10 09:41:01 2003 trans 7376 on volume 1443364918 is older than 960 
secon
ds


The previous move said that for a very long time, then started saying the
trans was idle:

Thu Apr 10 01:06:01 2003 trans 7236 on volume 1443364917 is older than 38730 
sec
onds
Thu Apr 10 01:06:31 2003 trans 7236 on volume 1443364917 is older than 38760 
sec
onds
Thu Apr 10 01:07:01 2003 trans 7236 on volume 1443364917 is older than 38790 
sec
onds
Thu Apr 10 01:12:31 2003 trans 7236 on volume 1443364917 has been idle for 
more
than 300 seconds

and finally:

Thu Apr 10 01:16:01 2003 trans 7236 on volume 1443364917 has been idle for 
more
than 510 seconds
Thu Apr 10 01:16:31 2003 trans 7236 on volume 1443364917 has been idle for 
more
than 540 seconds
Thu Apr 10 01:17:01 2003 trans 7236 on volume 1443364917 has been idle for 
more
than 570 seconds
Thu Apr 10 01:17:31 2003 trans 7236 on volume 1443364917 has been idle for 
more
than 600 seconds
Thu Apr 10 01:17:31 2003 trans 7236 on volume 1443364917 has timed out

thanks again for all the hints,
nomad
 -----------                       - Lee "nomad" Damon -          \
work: nomad@ee.washington.edu                                      \
play: nomad@castle.org    or castle!nomad                           \
                                                                    /\
Sr. Systems Admin, UWEE SSLI Lab                                   /  \
                "Celebrate Diversity"                             /    \