[OpenAFS] Strange problem in 2 volumes that do not "vos move" or "vos restore"

Jose M Calhariz jose.calhariz@tecnico.ulisboa.pt
Sat, 2 Sep 2017 10:38:26 +0100


Hi,

I am one of the maintainers of a OpenAFS cell.  The cell runs on
Debian 8.x with OpenAFS 1.6.15-1 from backports.  The clients I use
for maintenance are Debian 8.x and Debian 9.x, aka 1.6.15-1 and 1.6.20-2.

I have 2 volumes that refused to do a maintenance "vos move".  I
started to investigate one of the volumes.  To my surprise, the "vos
restore" fails from a Debian 9.x client.  Then I tried to do a
"bos salvage", but it did not found a problem.  Then I upgraded
the fileserver to Debian 9.x and repeated the "bos salvage" and the
"vos restore" but there is still a problem.

Follow what I think is the relevant logs.  The SalsrvLog on the
destination of a vos move is too big, so I can send it compressed by
private email.  I am open to do more tests and to provide more logs.
The problematic volumes are preventing the a shutdown of the
fileservers for a must needed maintenance.



cat VolserLog on the destination fileserver during a vos restore:

Fri Sep  1 15:01:17 2017 Starting AFS Volserver 2.0 (/usr/lib/openafs/davolserver -p 16 -udpsize 16777216)
Fri Sep  1 15:30:49 2017 trans 22 on volume 538192990 is older than 300 seconds
Fri Sep  1 15:31:19 2017 trans 22 on volume 538192990 is older than 330 seconds
Fri Sep  1 15:31:49 2017 trans 22 on volume 538192990 is older than 360 seconds
Fri Sep  1 15:32:19 2017 trans 22 on volume 538192990 is older than 390 seconds
Fri Sep  1 15:32:49 2017 trans 22 on volume 538192990 is older than 420 seconds
Fri Sep  1 15:33:19 2017 trans 22 on volume 538192990 is older than 450 seconds
Fri Sep  1 15:33:49 2017 trans 22 on volume 538192990 is older than 480 seconds
Fri Sep  1 15:34:19 2017 trans 22 on volume 538192990 is older than 510 seconds
Fri Sep  1 15:34:49 2017 trans 22 on volume 538192990 is older than 540 seconds
Fri Sep  1 15:35:19 2017 trans 22 on volume 538192990 is older than 570 seconds
Fri Sep  1 15:35:49 2017 trans 22 on volume 538192990 is older than 600 seconds
Fri Sep  1 15:36:19 2017 trans 22 on volume 538192990 is older than 630 seconds
Fri Sep  1 15:36:49 2017 trans 22 on volume 538192990 is older than 660 seconds
Fri Sep  1 15:37:19 2017 trans 22 on volume 538192990 is older than 690 seconds
Fri Sep  1 15:37:50 2017 trans 22 on volume 538192990 is older than 720 seconds
Fri Sep  1 15:38:20 2017 trans 22 on volume 538192990 is older than 750 seconds
Fri Sep  1 15:38:50 2017 trans 22 on volume 538192990 is older than 780 seconds
Fri Sep  1 15:39:20 2017 trans 22 on volume 538192990 is older than 810 seconds
Fri Sep  1 15:39:50 2017 trans 22 on volume 538192990 is older than 840 seconds
Fri Sep  1 15:40:04 2017 1 Volser: Delete: volume 538192990 deleted 
Fri Sep  1 15:40:26 2017 1 Volser: CreateVolume: volume 538193041 (test) created
Fri Sep  1 15:40:48 2017 1 Volser: ReadVnodes: Error writing vnode index: Invalid argument; restore aborted
Fri Sep  1 15:40:48 2017 Scheduling salvage for volume 538193041 on part /vicepa over FSSYNC
Fri Sep  1 15:46:30 2017 1 Volser: Delete: volume 538193041 deleted 
Fri Sep  1 15:47:22 2017 1 Volser: CreateVolume: volume 538193042 (test) created
Fri Sep  1 15:47:45 2017 1 Volser: ReadVnodes: Error writing vnode index: Invalid argument; restore aborted
Fri Sep  1 15:47:45 2017 Scheduling salvage for volume 538193042 on part /vicepa over FSSYNC
Fri Sep  1 15:54:21 2017 1 Volser: Delete: volume 538193042 deleted 
Sat Sep  2 08:57:39 2017 1 Volser: CreateVolume: volume 538193047 (test) created
Sat Sep  2 08:58:03 2017 1 Volser: ReadVnodes: Error writing vnode index: Invalid argument; restore aborted
Sat Sep  2 08:58:03 2017 Scheduling salvage for volume 538193047 on part /vicepa over FSSYNC

>From a OpenAFS Client 1.6.20:
bos salvage -server afs16 -partition /vicepa -volume ciist_compras005 -showlog -forceDAFS -localauth
Starting salvage.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: waiting for salvage to complete.
bos: salvage completed
SalvageLog:
@(#) OpenAFS 1.6.15-1-debian built  2015-11-29 
08/28/2017 19:46:54 STARTING AFS SALVAGER 2.4 (/usr/lib/openafs/salvager -force)
08/28/2017 19:46:54 Starting salvage of file system partition /vicepa
08/28/2017 19:46:54 SALVAGING FILE SYSTEM PARTITION /vicepa (device=vicepa)
08/28/2017 19:46:54 ***Forced salvage of all volumes on this partition***
08/28/2017 19:50:31 4 nVolumesInInodeFile 128 
08/28/2017 19:50:31 CHECKING CLONED VOLUME 538034125.
08/28/2017 19:50:31 ciist_compras005.backup (538034125) updated 08/27/2017 21:35
08/28/2017 19:50:39 SALVAGING VOLUME 538034123.
08/28/2017 19:50:39 ciist_compras005 (538034123) updated 08/28/2017 19:41
08/28/2017 19:50:47 totalInodes 10388297
08/28/2017 21:24:09 Salvaged ciist_compras005 (538034123): 10385952 files, 19652064 blocks
08/28/2017 21:24:09 CHECKING CLONED VOLUME 538185110.
08/28/2017 21:24:09 base.groups2.readonly (538185110) updated 08/03/2017 17:19
08/28/2017 21:24:09 totalInodes 526
08/28/2017 21:24:09 SALVAGING OF PARTITION /vicepa COMPLETED


vos move -id ciist_compras005 -fromserver afs16 -frompartition /vicepa -toserver afs15 -topartition /vicepa -verbose -localauth
Starting transaction on source volume 538034123 ... done
Allocating new volume id for clone of volume 538034123 ... done
Cloning source volume 538034123 ... done
Ending the transaction on the source volume 538034123 ... done
Starting transaction on the cloned volume 538193048 ... done
Setting flags on cloned volume 538193048 ... done
Getting status of cloned volume 538193048 ... done
Deleting pre-existing destination volume 538034123 ...Creating the destination volume 538034123 ... done
Setting volume flags on destination volume 538034123 ... done
Dumping from clone 538193048 on source to volume 538034123 on destination ...
Failed to move data for the volume 538034123
   VOLSER: Problems encountered in doing the dump !
vos move: operation interrupted, cleanup in progress...
clear transaction contexts
Recovery: Releasing VLDB lock on volume 538034123 ... done
Recovery: Ending transaction on clone volume ... done
Recovery: Ending transaction on destination volume ... done
Recovery: Accessing VLDB.
move incomplete - attempt cleanup of target partition - no guarantee
Recovery:Deleting destination volume 538034123 ... done
Recovery: Creating transaction on source volume 538034123 ... done
Recovery: Setting flags on source volume 538034123 ... done
Recovery: Ending transaction on source volume 538034123 ... done
Recovery:Deleting clone volume 538193048 ... done
cleanup complete - user verify desired result


On afs15:

head -40 VolserLog
Sat Sep  2 09:52:47 2017 Starting AFS Volserver 2.0 (/usr/lib/openafs/davolserver -p 16 -udpsize 16777216)
Sat Sep  2 09:54:51 2017 VReadVolumeDiskHeader: Couldn't open header for volume 538034123 (errno 2).
Sat Sep  2 09:54:51 2017 attach2: forcing vol 538034123 to error state (state 0 flags 0x0 ec 103)
Sat Sep  2 09:54:51 2017 1 Volser: CreateVolume: volume 538034123 (ciist_compras005) created
Sat Sep  2 09:55:49 2017 1 Volser: ReadVnodes: Error writing vnode index: Invalid argument; restore aborted
Sat Sep  2 09:55:49 2017 Scheduling salvage for volume 538034123 on part /vicepa over FSSYNC
Sat Sep  2 09:55:49 2017 SYNC_ask: negative response on circuit 'FSSYNC'
Sat Sep  2 09:55:49 2017 FSYNC_askfs: FSSYNC request denied for reason=65537
Sat Sep  2 09:55:49 2017 VAttachVolume: file server says volume 538034123 is salvaging
Sat Sep  2 09:55:49 2017 Scheduling salvage for volume 538034123 on part /vicepa over FSSYNC
Sat Sep  2 09:55:52 2017 SYNC_ask: negative response on circuit 'FSSYNC'
Sat Sep  2 09:55:52 2017 FSYNC_askfs: FSSYNC request denied for reason=65537
Sat Sep  2 09:55:52 2017 VAttachVolume: file server says volume 538034123 is salvaging
Sat Sep  2 09:55:52 2017 Scheduling salvage for volume 538034123 on part /vicepa over FSSYNC
Sat Sep  2 09:55:55 2017 SYNC_ask: negative response on circuit 'FSSYNC'
Sat Sep  2 09:55:55 2017 FSYNC_askfs: FSSYNC request denied for reason=65537
Sat Sep  2 09:55:55 2017 VAttachVolume: file server says volume 538034123 is salvaging
Sat Sep  2 09:55:55 2017 Scheduling salvage for volume 538034123 on part /vicepa over FSSYNC
Sat Sep  2 09:55:58 2017 SYNC_ask: negative response on circuit 'FSSYNC'
Sat Sep  2 09:55:58 2017 FSYNC_askfs: FSSYNC request denied for reason=65537
Sat Sep  2 09:55:58 2017 VAttachVolume: file server says volume 538034123 is salvaging
Sat Sep  2 09:55:58 2017 Scheduling salvage for volume 538034123 on part /vicepa over FSSYNC
Sat Sep  2 09:56:01 2017 SYNC_ask: negative response on circuit 'FSSYNC'
Sat Sep  2 09:56:01 2017 FSYNC_askfs: FSSYNC request denied for reason=65537
Sat Sep  2 09:56:01 2017 VAttachVolume: file server says volume 538034123 is salvaging
Sat Sep  2 09:56:01 2017 Scheduling salvage for volume 538034123 on part /vicepa over FSSYNC
Sat Sep  2 09:56:04 2017 SYNC_ask: negative response on circuit 'FSSYNC'
Sat Sep  2 09:56:04 2017 FSYNC_askfs: FSSYNC request denied for reason=65537
Sat Sep  2 09:56:04 2017 VAttachVolume: file server says volume 538034123 is salvaging
Sat Sep  2 09:56:04 2017 Scheduling salvage for volume 538034123 on part /vicepa over FSSYNC
Sat Sep  2 09:56:07 2017 SYNC_ask: negative response on circuit 'FSSYNC'
Sat Sep  2 09:56:07 2017 FSYNC_askfs: FSSYNC request denied for reason=65537
Sat Sep  2 09:56:07 2017 VAttachVolume: file server says volume 538034123 is salvaging
Sat Sep  2 09:56:07 2017 Scheduling salvage for volume 538034123 on part /vicepa over FSSYNC
Sat Sep  2 09:56:10 2017 SYNC_ask: negative response on circuit 'FSSYNC'
Sat Sep  2 09:56:10 2017 FSYNC_askfs: FSSYNC request denied for reason=65537
Sat Sep  2 09:56:10 2017 VAttachVolume: file server says volume 538034123 is salvaging
Sat Sep  2 09:56:10 2017 Scheduling salvage for volume 538034123 on part /vicepa over FSSYNC
Sat Sep  2 09:56:13 2017 SYNC_ask: negative response on circuit 'FSSYNC'
Sat Sep  2 09:56:13 2017 FSYNC_askfs: FSSYNC request denied for reason=65537

ls -alFh SalsrvLog
-rw-r--r-- 1 root root 8.0M Set  2 09:57 SalsrvLog



Kind regards
Jose M Calhariz

-- 
--
Então o Nhonho é um adulto horizontal!!!

-- Chaves