[OpenAFS] Dafileserver does not manage to slavage volume

Wed, 8 Jul 2015 18:31:41 -0400

On Wed, 8 Jul 2015 22:58:16 +0200
Harald Barth <haba@kth.se> wrote:

> > Was there anything in the 'SalsrvLog' that indicates why the volume was
> > not salvaged?
> 
> No, the last entry about that volume is 
> 
> 07/07/2015 06:08:17 dispatching child to salvage volume 536870963...
> 
> >> # bos salvage mount-kilimanjaro.stacken.kth.se a 536870963 -showlog
> >> This is a demand attach fileserver.  Are you sure you want to proceed with a manual salvage?
> >> must specify -forceDAFS flag in order to proceed.
> > 
> > Yes, that command will schedule a salvage of the volume with the salvage
> > server.
> 
> If I only give the command without -forceDAFS nothing happens (as the text suggests).

Sorry, I meant to say, that command but with the additional -forceDAFS flag
should schedule a salvage of the given volume number.  The fileserver will stay
online. If you do not specify the volume number, then the fileserver will be
shutdown while the salvager runs for the full partition, or all the partitions
if you give the -all flag.

> > If you give -forceDAFS -all to do a complete salvage of all the partitions, or
> > specify a -partition to salvage, the bosserver will shutdown the file server
> > and salvage server, and run the salvager process, like pre-dafs.  This means
> > the file server is offline while the partitions are salvaged.
> 
> Ok, seems I need to run a short interrupt in the fileserver as I don't seem to
> be able to get a salvage of the volume in some other way. Lets go....
> 
> That seems broken. After issuing the bos salvage command, I get a
> salvageserver -client which does _nothing_...

The salvagerserver -client process schedules the salvage by contacting the salvage
server.

> 
> # ps augxww | grep salvage
> root      3507  0.0  0.0  41932  1264 ?        Sl   Jul07   0:01 /usr/lib/openafs/salvageserver -datelogs -parallel all8 -orphans attach
> root      5720  0.0  0.0  41936   628 ?        S    Jul07   0:00 /usr/lib/openafs/salvageserver -datelogs -parallel all8 -orphans attach
> root      9381  0.0  0.0  28152  3376 pts/0    S+   22:38   0:00 bos salvage mount-kilimanjaro.stacken.kth.se a -volume 536870963 -showlog -local -forceDAFS
> root      9382  0.0  0.0   9012   884 ?        S    22:38   0:00 /usr/lib/openafs/salvageserver -client /vicepa 536870963
> root      9458  0.0  0.0   8948   812 pts/1    S+   22:41   0:00 grep salvage
> 
> # strace -p 9382
> Process 9382 attached - interrupt to quit
> restart_syscall(<... resuming interrupted call ...>) = 0
> time(NULL)                              = 1436388191
> write(3, "\3\0\0\0\223\0\0\0\223\0\0\0\0\0\0\0\246$\0\0\0\207\255g\4\0\1\0\0\0\1\0"..., 96) = 96
> readv(3, [{"\3\0\0\0\203\3\0\0\223\0\0\0\203\3\0\0@\0\0\0\0\0\0\0\0\0\0\0\2\1\0\0", 32}, {"\2\0\0\0\213\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32}], 2) = 64
> rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> nanosleep({2, 0}, 0x7fff926ee230)       = 0
> time(NULL)                              = 1436388193
> write(3, "\3\0\0\0\224\0\0\0\224\0\0\0\0\0\0\0\246$\0\0\0\207\255g\4\0\1\0\0\0\1\0"..., 96) = 96
> 
> And so on over and over.

The client was just in contact with the salvageserver, which looks like process
3507.  In this case, I think it would be interesting to see what process 5720
was doing.  I assume it's parent was 3507.

> 
> Ok, kill everything off and restart....
> 
> 07/08/2015 22:48:47 dispatching child to salvage volume 536870963...
> 07/08/2015 22:48:48 2 nVolumesInInodeFile 64 
> 07/08/2015 22:48:48 waiting for fileserver to finish scanning partition /vicepa...
> 07/08/2015 22:48:54 CHECKING CLONED VOLUME 536870965.
> 07/08/2015 22:48:54 stackware.backup (536870965) updated 07/01/1998 18:19
> 07/08/2015 22:48:54 Vnode 1: length incorrect; (is 2048 should be 18432)
> 07/08/2015 22:48:54 SALVAGING VOLUME 536870963.
> 07/08/2015 22:48:54 stackware (536870963) updated 07/01/1998 18:19
> 07/08/2015 22:48:54 Vnode 1: length incorrect; changed from 2048 to 18432
> 07/08/2015 22:48:54 totalInodes 3408
> 07/08/2015 22:48:54 Salvaged stackware (536870963): 3401 files, 71692 blocks
> 07/08/2015 22:49:55 dispatching child to salvage volume 536870965...
> 07/08/2015 22:49:55 dispatching child to salvage volume 536870963...
> 07/08/2015 22:49:55 namei_ListAFSSubDirs: warning: VG 536870965 does not have a link table; salvager will recreate it.
> 07/08/2015 22:49:55 fileserver requested salvage of clone 536870965; scheduling salvage of volume group 536870963...
> 07/08/2015 22:49:55 1 nVolumesInInodeFile 32 
> 07/08/2015 22:49:55 SALVAGING VOLUME 536870963.
> 07/08/2015 22:49:55 stackware (536870963) updated 07/01/1998 18:19
> 07/08/2015 22:49:55 totalInodes 3405
> 07/08/2015 22:49:56 Salvaged stackware (536870963): 3401 files, 71692 blocks
> 07/08/2015 22:49:56 The volume header file /vicepa/V0536870965.vol is not associated with any actual data (deleted)
> 
> So that time it went well. Wonder what blocked the salvage the first
> and the second time. Looks to me that a salvageserver process can get
> in some non-responsive state that then blocks all further salvage
> attempts.
> 
> Harald.

-- 
Michael Meffie <mmeffie@sinenomine.net>