[OpenAFS] Dafileserver does not manage to slavage volume

Harald Barth haba@kth.se
Wed, 08 Jul 2015 22:58:16 +0200 (CEST)


> Was there anything in the 'SalsrvLog' that indicates why the volume was
> not salvaged?

No, the last entry about that volume is 

07/07/2015 06:08:17 dispatching child to salvage volume 536870963...

>> # bos salvage mount-kilimanjaro.stacken.kth.se a 536870963 -showlog
>> This is a demand attach fileserver.  Are you sure you want to proceed with a manual salvage?
>> must specify -forceDAFS flag in order to proceed.
> 
> Yes, that command will schedule a salvage of the volume with the salvage
> server.

If I only give the command without -forceDAFS nothing happens (as the text suggests).

> If you give -forceDAFS -all to do a complete salvage of all the partitions, or
> specify a -partition to salvage, the bosserver will shutdown the file server
> and salvage server, and run the salvager process, like pre-dafs.  This means
> the file server is offline while the partitions are salvaged.

Ok, seems I need to run a short interrupt in the fileserver as I don't seem to
be able to get a salvage of the volume in some other way. Lets go....

That seems broken. After issuing the bos salvage command, I get a
salvageserver -client which does _nothing_...

# ps augxww | grep salvage
root      3507  0.0  0.0  41932  1264 ?        Sl   Jul07   0:01 /usr/lib/openafs/salvageserver -datelogs -parallel all8 -orphans attach
root      5720  0.0  0.0  41936   628 ?        S    Jul07   0:00 /usr/lib/openafs/salvageserver -datelogs -parallel all8 -orphans attach
root      9381  0.0  0.0  28152  3376 pts/0    S+   22:38   0:00 bos salvage mount-kilimanjaro.stacken.kth.se a -volume 536870963 -showlog -local -forceDAFS
root      9382  0.0  0.0   9012   884 ?        S    22:38   0:00 /usr/lib/openafs/salvageserver -client /vicepa 536870963
root      9458  0.0  0.0   8948   812 pts/1    S+   22:41   0:00 grep salvage

# strace -p 9382
Process 9382 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
time(NULL)                              = 1436388191
write(3, "\3\0\0\0\223\0\0\0\223\0\0\0\0\0\0\0\246$\0\0\0\207\255g\4\0\1\0\0\0\1\0"..., 96) = 96
readv(3, [{"\3\0\0\0\203\3\0\0\223\0\0\0\203\3\0\0@\0\0\0\0\0\0\0\0\0\0\0\2\1\0\0", 32}, {"\2\0\0\0\213\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32}], 2) = 64
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({2, 0}, 0x7fff926ee230)       = 0
time(NULL)                              = 1436388193
write(3, "\3\0\0\0\224\0\0\0\224\0\0\0\0\0\0\0\246$\0\0\0\207\255g\4\0\1\0\0\0\1\0"..., 96) = 96

And so on over and over.

Ok, kill everything off and restart....

07/08/2015 22:48:47 dispatching child to salvage volume 536870963...
07/08/2015 22:48:48 2 nVolumesInInodeFile 64 
07/08/2015 22:48:48 waiting for fileserver to finish scanning partition /vicepa...
07/08/2015 22:48:54 CHECKING CLONED VOLUME 536870965.
07/08/2015 22:48:54 stackware.backup (536870965) updated 07/01/1998 18:19
07/08/2015 22:48:54 Vnode 1: length incorrect; (is 2048 should be 18432)
07/08/2015 22:48:54 SALVAGING VOLUME 536870963.
07/08/2015 22:48:54 stackware (536870963) updated 07/01/1998 18:19
07/08/2015 22:48:54 Vnode 1: length incorrect; changed from 2048 to 18432
07/08/2015 22:48:54 totalInodes 3408
07/08/2015 22:48:54 Salvaged stackware (536870963): 3401 files, 71692 blocks
07/08/2015 22:49:55 dispatching child to salvage volume 536870965...
07/08/2015 22:49:55 dispatching child to salvage volume 536870963...
07/08/2015 22:49:55 namei_ListAFSSubDirs: warning: VG 536870965 does not have a link table; salvager will recreate it.
07/08/2015 22:49:55 fileserver requested salvage of clone 536870965; scheduling salvage of volume group 536870963...
07/08/2015 22:49:55 1 nVolumesInInodeFile 32 
07/08/2015 22:49:55 SALVAGING VOLUME 536870963.
07/08/2015 22:49:55 stackware (536870963) updated 07/01/1998 18:19
07/08/2015 22:49:55 totalInodes 3405
07/08/2015 22:49:56 Salvaged stackware (536870963): 3401 files, 71692 blocks
07/08/2015 22:49:56 The volume header file /vicepa/V0536870965.vol is not associated with any actual data (deleted)

So that time it went well. Wonder what blocked the salvage the first
and the second time. Looks to me that a salvageserver process can get
in some non-responsive state that then blocks all further salvage
attempts.

Harald.