[OpenAFS] Re: Crazy DAFS problem (with log)

Ryan C. Underwood nemesis@icequake.net
Sun, 20 Mar 2011 20:59:43 -0500


On Sun, Mar 20, 2011 at 07:42:12PM -0500, Andrew Deason wrote:
> 
> What's in SalsrvLog and SalsrvLog.old? It should have a bunch of stuff
> on the salvages here. Also, BosLog for this time period would be good
> for completeness.

Not much interesting in BosLog.old:
Sun Mar 20 04:00:29 2011: Core limits now -1 -1
Sun Mar 20 04:00:29 2011: Server directory access is okay
Sun Mar 20 16:32:17 2011: ptserver exited on signal 15
Sun Mar 20 16:32:17 2011: vlserver exited on signal 15
Sun Mar 20 16:32:17 2011: dafs:salsrv exited on signal 15
Sun Mar 20 16:32:17 2011: dafs:vol exited on signal 15
Sun Mar 20 16:32:18 2011: dafs:file exited with code 0
Sun Mar 20 16:32:18 2011: Shutdown of BOS server and processes in response to signal 15

SalsrvLog, on the other hand, shows craziness.  It just keeps salvaging
the same two volumes again and again, for hours.

03/20/2011 16:34:06 Starting OpenAFS Online Salvage Server 2.4 (/usr/lib/openafs/salvageserver)
03/20/2011 16:34:30 dispatching child to salvage volume 536870915...
03/20/2011 16:34:30 dispatching child to salvage volume 536871273...
03/20/2011 16:34:30 2 nVolumesInInodeFile 56 
03/20/2011 16:34:30 waiting for fileserver to finish scanning partition /vicepa...
03/20/2011 16:34:31 CHECKING CLONED VOLUME 536871274.
03/20/2011 16:34:31 www_logs.readonly (536871274) updated 03/20/2011 05:45
03/20/2011 16:34:31 Vnode 22: length incorrect; (is 1773498 should be 0)
03/20/2011 16:34:31 SALVAGING VOLUME 536871273.
03/20/2011 16:34:31 www_logs (536871273) updated 03/20/2011 05:45
03/20/2011 16:34:31 Vnode 22: length incorrect; changed from 1773498 to 0
03/20/2011 16:34:31 Vnode 202: length incorrect; changed from 345557 to 0
03/20/2011 16:34:31 Vnode 204: length incorrect; changed from 86723 to 0
03/20/2011 16:34:31 totalInodes 245
03/20/2011 16:34:32 Salvaged www_logs (536871273): 238 files, 241815 blocks
03/20/2011 16:34:30 2 nVolumesInInodeFile 56 
03/20/2011 16:34:30 waiting for fileserver to finish scanning partition /vicepa...
03/20/2011 16:34:31 CHECKING CLONED VOLUME 536870916.
03/20/2011 16:34:31 root.cell.readonly (536870916) updated 03/20/2011 12:39
03/20/2011 16:34:31 Vnode 264: length incorrect; (is 15840312 should be 0)
03/20/2011 16:34:31 SALVAGING VOLUME 536870915.
03/20/2011 16:34:31 root.cell (536870915) updated 03/20/2011 12:41
03/20/2011 16:34:31 Vnode 264: length incorrect; changed from 15840312 to 0
03/20/2011 16:34:31 Vnode 266: length incorrect; changed from 2589178 to 0
03/20/2011 16:34:31 Vnode 268: length incorrect; changed from 7364175 to 0
03/20/2011 16:34:31 totalInodes 191
03/20/2011 16:34:32 Salvaged root.cell (536870915): 183 files, 37402 blocks
03/20/2011 16:35:24 dispatching child to salvage volume 536870915...
03/20/2011 16:35:24 1 nVolumesInInodeFile 28 
03/20/2011 16:35:25 SALVAGING VOLUME 536870915.
03/20/2011 16:35:25 root.cell (536870915) updated 03/20/2011 12:41
03/20/2011 16:35:25 totalInodes 187
03/20/2011 16:35:25 Salvaged root.cell (536870915): 183 files, 37402 blocks
03/20/2011 16:35:25 The volume header file /vicepa/V0536870916.vol is not associated with any actual data (deleted)
03/20/2011 16:36:17 dispatching child to salvage volume 536871274...
03/20/2011 16:36:17 dispatching child to salvage volume 536871273...
03/20/2011 16:36:17 namei_ListAFSSubDirs: warning: VG 536871274 does not have a link table; salvager will recreate it.
03/20/2011 16:36:17 fileserver requested salvage of clone 536871274; scheduling salvage of volume group 536871273...
03/20/2011 16:36:17 1 nVolumesInInodeFile 28 
03/20/2011 16:36:17 SALVAGING VOLUME 536871273.
03/20/2011 16:36:17 www_logs (536871273) updated 03/20/2011 16:36
03/20/2011 16:36:17 totalInodes 242
03/20/2011 16:36:17 Salvaged www_logs (536871273): 238 files, 243808 blocks
03/20/2011 16:36:17 The volume header file /vicepa/V0536871274.vol is not associated with any actual data (deleted)
03/20/2011 16:40:14 dispatching child to salvage volume 536871273...
03/20/2011 16:40:14 2 nVolumesInInodeFile 56 
03/20/2011 16:40:14 CHECKING CLONED VOLUME 536871274.
03/20/2011 16:40:14 www_logs.readonly (536871274) updated 03/20/2011 16:39
03/20/2011 16:40:14 Vnode 22: length incorrect; (is 1794950 should be 0)
03/20/2011 16:40:14 SALVAGING VOLUME 536871273.
03/20/2011 16:40:14 www_logs (536871273) updated 03/20/2011 16:39
[.. ad infinitum ]

> I am also assuming there's no SalvageLog* files with entries anywhere
> near this time. If that's not true, include those.

Correct

> > Sun Mar 20 04:04:01 2011 VAttachVolume: Error reading diskDataHandle header for vol 536870916; error=101
> > Sun Mar 20 04:04:01 2011 Scheduling salvage for volume 536870916 on part /vicepa over SALVSYNC
> 
> "volume metadata is corrupt for 536870916"

It's a RO clone, should I just delete it and add it back?

> > Sun Mar 20 04:14:14 2011 CopyOnWrite failed: volume 536871273 in partition /vicepa  (tried reading 8192, read 0, wrote 0, errno 0) volume needs salvage
> 
> Someone wrote to a file in 536871273, and we needed to CoW the file. But
> we couldn't read all of the data from the file to make a new copy. Based
> on those values, I'd guess the file is empty on disk, but it's supposed
> to contain some data.
> 
> With a patch on the master branch, such a situation should be caught
> earlier and more information logged. But that wouldn't even really help
> too much if that were the case. You can find files like this by doing
> 'volinfo /vicepa 536871273 -filenames', and seeing the alleged length
> for each vnode, and seeing if the actual file size on disk of the
> reported filenames match.
> 
> However, salvaging is supposed to fix that, so it would be good to see
> what the salvage logs say.

Does this look normal?  I.e., a whole lot of zero-length vnodes attached
to the same vice file.

14720 Vnode 462.6331.1 cloned: 0, length: 3175 linkCount: 1 parent: 1 UFS-Filename: /vicepa/AFSIDat/d=/d3++U/+/+/C5+++g94
14784 Vnode 464.6333.1 cloned: 0, length: 210013 linkCount: 1 parent: 1 UFS-Filename: /vicepa/AFSIDat/d=/d3++U/+/+/E5+++o94
14848 Vnode 466.6335.1 cloned: 0, length: 54035 linkCount: 1 parent: 1 UFS-Filename: /vicepa/AFSIDat/d=/d3++U/+/+/G5+++w94
14912 Vnode 468.6337.1 cloned: 0, length: 4325 linkCount: 1 parent: 1 UFS-Filename: /vicepa/AFSIDat/d=/d3++U/+/+/I5+++2A4
14976 Vnode 470.6339.1 cloned: 0, length: 905 linkCount: 1 parent: 1 UFS-Filename: /vicepa/AFSIDat/d=/d3++U/+/+/K5+++AA4
15040 Vnode 472.0.0 cloned: 0, length: 0 linkCount: 0 parent: 0 UFS-Filename: /vicepa/AFSIDat/d=/d3++U/+/+/+
15104 Vnode 474.0.0 cloned: 0, length: 0 linkCount: 0 parent: 0 UFS-Filename: /vicepa/AFSIDat/d=/d3++U/+/+/+
15168 Vnode 476.0.0 cloned: 0, length: 0 linkCount: 0 parent: 0 UFS-Filename: /vicepa/AFSIDat/d=/d3++U/+/+/+
15232 Vnode 478.0.0 cloned: 0, length: 0 linkCount: 0 parent: 0 UFS-Filename: /vicepa/AFSIDat/d=/d3++U/+/+/+
15296 Vnode 480.0.0 cloned: 0, length: 0 linkCount: 0 parent: 0 UFS-Filename: /vicepa/AFSIDat/d=/d3++U/+/+/+
15360 Vnode 482.0.0 cloned: 0, length: 0 linkCount: 0 parent: 0 UFS-Filename: /vicepa/AFSIDat/d=/d3++U/+/+/+
[.. about 300 more lines ]


-- 
Ryan C. Underwood, <nemesis@icequake.net>