[OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug

Andy Cobaugh phalenor@gmail.com
Tue, 1 Mar 2011 22:38:07 -0500 (EST)


On 2011-03-01 at 20:27, Andrew Deason ( adeason@sinenomine.net ) said:
> On Tue, 1 Mar 2011 18:57:50 -0500 (EST)
> Andy Cobaugh <phalenor@gmail.com> wrote:
>
>> This has happened at least once at work on Solaris 10 x86 with a
>> .backup volume, as seen above, and at least once on one of my home
>> machines on 64bit linux with an RO clone.
>
> The volume was probably deleted during the salvage (it was already gone
> by the time of the 'zap -force'), but the fileserver still has the
> volume in an 'error' state.
>
> Could you
>
> volinfo /vicepa 536871061
> fssync-debug query 536871061

# volinfo /vicepcb 536871061
Inode 2305843649298038783: Good magic 78a1b2c5 and version 1
Inode 2305843649365147647: Good magic 99776655 and version 1
Inode 2305843649432256511: Good magic 88664433 and version 1
Inode 2305843641043648511: Good magic 99877712 and version 1
Volume header for volume 536871061 (home.gsong.backup)
stamp.magic = 78a1b2c5, stamp.version = 1
inUse = 0, inService = 0, blessed = 1, needsSalvaged = 0, dontSalvage = 0
type = 2 (backup), uniquifier = 6743359, needsCallback = 0, destroyMe = d3
id = 536871061, parentId = 536871059, cloneId = 0, backupId = 536871061, restoredFromId = 0
maxquota = 134217728, minquota = 0, maxfiles = 0, filecount = 221896, diskused = 56684296
creationDate = 1299021740 (2011/03/01.18:22:20), copyDate = 1299021740 (2011/03/01.18:22:20)
backupDate = 1299021740 (2011/03/01.18:22:20), expirationDate = 0 (1969/12/31.19:00:00)
accessDate = 1299021734 (2011/03/01.18:22:14), updateDate = 1299021636 (2011/03/01.18:20:36)
owner = 1045, accountNumber = 0 
dayUse = 0; week = (0, 0, 0, 0, 0, 0, 0), dayUseDate = 0 (1969/12/31.19:00:00)
volUpdateCounter = 135816


(and I think you meant dafssync-debug. I may not have mentioned that.)

# dafssync-debug query 536871061
calling FSYNC_VolOp with command code 65543 (FSYNC_VOL_QUERY)
FSSYNC service returned 0 (SYNC_OK)
protocol header response code was 0 (SYNC_OK)
protocol reason code was 0 (SYNC_REASON_NONE)
volume = {
         hashid          = 536871061
         header          = 0
         device          = 79
         partition       = 102a75a8
         linkHandle      = 0
         nextVnodeUnique = 0
         diskDataHandle  = 0
         vnodeHashOffset = 79
         shuttingDown    = 0
         goingOffline    = 0
         cacheCheck      = 0
         nUsers          = 0
         needsPutBack    = 0
         specialStatus   = 0
         updateTime      = 0
         vnodeIndex[vSmall] = {
                 handle       = 0
                 bitmap       = 0
                 bitmapSize   = 0
                 bitmapOffset = 0
         }
         vnodeIndex[vLarge] = {
                 handle       = 0
                 bitmap       = 0
                 bitmapSize   = 0
                 bitmapOffset = 0
         }
         updateTime      = 0
         attach_state    = VOL_STATE_ERROR
         attach_flags    = VOL_IN_HASH | VOL_ON_VBYP_LIST
         nWaiters        = 0
         chainCacheCheck = 3
         salvage = {
                 prio      = 0
                 reason    = 0
                 requested = 0
                 scheduled = 0
         }
         stats = {
                 hash_lookups = {
                         hi = 0
                         lo = 155
                 }
                 hash_short_circuits = {
                         hi = 0
                         lo = 0
                 }
                 hdr_loads = {
                         hi = 0
                         lo = 0
                 }
                 hdr_gets = {
                         hi = 0
                         lo = 0
                 }
                 attaches         = 0
                 soft_detaches    = 0
                 salvages         = 16
                 vol_ops          = 1
                 last_attach      = 0
                 last_get         = 0
                 last_promote     = 0
                 last_hdr_get     = 0
                 last_hdr_load    = 0
                 last_salvage     = 1299019004
                 last_salvage_req = 1299018855
                 last_vol_op      = 1299018890
         }
         vlru = {
                 idx = 5 (VLRU_QUEUE_INVALID)
         }
         pending_vol_op  = 0
}
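(For what it's worth, the interesting fields can be pulled out of a saved transcript like the one above with ordinary text tools. This is just my own post-processing over pasted sample lines, not part of fssync-debug or any AFS utility; the temp file name is made up.)

```shell
# Extract the attach state and salvage count from a saved
# dafssync-debug transcript. Sample lines are pasted from the
# query output above; file name and awk pipeline are my own.
cat > /tmp/fssync-query.txt <<'EOF'
         attach_state    = VOL_STATE_ERROR
         attach_flags    = VOL_IN_HASH | VOL_ON_VBYP_LIST
                 salvages         = 16
EOF
# Fields split on whitespace: $1 = name, $2 = "=", $3 = value.
awk '/attach_state|salvages/ { print $1 "=" $3 }' /tmp/fssync-query.txt
```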


Do you want the .vol file for this volume?

> on the fileserver?
>
> I have an idea on why you can't get the volume usable again, but I have
> no clue as to what the original inconsistency was that caused the first
> salvage.

My suspicion is that a previous 'vos backup' left it in this state. The 
volume group hasn't been touched other than for backups in many months. 
I've never had a problem like this with that fileserver or volume until I 
upgraded from 1.4.11 to 1.6.0pre2.


>> I would have included more snippets from FileLog as well, but I have
>> the debug level turned up to try to track down a possible
>
> Then you should have some logs mentioning 'FSYNC_com' around 'Tue Mar  1
> 00:02:25 2011' explaining why we refused to give out the volume.  (You
> don't perhaps have that volid header also existing on another partition,
> do you?) Not that that should explain the rest of it.

Only one partition with this volume header on it. Hasn't been moved since 
last March.

The first time I see any FSYNC messages is this evening, when I tried to 
fix things. This line is repeated 39 times:

Tue Mar  1 17:34:18 2011 FSYNC_backgroundSalvage: unable to request salvage for volume 536871061

If I look in FileLog.old (I restarted at some point to up the debug 
level), I see these lines:

Tue Mar  1 16:11:34 2011 FSYNC_com:  read failed; dropping connection (cnt=94804)
Tue Mar  1 16:11:34 2011 FSYNC_com:  read failed; dropping connection (cnt=94805)
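(In case it helps, here's how I counted the repeats: strip the fixed-width 25-character timestamp and collapse duplicate messages. The sample lines are copied from the logs quoted above; the temp file and pipeline are mine, not an OpenAFS tool.)

```shell
# Count repeated FSYNC messages in a FileLog by dropping the
# leading timestamp ("Tue Mar  1 17:34:18 2011 " = 25 chars)
# and tallying identical messages. Sample lines from above.
cat > /tmp/FileLog.sample <<'EOF'
Tue Mar  1 17:34:18 2011 FSYNC_backgroundSalvage: unable to request salvage for volume 536871061
Tue Mar  1 17:34:19 2011 FSYNC_backgroundSalvage: unable to request salvage for volume 536871061
Tue Mar  1 16:11:34 2011 FSYNC_com:  read failed; dropping connection (cnt=94804)
EOF
sed 's/^.\{25\}//' /tmp/FileLog.sample | sort | uniq -c | sort -rn
```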

I would also like to note that the vos backup occurring Sunday failed with 
a timeout, then succeeded Monday, then failed again today.

Command output from vos backup on Sunday:

Failed to end the transaction on the rw volume 536871059 
____: server not responding promptly 
Error in vos backup command.
____: server not responding promptly

Corresponding entries in VolserLog, then the successful vos backup on 
Monday:

Sun Feb 27 00:06:38 2011 trans 22689 on volume 536871059 has been idle for more than 300 seconds
Sun Feb 27 00:07:08 2011 trans 22689 on volume 536871059 has been idle for more than 330 seconds
Sun Feb 27 00:07:38 2011 trans 22689 on volume 536871059 has been idle for more than 360 seconds
Sun Feb 27 00:08:08 2011 trans 22689 on volume 536871059 has been idle for more than 390 seconds
Sun Feb 27 00:08:38 2011 trans 22689 on volume 536871059 has been idle for more than 420 seconds
Sun Feb 27 00:09:08 2011 trans 22689 on volume 536871059 has been idle for more than 450 seconds
Sun Feb 27 00:09:38 2011 trans 22689 on volume 536871059 has been idle for more than 480 seconds
Sun Feb 27 00:10:08 2011 trans 22689 on volume 536871059 has been idle for more than 510 seconds
Sun Feb 27 00:10:38 2011 trans 22689 on volume 536871059 has been idle for more than 540 seconds
Sun Feb 27 00:11:08 2011 trans 22689 on volume 536871059 has been idle for more than 570 seconds
Sun Feb 27 00:11:38 2011 trans 22689 on volume 536871059 has been idle for more than 600 seconds
Sun Feb 27 00:11:38 2011 trans 22689 on volume 536871059 has timed out
...
Mon Feb 28 00:01:38 2011 1 Volser: Clone: Recloning volume 536871059 to volume 536871061
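(A quick way to separate the actual timeouts from the escalating "idle for more than N seconds" warnings, if you want to scan a longer VolserLog. Sample lines are from the snippet above; on a Transarc-style install the real log would be /usr/afs/logs/VolserLog, but that path is an assumption.)

```shell
# Pick out volserver transactions that actually timed out, as
# opposed to the periodic idle warnings that precede them.
# Sample lines copied from the VolserLog excerpt above.
cat > /tmp/VolserLog.sample <<'EOF'
Sun Feb 27 00:11:08 2011 trans 22689 on volume 536871059 has been idle for more than 570 seconds
Sun Feb 27 00:11:38 2011 trans 22689 on volume 536871059 has been idle for more than 600 seconds
Sun Feb 27 00:11:38 2011 trans 22689 on volume 536871059 has timed out
EOF
grep 'has timed out' /tmp/VolserLog.sample
```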

I noticed the error from Sunday in the remctl logs, and ran vos backup 
manually just to make sure things were OK:

Mon Feb 28 09:24:03 2011 1 Volser: Clone: Recloning volume 536871059 to volume 536871061

--andy