[OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug
Andy Cobaugh
phalenor@gmail.com
Tue, 1 Mar 2011 22:38:07 -0500 (EST)
On 2011-03-01 at 20:27, Andrew Deason ( adeason@sinenomine.net ) said:
> On Tue, 1 Mar 2011 18:57:50 -0500 (EST)
> Andy Cobaugh <phalenor@gmail.com> wrote:
>
>> This has happened at least once at work on Solaris 10 x86 with a
>> .backup volume, as seen above, and at least once on one of my home
>> machines on 64bit linux with an RO clone.
>
> The volume was probably deleted during the salvage (it was already gone
> by the time of the 'zap -force'), but the fileserver still has the
> volume in an 'error' state.
>
> Could you
>
> volinfo /vicepa 536871061
> fssync-debug query 536871061
# volinfo /vicepcb 536871061
Inode 2305843649298038783: Good magic 78a1b2c5 and version 1
Inode 2305843649365147647: Good magic 99776655 and version 1
Inode 2305843649432256511: Good magic 88664433 and version 1
Inode 2305843641043648511: Good magic 99877712 and version 1
Volume header for volume 536871061 (home.gsong.backup)
stamp.magic = 78a1b2c5, stamp.version = 1
inUse = 0, inService = 0, blessed = 1, needsSalvaged = 0, dontSalvage = 0
type = 2 (backup), uniquifier = 6743359, needsCallback = 0, destroyMe = d3
id = 536871061, parentId = 536871059, cloneId = 0, backupId = 536871061, restoredFromId = 0
maxquota = 134217728, minquota = 0, maxfiles = 0, filecount = 221896, diskused = 56684296
creationDate = 1299021740 (2011/03/01.18:22:20), copyDate = 1299021740 (2011/03/01.18:22:20)
backupDate = 1299021740 (2011/03/01.18:22:20), expirationDate = 0 (1969/12/31.19:00:00)
accessDate = 1299021734 (2011/03/01.18:22:14), updateDate = 1299021636 (2011/03/01.18:20:36)
owner = 1045, accountNumber = 0
dayUse = 0; week = (0, 0, 0, 0, 0, 0, 0), dayUseDate = 0 (1969/12/31.19:00:00)
volUpdateCounter = 135816
(and I think you meant dafssync-debug. I may not have mentioned that.)
# dafssync-debug query 536871061
calling FSYNC_VolOp with command code 65543 (FSYNC_VOL_QUERY)
FSSYNC service returned 0 (SYNC_OK)
protocol header response code was 0 (SYNC_OK)
protocol reason code was 0 (SYNC_REASON_NONE)
volume = {
hashid = 536871061
header = 0
device = 79
partition = 102a75a8
linkHandle = 0
nextVnodeUnique = 0
diskDataHandle = 0
vnodeHashOffset = 79
shuttingDown = 0
goingOffline = 0
cacheCheck = 0
nUsers = 0
needsPutBack = 0
specialStatus = 0
updateTime = 0
vnodeIndex[vSmall] = {
handle = 0
bitmap = 0
bitmapSize = 0
bitmapOffset = 0
}
vnodeIndex[vLarge] = {
handle = 0
bitmap = 0
bitmapSize = 0
bitmapOffset = 0
}
updateTime = 0
attach_state = VOL_STATE_ERROR
attach_flags = VOL_IN_HASH | VOL_ON_VBYP_LIST
nWaiters = 0
chainCacheCheck = 3
salvage = {
prio = 0
reason = 0
requested = 0
scheduled = 0
}
stats = {
hash_lookups = {
hi = 0
lo = 155
}
hash_short_circuits = {
hi = 0
lo = 0
}
hdr_loads = {
hi = 0
lo = 0
}
hdr_gets = {
hi = 0
lo = 0
}
attaches = 0
soft_detaches = 0
salvages = 16
vol_ops = 1
last_attach = 0
last_get = 0
last_promote = 0
last_hdr_get = 0
last_hdr_load = 0
last_salvage = 1299019004
last_salvage_req = 1299018855
last_vol_op = 1299018890
}
vlru = {
idx = 5 (VLRU_QUEUE_INVALID)
}
pending_vol_op = 0
}
Do you want the .vol file for this volume?
> on the fileserver?
>
> I have an idea on why you can't get the volume usable again, but I have
> no clue as to what the original inconsistency was that caused the first
> salvage.
My suspicion is that a previous 'vos backup' left it in this state. The
volume group hasn't been touched other than for backups in many months.
I've never had a problem like this with that fileserver or volume until I
upgraded from 1.4.11 to 1.6.0pre2.
>> I would have included more snippets from FileLog as well, but
I have >> the debug level turned up to try to track down a possible
>
> Then you should have some logs mentioning 'FSYNC_com' around 'Tue Mar 1
> 00:02:25 2011' explaining why we refused to give out the volume. (You
> don't perhaps have that volid header also existing on another partition,
> do you?) Not that that should explain the rest of it.
Only one partition with this volume header on it. Hasn't been moved since
last March.
First time I see any FSYNC messages is this evening when I tried to fix
things. I see this line repeated 39 times:
Tue Mar 1 17:34:18 2011 FSYNC_backgroundSalvage: unable to request salvage for volume 536871061
If I look in FileLog.old (I restarted at some point to up the debug
level), I see these lines:
Tue Mar 1 16:11:34 2011 FSYNC_com: read failed; dropping connection (cnt=94804)
Tue Mar 1 16:11:34 2011 FSYNC_com: read failed; dropping connection (cnt=94805)
I would also like to note that the vos backup ocurring Sunday failed with
a timeout, then succeeded Monday, then failed today.
Command output from vos backup on Sunday:
Failed to end the transaction on the rw volume 536871059
____: server not responding promptly
Error in vos backup command.
____: server not responding promptly
Corresponding entries in VolserLog, then the successful vos backup on
Monday:
Sun Feb 27 00:06:38 2011 trans 22689 on volume 536871059 has been idle for more than 300 seconds
Sun Feb 27 00:07:08 2011 trans 22689 on volume 536871059 has been idle for more than 330 seconds
Sun Feb 27 00:07:38 2011 trans 22689 on volume 536871059 has been idle for more than 360 seconds
Sun Feb 27 00:08:08 2011 trans 22689 on volume 536871059 has been idle for more than 390 seconds
Sun Feb 27 00:08:38 2011 trans 22689 on volume 536871059 has been idle for more than 420 seconds
Sun Feb 27 00:09:08 2011 trans 22689 on volume 536871059 has been idle for more than 450 seconds
Sun Feb 27 00:09:38 2011 trans 22689 on volume 536871059 has been idle for more than 480 seconds
Sun Feb 27 00:10:08 2011 trans 22689 on volume 536871059 has been idle for more than 510 seconds
Sun Feb 27 00:10:38 2011 trans 22689 on volume 536871059 has been idle for more than 540 seconds
Sun Feb 27 00:11:08 2011 trans 22689 on volume 536871059 has been idle for more than 570 seconds
Sun Feb 27 00:11:38 2011 trans 22689 on volume 536871059 has been idle for more than 600 seconds
Sun Feb 27 00:11:38 2011 trans 22689 on volume 536871059 has timed out
...
Mon Feb 28 00:01:38 2011 1 Volser: Clone: Recloning volume 536871059 to volume 536871061
I notice the error from sunday in the remctl logs, and ran vos backup
manually just to make sure things were ok:
Mon Feb 28 09:24:03 2011 1 Volser: Clone: Recloning volume 536871059 to volume 536871061
--andy