[OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug
Andy Cobaugh
phalenor@gmail.com
Fri, 4 Mar 2011 16:23:34 -0500 (EST)
Ok, an update to the problem I alluded to this morning.
Volume name in question is pub.m.rpmforge. The .backup volume in
particular. This volume was backup'd this morning at approx. 0005, with
this output from vos backup:
Failed to end the transaction on the rw volume 536873153
____: server not responding promptly
Error in vos backup command.
____: server not responding promptly
That command returned in <5s. I then see this in VolserLog:
Fri Mar 4 00:05:15 2011 1 Volser: Clone: Recloning volume 536873153 to volume 536873155
Fri Mar 4 00:10:19 2011 trans 13950 on volume 536873153 has been idle for more than 300 seconds
Fri Mar 4 00:10:49 2011 trans 13950 on volume 536873153 has been idle for more than 330 seconds
Fri Mar 4 00:11:19 2011 trans 13950 on volume 536873153 has been idle for more than 360 seconds
Fri Mar 4 00:11:49 2011 trans 13950 on volume 536873153 has been idle for more than 390 seconds
Fri Mar 4 00:12:19 2011 trans 13950 on volume 536873153 has been idle for more than 420 seconds
Fri Mar 4 00:12:49 2011 trans 13950 on volume 536873153 has been idle for more than 450 seconds
Fri Mar 4 00:13:19 2011 trans 13950 on volume 536873153 has been idle for more than 480 seconds
Fri Mar 4 00:13:49 2011 trans 13950 on volume 536873153 has been idle for more than 510 seconds
Fri Mar 4 00:14:19 2011 trans 13950 on volume 536873153 has been idle for more than 540 seconds
Fri Mar 4 00:14:49 2011 trans 13950 on volume 536873153 has been idle for more than 570 seconds
Fri Mar 4 00:15:19 2011 trans 13950 on volume 536873153 has been idle for more than 600 seconds
Fri Mar 4 00:15:19 2011 trans 13950 on volume 536873153 has timed out
Nothing in any of the other log files, and nothing interesting in FileLog
other than:
Mar 4 00:05:15 horvitz fileserver[2236]: VOffline: Volume 536873153 (pub.m.rpmforge) is now offline (A volume utility is running.)
Mar 4 00:05:15 horvitz fileserver[2236]: fssync: breaking all call backs for volume 536873155
(and then tsm goes to access the RW volume, at which point I guess it's
brought back online)
Mar 4 01:00:31 horvitz fileserver[2236]: SAFS_FetchStatus, Fid = 536873153.1.1, Host 128.118.200.6:7001, Id 117
Mar 4 01:00:31 horvitz fileserver[2236]: VOnline: volume 536873153 (pub.m.rpmforge) attached and online
I noticed this when nagios reported that one of the volumes on this server
was marked off-line.
Now, interestingly, I just ran another vos backup against the same volume:
$ vos backup pub.m.rpmforge
Created backup volume for pub.m.rpmforge
Fri Mar 4 16:05:14 2011 1 Volser: Clone: Recloning volume 536873153 to volume 536873155
The pub.m.rpmforge.backup is now on-line.
Subsequent backups seem to be fine.
I'm not sure how related this is to the other issue I saw, where the
backup clone was left in a much worse state. The immediate effects of the
vos backup are the same, but I'm still not sure what caused the demand
salvage of the volume later during the day in that case. In that case,
that was a home directory that was very much in use, and something
triggered the salvage, there's just nothing in the logs to indicate why.
--andy