[OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug

Andy Cobaugh phalenor@gmail.com
Fri, 4 Mar 2011 16:23:34 -0500 (EST)


Ok, an update to the problem I alluded to this morning.

Volume name in question is pub.m.rpmforge. The .backup volume in 
particular. This volume was backup'd this morning at approx. 0005, with 
this output from vos backup:

Failed to end the transaction on the rw volume 536873153 
____: server not responding promptly
Error in vos backup command. 
____: server not responding promptly

That command returned in <5s. I then see this in VolserLog:

Fri Mar  4 00:05:15 2011 1 Volser: Clone: Recloning volume 536873153 to volume 536873155
Fri Mar  4 00:10:19 2011 trans 13950 on volume 536873153 has been idle for more than 300 seconds
Fri Mar  4 00:10:49 2011 trans 13950 on volume 536873153 has been idle for more than 330 seconds
Fri Mar  4 00:11:19 2011 trans 13950 on volume 536873153 has been idle for more than 360 seconds
Fri Mar  4 00:11:49 2011 trans 13950 on volume 536873153 has been idle for more than 390 seconds
Fri Mar  4 00:12:19 2011 trans 13950 on volume 536873153 has been idle for more than 420 seconds
Fri Mar  4 00:12:49 2011 trans 13950 on volume 536873153 has been idle for more than 450 seconds
Fri Mar  4 00:13:19 2011 trans 13950 on volume 536873153 has been idle for more than 480 seconds
Fri Mar  4 00:13:49 2011 trans 13950 on volume 536873153 has been idle for more than 510 seconds
Fri Mar  4 00:14:19 2011 trans 13950 on volume 536873153 has been idle for more than 540 seconds
Fri Mar  4 00:14:49 2011 trans 13950 on volume 536873153 has been idle for more than 570 seconds
Fri Mar  4 00:15:19 2011 trans 13950 on volume 536873153 has been idle for more than 600 seconds
Fri Mar  4 00:15:19 2011 trans 13950 on volume 536873153 has timed out

Nothing in any of the other log files, and nothing interesting in FileLog 
other than:

Mar  4 00:05:15 horvitz fileserver[2236]: VOffline: Volume 536873153 (pub.m.rpmforge) is now offline (A volume utility is running.)
Mar  4 00:05:15 horvitz fileserver[2236]: fssync: breaking all call backs for volume 536873155

(and then tsm goes to access the RW volume, at which point I guess it's 
brought back online)

Mar  4 01:00:31 horvitz fileserver[2236]: SAFS_FetchStatus,  Fid = 536873153.1.1, Host 128.118.200.6:7001, Id 117
Mar  4 01:00:31 horvitz fileserver[2236]: VOnline:  volume 536873153 (pub.m.rpmforge) attached and online

I noticed this when nagios reported that one of the volumes on this server 
was marked off-line.

Now, interestingly, I just ran another vos backup against the same volume:

$ vos backup pub.m.rpmforge
Created backup volume for pub.m.rpmforge

Fri Mar  4 16:05:14 2011 1 Volser: Clone: Recloning volume 536873153 to volume 536873155

The pub.m.rpmforge.backup is now on-line.

Subsequent backups seem to be fine.

I'm not sure how related this is to the other issue I saw, where the 
backup clone was left in a much worse state. The immediate effects of the 
vos backup are the same, but I'm still not sure what caused the demand 
salvage of the volume later during the day in that case. In that case, 
that was a home directory that was very much in use, and something 
triggered the salvage, there's just nothing in the logs to indicate why.

--andy