[OpenAFS] Re: - Locked volumes

ProbaNet info@probanet.it
Wed, 07 Mar 2012 10:10:51 +0100


On Tue, 06/03/2012 at 10:50 -0600, Andrew Deason wrote:
> On Tue, 06 Mar 2012 12:12:52 +0100
> ProbaNet <info@probanet.it> wrote:
> 
> > 	We have a locked volume and we are unable to unlock it (it's an
> > important one, it stores a big list of dirs / mountpoints for other
> > volumes). It has been locked during a release operation which is now
> > aborted due to a failure. We tried:
> > - vos unlock vol -verbose [CTRL+C after 20+ minutes]
> > - vos unlockvldb -verbose [CTRL+C after 20+ minutes]
> > Both commands failed (they wait forever with no output). The logs seems
> > to be "ok" (no errors in VLLog / VolserLog / etc..). We have the quorum
> 
> You get absolutely no output? Not even "Binding to the VLDB server" ?

Oops, you're right: with 'unlockvldb' there was that line of output,
"Binding to the VLDB server", but then nothing else (and in the logs,
only messages about elections).

> What happens if you try with -localauth or -noauth ?

Same results with -localauth, not tested with -noauth.

> What version and platform is this? Can you run 'pstack <vos pid>' when
> it's stuck?

Too late for the pstack (problem solved for now).. :)
The dbservers run Debian lenny + backports (afsrm1), Debian squeeze
(afsmn1, afsmn3, afsor1) and Gentoo (afsmn2).
All are x86_64 with OpenAFS 1.4.x:
for s in afsmn1 afsmn2 afsmn3 afsrm1 afsor1; do rxdebug $s 7003 -version | grep AFS; done
AFS version:  OpenAFS 1.4.12.1 built  2011-02-09
AFS version:  OpenAFS 1.4.14 built  2011-01-31
AFS version:  OpenAFS 1.4.12.1 built  2011-02-09
AFS version:  OpenAFS 1.4.12.1 built  2011-02-22
AFS version:  OpenAFS 1.4.12.1 built  2011-02-09

After a while we found the real problem with "udebug afsmn1 vlserver".
The quorum was OK (all servers vote yes for afsmn1), but afsrm1 had a
different db version (dbcurrent=0, up=1 beaconSince=1). The recovery
state was "f". No propagation was triggered, and we don't understand why.
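
For anyone who wants to repeat this check, here is a rough sketch of a
filter that lists the servers udebug reports with dbcurrent=0. The
function name stale_servers is made up for illustration, and it assumes
that each remote server is introduced by a line starting with "Server",
with the "dbcurrent=..." line following it; adjust it if your udebug
output is laid out differently.

```shell
# Sketch: print the "Server ..." line of every server reported with dbcurrent=0.
# Assumption: udebug prints a line starting with "Server" for each remote
# server, followed later by a line containing "dbcurrent=N".
stale_servers() {
    awk '
        /^Server/     { server = $0 }                    # remember the current server
        /dbcurrent=0/ { if (server != "") print server } # flag it if it is not current
    '
}

# Usage (same host and port name as in the command above):
#   udebug afsmn1 vlserver | stale_servers
```

An empty result would mean that every server listed has dbcurrent=1.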

In order to quickly solve the problem we did the following (from
afsmn1):
- bos stop afsrm1 vlserver
- scp /var/lib/openafs/db/vldb.DB0 afsrm1:/var/lib/openafs/db/
- bos restart afsmn1 vlserver [waited until quorum OK]
- bos start afsrm1 vlserver
Now "udebug afsmn1 vlserver" is perfect (Recovery state 1f, dbcurrent=1
for all, same db version for all). We could unlock the volume (and the
vlserver), and we could create / remove volumes and perform normal
operations.
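
The workaround above, written out as a small script in case it is useful
to someone else. This is only a sketch: run(), DRY_RUN and resync_vldb
are names made up here, the hosts and the path are the ones from our
setup, and by default it only prints the commands instead of running
them. The wait for quorum is left as a manual step.

```shell
# Sketch of the manual VLDB resync described above.
# DRY_RUN, run() and resync_vldb() are hypothetical helpers, not OpenAFS tools.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"          # only show what would be executed
    else
        "$@"
    fi
}

# $1 = server holding the good database (the script is run on this host)
# $2 = server whose database is stale
resync_vldb() {
    good=$1
    stale=$2
    db=/var/lib/openafs/db   # db directory on our servers

    run bos stop "$stale" vlserver
    run scp "$db/vldb.DB0" "$stale:$db/"
    run bos restart "$good" vlserver
    # Manual step: wait until "udebug $good vlserver" shows the quorum is OK.
    run bos start "$stale" vlserver
}

# Usage: resync_vldb afsmn1 afsrm1
```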

But the problem is not solved.. To test the situation we tried:
- bos stop afsrm1 vlserver
- vos create afsmn1 a test_vol [slow, but worked]
- udebug afsmn1 vlserver (db version increased in all servers but
afsrm1, as expected)
- bos start afsrm1 vlserver
- udebug afsmn1 vlserver
At this point we expected a db propagation to afsrm1, but nothing
happened in 1 hour: nothing in the logs, dbcurrent=0, different db
versions and the vldb frozen again (solved again with the scp method
described above).
Any suggestions? :) Thank you very much for your help!

Stefano
Fabio

P.S.: we are planning to turn afsrm1 and afsor1 (currently regular voting
dbservers) into non-voting clone servers: is that a simple task? Any
suggestions on how to do that? Thanks again!