[OpenAFS] Re: New volumes get strange IDs and are unusable

Thu, 13 Oct 2011 12:04:47 +0200

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/11/2011 06:32 PM, Andrew Deason wrote:
> On Mon, 19 Sep 2011 20:22:17 +0200
> Torbjörn Moa <moa@fysik.su.se> wrote:
> 
>>>> : No such device
>>>> Volume does not exist on server sysafs2.physto.se as indicated by the VLDB
>>>
>>> What version? Some things used to have problems with volume IDs over
>>> 2147483648 but I thought we've fixed them all by now.
>>
>> On this particular node we run 1.4.6, but it varies between servers.
> 
> I lied, this bug still exists. At least, it does for me on a 32-bit x86
> host. What platform was this? Through a quirk of atol/atoi it doesn't
> seem to be a problem on amd64 for me, which is probably why I thought it
> wasn't a problem. (gerrit 5594, bug 130266)
> 

Ouff! Good that I didn't panic-update my servers then...

All my servers are 32-bit.
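
For reference, here is the arithmetic behind the 2147483648 limit as a
small Python sketch. The helper names are made up for illustration, and I
don't know whether the affected code path wraps or clamps the value on a
32-bit host, so both interpretations are shown as assumptions.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def fits_signed32(volume_id):
    """True if the ID survives storage in a signed 32-bit integer."""
    return 0 <= volume_id <= INT32_MAX


def wrap_signed32(volume_id):
    """Reinterpret the low 32 bits as a two's-complement signed value."""
    low = volume_id & 0xFFFFFFFF
    return low - 2**32 if low > INT32_MAX else low


def clamp_signed32(volume_id):
    """Saturate at INT32_MAX, as a clamping string-to-long conversion would."""
    return min(volume_id, INT32_MAX)


# 536936453 is the largest ID in our VLDB, 2267649774 the bumped maxvolid.
for vid in (536936453, 2267649774):
    print(vid, fits_signed32(vid), wrap_signed32(vid), clamp_signed32(vid))
```

Either way, 2267649774 does not come back unchanged from a signed 32-bit
integer, while 536936453 does.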

>>> Something bumped the "max volume id" counter in the vldb by a large
>>> number. This could happen in many different ways... unfortunately, if
>>> you don't have the logging level turned up in the vlserver or have
>>> audit logs turned on, it's going to be difficult to determine what
>>> did it. Do you run any kind of periodic checking for consistency of
>>> volumes vs vldb or anything like that?
>>
>> Hmmm, yes we do. We have a nagios check running on all servers that
>> does a "vos syncserv "$server" -d" and "vos syncvldb "$server"
>> -dryrun" periodically. I guess you are implying I shouldn't do that...
> 
> No, I don't mean to say that, but it's a possible cause. The -dryrun
> option to these does not currently prevent "vos" from raising the max
> volume id in the database. That's a bug, but it's what they currently
> do. It doesn't even print out anything when it does this, so you
> wouldn't know when it happened. (bug 130267)
> 

OK, the nagios checks are still running, and again the problem is back.
The max volume id is now 2267649774. Stupidly, I didn't keep a constant
watch on it after we reset it manually. So, mainly as a test, I will
disable the nagios checks, manually reset the maxvolid again, and then
keep watching it. If it doesn't move then, in a couple of days or so, I
may run the syncvldb and syncserv checks manually, one by one, server by
server, and see what happens. Unless you have some other suggestion.
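
The "keep watching it" part could be scripted roughly as below. This is
only a sketch under assumptions: it reads MaxVolumeId by running the
vldb_check command quoted later in this mail, it assumes the tool prints
the header on stdout or stderr, and pointing it at a copy of the database
file rather than the live one is my own precaution. The function names
and the polling interval are hypothetical.

```python
import re
import subprocess
import time


def read_max_volume_id(db_path):
    """Run vldb_check on db_path and return the MaxVolumeId header field."""
    proc = subprocess.run(
        ["vldb_check", "-database", db_path, "-vheader"],
        capture_output=True, text=True, check=False,
    )
    match = re.search(r"MaxVolumeId\s*=\s*(\d+)", proc.stdout + proc.stderr)
    if match is None:
        raise RuntimeError("MaxVolumeId not found in vldb_check output")
    return int(match.group(1))


def find_bumps(samples):
    """Given (timestamp, maxvolid) samples in time order, return a list of
    (timestamp, old, new) for every change between consecutive samples."""
    bumps = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur != prev:
            bumps.append((ts, prev, cur))
    return bumps


def watch(db_path, interval_seconds=300, reader=read_max_volume_id):
    """Poll forever and print a line whenever the value changes."""
    last = reader(db_path)
    while True:
        time.sleep(interval_seconds)
        cur = reader(db_path)
        if cur != last:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} maxvolid changed: {last} -> {cur}")
            last = cur
```

The timestamps of any reported change can then be matched against when
each syncvldb or syncserv run was started.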

For me the top priority is to find out what causes this. The problem is
not really that nothing _reports_ the maxvolid bump, but rather that it
_gets_ bumped in the first place.

Here's the output from "vldb_check -database vldb.DB0 -vheader" on one
of the vldb servers:

- --
Ubik header size is 0 (should be 64)
vldb header
   vldbversion      = 4
   headersize       = 132120 [actual=132120]
   freePtr          = 0x889ec
   eofPtr           = 559744
   allocblock calls = 3055026176
   freeblock  calls = 2769092608
   MaxVolumeId      = 2267649774
   rw vol entries   = 0
   ro vol entries   = 0
   bk vol entries   = 0
   multihome info   = 0x20418 (132120)
   server ip addr   table: size = 255 entries
   volume name hash table: size = 8191 buckets
   volume id   hash table: 3 tables with 8191 buckets each
Header's maximum volume id is 2267649774 and largest id found in VLDB is 536936453
Scanning 3783 entries for possible repairs
- --

Running "vos listvol" on all file servers and sorting the output, I find
the largest volume ID existing on any server is actually 536936451 (a RW
volume), which is consistent with what's in VLDB. So there wouldn't be a
reason for syncvldb (or anyone else for that matter) to bump maxvolid at
all, would there?
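
To put a number on the mismatch, the two values can be pulled out of the
vldb_check output with a few lines of Python. This is a minimal sketch:
the regular expressions are written against the output pasted above, the
function name is hypothetical, and other vldb_check versions may format
these lines differently.

```python
import re

# The two relevant lines from the vldb_check output above.
SAMPLE = """\
   MaxVolumeId      = 2267649774
Header's maximum volume id is 2267649774 and largest id found in VLDB is
536936453
"""


def parse_vldb_check(text):
    """Return (header MaxVolumeId, largest ID found) from vldb_check output."""
    header = re.search(r"MaxVolumeId\s*=\s*(\d+)", text)
    # \s+ also matches the line break that mail wrapping put before the number.
    largest = re.search(r"largest id found in VLDB is\s+(\d+)", text)
    if header is None or largest is None:
        raise ValueError("text does not look like vldb_check -vheader output")
    return int(header.group(1)), int(largest.group(1))


header_max, largest_used = parse_vldb_check(SAMPLE)
gap = header_max - largest_used
print(f"header max {header_max}, largest used {largest_used}, gap {gap}")
```

The gap comes out at 1730713321 IDs, far more than anything actually
allocating volumes here would account for.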

Cheers, Torbjörn
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk6Wt7UACgkQ0PwHef/zquApRQCfUKSL2j73aNE8WJecqllzUjL+
1+IAn0dKiZ5jgUTVQllRMEafqxWlsIZl
=i94P
-----END PGP SIGNATURE-----