[OpenAFS] bos backupsys silently failing for one or two volumes?

Steve Simmons scs@umich.edu
Mon, 16 Jul 2012 15:07:10 -0400

It appears that bos backupsys periodically fails silently for a volume.
I'm curious whether others have seen the same issue.

We recently updated some of our AFS data-gathering and internal health
scripts to report on volumes that had not had a recent .backup volume
created. This has mostly been a win, as it finds volumes which were left
in a locked state, and so on. However, it has also uncovered another
problem: bos backupsys occasionally seems to miss a volume.

Environment: Linux From Scratch, OpenAFS 1.4.12. We have 30 servers,
with about 275,000 RW volumes, i.e., not counting .backup and .readonly
volumes.

Bos config for backupsys:

  bnode cron snapshot 1
  parm /usr/sbin/vos backupsys -se localhost -localauth
  parm 18:00
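
For reference, a bnode like that is created with the usual bos create
cron invocation; the server name below is a placeholder:

```shell
# Create the nightly backupsys cron bnode (run once, as an admin):
bos create <server> snapshot cron \
    -cmd "/usr/sbin/vos backupsys -se localhost -localauth" "18:00" \
    -localauth
```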

Our new volume health checker was implemented near the end of May. It
runs about 14 hours after backupsys. If it sees a volume with a .backup
older than 36 hours, it reports that old backup as
   g.presoff.smartgl: 1 days, 12 hrs, 32 min
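
The check amounts to the following sketch (volume name, threshold, and
timestamps are illustrative; in a real run the backupDate epoch would be
parsed out of vos examine -format output rather than hard-coded):

```shell
# Staleness check: flag a .backup older than 36 hours.
THRESHOLD_SEC=$(( 36 * 3600 ))

now=1342465200        # epoch when the checker runs (illustrative)
backup=1342333680     # backupDate epoch of the volume's .backup (illustrative)

age=$(( now - backup ))
if [ "$age" -gt "$THRESHOLD_SEC" ]; then
    # Break the age down into days/hours/minutes for the report line.
    days=$(( age / 86400 ))
    hrs=$(( (age % 86400) / 3600 ))
    min=$(( (age % 3600) / 60 ))
    echo "g.presoff.smartgl: ${days} days, ${hrs} hrs, ${min} min"
fi
```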

This particular volume first came up on June 7; vos examine -format
verified the .backup was older than expected. We did a manual vos backup
on it and it was fine for a few days. The same volume hit again on June
12, and we did a manual backup again. On June 19 it hit again. This time
we deliberately did not do a manual backup, and the next day it reported
being an additional 24 hours out of date. We forced a backup. On June 22
it again reported being out of date; we let it go, and it came up again
on the 23rd. At that point we vos moved it from one partition to another
on the same server, then back to the original partition. The next day it
once again had not gotten a .backup, but the day after (a weekend) it
did. On July 5 it again had an old .backup. We then moved it to another
server, and it has not shown the problem since. We have had several
other volumes generate similar errors, though none this persistent.

The three volumes involved were on three different file servers at the
time the .backup volumes were not created. One of them was migrated to a
different server and ceased being a problem; another threw only one
error and has not complained since. All three of the servers involved
were of the same type (hardware configuration); there are four others of
that type active, none of which have shown any errors.

Through all of this, no entries in BosLog, FileLog, or VolserLog showed
any reference to the volume aside from normal access activity. The
volumes were not locked, and normal backups (vos dump) worked without
errors. No kernel or daemon messages were generated about disk problems.

Anybody seen anything like this on their systems?