[OpenAFS] Re: DAFS dasalvager: cannot be running from cron

Andrew Deason adeason@sinenomine.net
Mon, 8 Jul 2013 11:38:41 -0500


On Mon, 8 Jul 2013 08:14:29 +0000
"Brunckhorst, Ralf" <ralf.brunckhorst@hp.com> wrote:

> Is there any chance to get this also running via cron?

Short answer: dasalvager is apparently unsafe for single-volume
salvages; avoid it for now if you can. Running 'salvageserver -client'
is another way to manually salvage a single volume, which does not have
this problem.

Longer answer:

This is a bug in dasalvager where it is not initializing the structure
properly for locking volumes on disk. So, it thinks it has fd 0 already
open for locking, and tries to use that fd without opening the proper
file. When you run under a terminal, it uses whatever happens to be fd 0
(this is obviously not safe/correct). When you run under cron (or 'at'
or probably a number of other things), fd 0 happens to be closed, so we
fail. It is very fortunate that it does fail, because otherwise I don't
know when we would have discovered this.
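
To illustrate the failure mode described above, here is a minimal
sketch. The struct and function names are hypothetical stand-ins, not
the actual OpenAFS types; the only point is that a zeroed descriptor
field looks like a valid, already-open fd 0, while -1 does not.

```c
#include <string.h>

/* Hypothetical stand-in for an on-disk lock handle (not the OpenAFS type). */
struct disk_lock {
    int fd;    /* descriptor of the lock file, or -1 when not open */
};

/* Buggy setup: zeroing the struct leaves fd == 0, which is a valid
 * descriptor number (stdin), so the handle looks "already open". */
void lock_init_buggy(struct disk_lock *dl)
{
    memset(dl, 0, sizeof(*dl));
}

/* Proper setup: explicitly mark the descriptor as not open. */
void lock_init(struct disk_lock *dl)
{
    memset(dl, 0, sizeof(*dl));
    dl->fd = -1;
}

/* Returns 1 if the caller must open the lock file before locking. */
int lock_needs_open(const struct disk_lock *dl)
{
    return dl->fd < 0;
}
```

With lock_init_buggy(), lock_needs_open() returns 0, so the lock file is
never opened and whatever fd 0 happens to be gets used instead.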

Because of all of that, even when dasalvager seems to be running along
fine, it is not accessing volumes in a safe manner (it is the same bug,
just without a visible failure). So, in certain edge cases, it would be
possible for
dasalvager to cause corruption of volumes. While I haven't completely
thought through the scenarios, I believe this would only cause whole
volumes to become unusable due to volume metadata problems; it wouldn't
corrupt data inside volumes or anything like that. If you don't care, or
if this server is relatively inactive or is otherwise low-risk for some
reason, you could work around this by forcing fd 0 to be open when you
run dasalvager from cron. Obviously, I don't really recommend that, but
it should be possible.
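
As a rough sketch of what "forcing fd 0 to be open" could look like
(again, not recommended): the function below is hypothetical, and using
/dev/null opened read-write is an assumption on my part, not something
verified against dasalvager. It relies only on the POSIX behavior that
open() returns the lowest unused descriptor.

```c
#include <fcntl.h>
#include <unistd.h>

/* Make sure descriptor 0 refers to something. Returns 0 if fd 0 is
 * open on return, -1 on error. Hypothetical helper for illustration. */
int ensure_fd0_open(void)
{
    int fd;

    /* F_GETFD fails (with EBADF) only when fd 0 is not open. */
    if (fcntl(0, F_GETFD) != -1)
        return 0;

    /* With fd 0 closed, open() normally hands back descriptor 0. */
    fd = open("/dev/null", O_RDWR);
    if (fd < 0)
        return -1;

    /* Defensive: if we got some other descriptor, move it to 0. */
    if (fd != 0) {
        if (dup2(fd, 0) < 0) {
            close(fd);
            return -1;
        }
        close(fd);
    }
    return 0;
}
```

A small wrapper could call this and then exec the real command with
whatever arguments it was given. This only hides the symptom; the
locking is still not safe.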

What would probably be better is if you don't use dasalvager for
single-volume salvages at all until we can get this fixed. If you must
manually cause a single-volume salvage, you can run 'salvageserver
-client' instead of dasalvager, which uses the same code paths as
demand-salvages.

From a developer perspective:

This is due to the DAFS_FS / DAFS_UTIL mess; all of the DAFS stuff in
partition.c should be for _UTIL or _FS instead of just _FS. I started
going through fixing them, but there are so many cases that seem to need
fixing (e.g. all vutil.c references), and it's becoming clear that the
current scheme is ridiculously cumbersome and error-prone. I can think
of a few possible ways of improving this:

 - Make DAFS_FS imply DAFS_UTIL. We still have to go manually fix all of
   the instances to see if they really are _FS or should be _UTIL, but
   at least it's less verbose.

 - Remove DAFS_UTIL, make all DAFS utilities use DAFS_FS, and fix
   existing code to handle the non-pthread DAFS_FS case.
 
 - Just use DAFS_FS everywhere and make all DAFS utilities pthreaded,
   reducing the number of different codepaths. Is there any reason we
   were avoiding this? It's not like we need LWP DAFS, and any further
   granularity of "what type of program is running" should be handled at
   runtime by programType anyway.
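
For the first option in the list above, a minimal sketch of what I mean
by "imply". DAFS_FS and DAFS_UTIL are the shorthand names used in this
message, not necessarily the exact identifiers in the tree, and the
function is a hypothetical placeholder for code such as the disk lock
setup.

```c
/* Pretend this translation unit is built for the fileserver. */
#define DAFS_FS 1

/* One central place where the fileserver define implies the utility
 * define, so shared code only needs to test DAFS_UTIL. */
#if defined(DAFS_FS) && !defined(DAFS_UTIL)
# define DAFS_UTIL 1
#endif

/* Hypothetical placeholder: reports whether the shared DAFS code path
 * (needed by both the fileserver and the utilities) was compiled in. */
int dafs_shared_code_enabled(void)
{
#ifdef DAFS_UTIL
    return 1;
#else
    return 0;
#endif
}
```

Each existing "#ifdef DAFS_FS" would still need to be reviewed by hand
to decide whether it should become "#ifdef DAFS_UTIL"; this only
removes the need to spell out both conditions.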

I'll probably try the last option and see if it's more work than I
thought or if I forgot about some big blocker issue for it. I'm just
mentioning the options here in case anyone has opinions on them.

-- 
Andrew Deason
adeason@sinenomine.net