[OpenAFS] Re: [OpenAFS-devel] 1.6 and post-1.6 OpenAFS branch management and schedule

Christopher D. Clausen cclausen@acm.org
Thu, 17 Jun 2010 13:45:14 -0500


Russ Allbery <rra@stanford.edu> wrote:
> Chris, to check, are you currently using --enable-fast-restart or
> --enable-bitmap-later?

Yes, both of them.

> Please understand that neither of those options are recommended now,
> whether you have DAFS enabled or not.  I consider --enable-fast-restart in
> particular to be dangerous and likely to cause or propagate file
> corruption and would not feel comfortable ever running it in production.
> I know that some people are using the existing implementation and taking
> their chances, and if they're expert AFS administrators and know what
> they're risking, that's fine, but, as I understand it, it's pretty much
> equivalent to disabling fsck and journaling on your file systems after
> crashes and just trusting that there won't be any damage or that, if there
> is, you'll fsck when you notice it.

I have heard that, but I have never experienced any problems myself in many 
years of running that way.  In general the way I see it is that if the power 
goes out, my server stays up for a little longer due to its UPS but the 
network dies immediately so the AFS processes are not doing anything when 
the power finally dies and the server goes down a few minutes later.  (This 
is of course assuming no actual server crashes and luckily I haven't had any 
of those.)

It's fine to not have it enabled by default, but I can't see why one would 
remove the functionality from the source tree.

If you want to require a --yes-i-know-i-can-corrupt-data configure option, 
that is also fine, but requiring source code patches sounds like a major 
annoyance.

-----

I guess I don't understand the particulars of what could happen, but if one 
is really worried about sending corrupt data, wouldn't the best thing to do 
be to check the data as it is being sent, return errors then, and log that 
something is wrong, rather than require an ENTIRE VOLUME to be salvaged, 
leaving all of the files inaccessible for a potentially long period of time? 
I assume that such a thing is not possible to do?
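
To make the question concrete, here is a rough sketch of the per-file idea in 
Python.  It is not based on the actual fileserver code: read_checked, the 
in-memory store, and the CRC table are all hypothetical names, and I am 
assuming a checksum was recorded when each file was written, which I don't 
know OpenAFS does.

```python
import zlib


class CorruptFileError(Exception):
    """Raised for one damaged file; other files remain readable."""


def read_checked(store, checksums, name, log):
    """Return a file's data only if it matches its recorded checksum.

    store and checksums are hypothetical stand-ins for on-disk data and
    for checksums saved at write time.
    """
    data = store[name]
    if zlib.crc32(data) != checksums[name]:
        # Log the problem and fail this one read instead of
        # taking the whole volume offline.
        log.append("corrupt: " + name)
        raise CorruptFileError(name)
    return data


# Hypothetical in-memory "volume" holding two files.
store = {"a.txt": b"hello", "b.txt": b"world"}
checksums = {name: zlib.crc32(data) for name, data in store.items()}

# Simulate damage to one file after an unclean shutdown.
store["b.txt"] = b"w0rld"
log = []
```

With this, reading a.txt still succeeds while reading b.txt raises 
CorruptFileError and leaves a log entry, which is the behavior I was hoping 
for.  It does nothing about damaged volume metadata, which may well be the 
real concern.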

I mean, I occasionally see NTFS errors in the event log on Windows servers. 
Windows doesn't take the disk offline and run a chkdsk for me to prevent 
potential errors.  It allows me to try to access other data, and if that 
works there are no problems; it denies access only to the specific files or 
directories where there is corruption.

>> At the same time, I'd be happy to start doing more testing of the
>> various DAFS features, although I'm not quite sure what version I should
>> be using for testing,
>
> If you want to test DAFS, you need to use a 1.5 series server or (coming
> soon) a 1.6 release candidate.

Ah, excellent.  I will wait for a 1.6 release candidate.

Will DAFS be enabled by default in 1.6?  Or is that still being determined?

>> nor am I completely sure how to actually migrate an existing file server
>> to use DAFS or if there is a reverse path to downgrade if I encounter
>> problems.
>
> Migration is documented in the bos_create(8) man page as one of the
> examples.  You can do the inverse procedure to downgrade, although of
> course you'll also need to replace the server binaries with a version
> compiled without demand-attach.

Ok, so http://docs.openafs.org/Reference/8/bos_create.html is the only 
documentation on openafs.org about demand attach?

Ah, I see a http://docs.openafs.org/Reference/8/salvageserver.html as well.

Perhaps a generic dafs man page is in order, so that us non-developer types 
can get up to speed on what DAFS is, what the benefits are, and how to use it 
correctly?

<<CDC