[OpenAFS] Re: Need volume state / fileserver / salvage knowledge

Andrew Deason adeason@sinenomine.net
Tue, 1 Feb 2011 14:58:24 -0600


On Tue, 01 Feb 2011 12:04:08 -0800
Patricia O'Reilly <oreilly@qualcomm.com> wrote:

> From what you have described it sounds to me like you need the patch
> that Andrew referenced earlier that allows you to configure an
> -offline-timeout and -offline-shutdown-timeout option on your
> fileservers. We have had similar problems at our site and will be
> releasing that patch into production shortly.

Maybe, maybe not. I think the most common cause of this is just having
more volumes than can be shut down in 30 minutes. Determining this is
easy; if it happens every single time you shut down the fileserver,
that's probably it. (But obviously that's not fun to do.)
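
As a rough back-of-the-envelope check of the "more volumes than fit in 30
minutes" case, something like the following sketch can help. It assumes
volumes go offline one at a time at a roughly constant per-volume cost,
which is a simplification; the function name and the example figures are
hypothetical, not measurements from any real fileserver.

```python
def fits_in_shutdown_window(num_volumes, secs_per_volume, window_secs=30 * 60):
    """Estimate a serial shutdown and compare it to the shutdown window.

    Assumption: volumes are taken offline one after another, each costing
    about secs_per_volume. Returns (estimated_secs, fits).
    """
    estimated = num_volumes * secs_per_volume
    return estimated, estimated <= window_secs


# Hypothetical figures for illustration only.
estimated, fits = fits_in_shutdown_window(num_volumes=40000, secs_per_volume=0.05)
print(f"estimated {estimated:.0f}s, fits in 30-minute window: {fits}")
```

If the estimate is well under the window and shutdowns still get killed,
one of the other causes is more likely.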

But it could also be the 1.4.11 host package bugs; I don't know, and I
just noted that cause to illustrate that there are several possible
reasons.

> Jeff Blaine wrote:
> > 
> > Thanks for the replies.
> > 
> > I can't at all fathom that our issue is one of existing
> > client connections and callback break completion (timing out).

I'd only say that if you have pretty good control over all of your
clients. It's possible to see some really bizarre behavior (from the
fileserver's point of view) from old clients or clients on
oddly-behaving networks or NATs.

> >> Also, in this specific case, it may not be just that shutting down
> >> volumes took too long. 1.4.11 has known problems that can cause this
> >> (e.g. the host list gets a loop in it, and something spins forever
> >> trying to traverse the whole list).
> > 
> > That's this, I think?:
> > 
> >     - Fixes to avoid issues cleaning up deleted hosts in
> >       the fileserver (126454)

There were a few issues; all of the ones known to cause problems are
included in 1.4.12. I don't have references for all of them off the top
of my head, but I can get them for you if you want.
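
For illustration only, the failure mode quoted above (a loop in a linked
list, so that a full traversal never terminates) can be sketched like
this. This is not OpenAFS code: the Host class and has_loop function are
hypothetical names, and Floyd's cycle detection is used here just to show
what "a loop in the list" means, not to describe how the issues were
fixed in 1.4.12.

```python
class Host:
    """Hypothetical stand-in for a host entry; not the OpenAFS structure."""

    def __init__(self, name):
        self.name = name
        self.next = None


def has_loop(head):
    """Return True if following .next from head ever revisits a node.

    Floyd's algorithm: advance one pointer by one step and another by two;
    they can only meet again if the list contains a loop.
    """
    slow = fast = head
    while fast is not None and fast.next is not None:
        slow = slow.next
        fast = fast.next.next
        if slow is fast:
            return True
    return False


a, b, c = Host("a"), Host("b"), Host("c")
a.next, b.next = b, c      # a -> b -> c -> None: traversal terminates
print(has_loop(a))
c.next = b                 # c now points back to b: a naive walk spins forever
print(has_loop(a))
```

A naive "while node is not None" walk over the second list would never
return, which is the kind of spin described above.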

> > Let's assume this issue is what caused our problem.  I'm sort of at
> > a loss as to how to approach OpenAFS versions.  On one hand,
> > expectations of more effort to make it clear in the release notes
> > what items could cause something like unclean server shutdowns (kind
> > of a big deal, IMO) are not really justifiable.

This wasn't an issue causing fileserver shutdowns to hang and get
killed; it was a general fileserver stability issue. That hang (or
crash, or however it manifested; I've seen both) could happen at any
time.

And doing something like that actually isn't that difficult for at least
most of the issues I am involved with. I already generally know which
versions are affected for the bigger issues, so just writing that down
would not be that hard. (But going back through all of the changes
between 1.4.Z and 1.4 head would be a lot of work at this point.) But
that's not true for all changes, and I think it would be prohibitively
difficult if we had to include information like that with every single
change to the stable branch.

I'm not sure how useful it is, though. In the specific case of the host
list issues, the only meaningful thing I can say is that "sometimes the
fileserver crashes". It's not really possible for you to know how
susceptible you are to it (unless you get hit by it), because the
circumstances required to trigger the crash are rather complex, and they
involve access patterns of clients that you generally cannot control or
even detect.

> > It's open source, etc.  On the other hand, it's not acceptable to
> > blindly upgrade to the latest stable release every time it comes
> > out. I understand that the most obvious take-away is just, "You got
> > bit. Move on.", but if anything can improve on our end, I'd like to
> > do that.

Perhaps not right when it comes out, but it can be a good idea to move
towards the latest stable release, depending on how you do your
risk/change management.
Waiting a bit after each stable release for production machines makes
sense, to see if unknown issues crop up, but if there are significant
issues, you will hear about it if you are paying attention (probably in
the form of a new release, fixing the issue).

1.4.12 was released almost a year ago, and I don't think there are any
significant problems besides the issues that caused 1.4.14 to be
released. There are some smaller issues here and there that sometimes
get hit, but nothing fixed on the 1.4.x branch since 1.4.12 is serious
enough that I would recommend rolling back to pre-1.4.12 if you had
upgraded a machine to 1.4.12.

-- 
Andrew Deason
adeason@sinenomine.net