[OpenAFS] Re: Need volume state / fileserver / salvage knowledge

Steve Simmons scs@umich.edu
Mon, 7 Feb 2011 13:22:21 -0500

On Feb 1, 2011, at 3:58 PM, Andrew Deason wrote:

> On Tue, 01 Feb 2011 12:04:08 -0800
> Patricia O'Reilly <oreilly@qualcomm.com> wrote:
>> From what you have described it sounds to me like you need the patch
>> that Andrew referenced earlier that allows you to configure an
>> -offline-timeout and -offline-shutdown-timeout option on your
>> fileservers. We have had similar problems at our site and will be
>> releasing that patch into production shortly.
> Maybe, maybe not. I think the most common cause of this is just having
> more volumes than can be shut down in 30 minutes. Determining this
> is easy; if it happens every single time you shut down the fileserver,
> that's probably it. (But obviously that's not fun to do.)
> But it could also be the 1.4.11 host package bugs; I don't know, and I
> just noted that cause to illustrate that there are several possible
> reasons.

As noted earlier, we saw this at least back to our use of 1.4.8. Prior
to that we'd been doing rolling restarts, i.e., moving all the volumes
off a server before restarting it. So it may have been present earlier,
but we simply didn't hit it.
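
On the "more volumes than fit in 30 minutes" case Andrew describes above,
a rough back-of-the-envelope check can be sketched in Python. This is only
an illustration: the function name and the per-volume cost are assumptions,
not anything measured or taken from OpenAFS itself, so you would need to
plug in a figure observed on your own fileserver. Only the 30-minute window
comes from the discussion.

```python
# Rough estimate only: does taking every volume offline fit inside the
# shutdown window? The 30-minute window is the figure from this thread;
# ms_per_volume is a hypothetical number you must measure yourself.
SHUTDOWN_WINDOW_SECS = 30 * 60


def fits_in_shutdown_window(num_volumes, ms_per_volume,
                            window_secs=SHUTDOWN_WINDOW_SECS):
    """Return (estimated_seconds, fits) for a strictly serial shutdown."""
    estimated_secs = num_volumes * ms_per_volume / 1000
    return estimated_secs, estimated_secs <= window_secs


if __name__ == "__main__":
    # Hypothetical inputs: 20,000 volumes at an assumed 100 ms each.
    secs, ok = fits_in_shutdown_window(20000, 100)
    print("estimated %.0f s, fits in window: %s" % (secs, ok))
```

If the estimate is over the window every time, that points at sheer volume
count rather than a host package bug; if it is comfortably under and the
shutdown still times out, something else is going on.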

>> Jeff Blaine wrote:
>>> Thanks for the replies.
>>> I can't at all fathom that our issue is one of existing
>>> client connections and callback break completion (timing out).
> I'd only say that if you have pretty good control over all of your
> clients. It's possible to see some really bizarre behavior (from the
> fileserver's point of view) from old clients or clients on
> oddly-behaving networks or NATs.

Seconded. A number of our more savvy users (or users who have savvy IT
admins) run AFS at home, another large batch of folks are behind
NATs/firewalls, and a third small group are alumni or ex-staff who use
their AFS space from all over the world. As a proportion of overall
users that's fairly small, but as a proportion of folks whose hosts time
out during shutdown it's pretty large.

>>> Let's assume this issue is what caused our problem.  I'm sort of at
>>> a loss as to how to approach OpenAFS versions.  On one hand,
>>> expectations of more effort to make it clear in the release notes
>>> what items could cause something like unclean server shutdowns (kind
>>> of a big deal, IMO) are not really justifiable.
> This wasn't an issue causing fileserver shutdowns to hang and get
> killed, it was a general fileserver stability issue; that hang (or
> crash, or however it manifested; I've seen both) could happen at any
> time.

There were two things which seemed to make the problem more likely:
having the server up for a long time, and having lots of different hosts
using volumes from that server. We did find a log entry that was usually
a symptom of the problem about to occur, but once that entry appeared it
was too late to fix it: either the server would crash or it would get
into an infinite loop in the next few minutes to hours. Attempting to
restart the server once we'd seen it always tickled the bug; attaching to
the process with gdb and forcing a core dump was how we finally diagnosed
the bloody thing.

>>> It's open source, etc.  On the other hand, it's not acceptable to
>>> blindly upgrade to the latest stable release every time it comes
>>> out. I understand that the most obvious take-away is just, "You got
>>> bit. Move on.", but if anything can improve on our end, I'd like to
>>> do that.
> Perhaps not right when it comes out, but it can be a good idea to move
> towards them, depending on how you do your risk/change management.
> Waiting a bit after each stable release for production machines makes
> sense, to see if unknown issues crop up, but if there are significant
> issues, you will hear about it if you are paying attention (probably in
> the form of a new release, fixing the issue).
> 1.4.12 was released almost a year ago, and I don't think there are any
> significant problems besides the issues that caused 1.4.14 to be
> released. There are some smaller issues here and there that sometimes
> get hit, but there's no fix on the 1.4.x branch for 1.4.12 that would
> cause me to recommend rolling back to pre-1.4.12 if you had upgraded a
> machine to 1.4.12.

1.4.12 been bery bery good to me; there's no fix in .13/.14 that seems
to affect us. Right now we're gearing up to build a test host for the
latest 1.6 release candidate. Barring some disastrous newfound issue
with 1.4.12, 1.6 makes more sense. As noted earlier in this discussion,
dynamic attach looks like a fix for shutdown/restart timing issues.