[OpenAFS] Re: Need volume state / fileserver / salvage knowledge
Tue, 01 Feb 2011 12:04:08 -0800
From what you have described, it sounds to me like you need the patch that Andrew referenced earlier, which allows you to configure the -offline-timeout and -offline-shutdown-timeout options on your fileservers. We have had similar problems at our site and will be releasing that patch into production shortly.
Jeff Blaine wrote:
>>>> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
>>>> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
>>>> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
>>>> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to
>>>> shutdown within 1800 seconds
>>>> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9
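For reference, the gap between the volserver exiting and the fileserver being killed can be read straight out of that BosLog excerpt. Below is a minimal Python sketch, assuming the timestamp layout is exactly as quoted above and that wrapped continuation lines carry no timestamp. The helper names parse_boslog and seconds_between are made up for illustration and are not part of OpenAFS.

```python
import re
from datetime import datetime

# Matches lines such as "Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15".
BOSLOG_LINE = re.compile(r"^(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}): (.*)$")

def parse_boslog(lines):
    """Return (timestamp, message) pairs; lines without a timestamp are skipped."""
    events = []
    for raw in lines:
        # Drop mail quoting such as ">>>> " before matching.
        line = raw.lstrip("> ").rstrip()
        m = BOSLOG_LINE.match(line)
        if not m:
            continue  # wrapped continuation line or unrelated text
        when = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
        events.append((when, m.group(2)))
    return events

def seconds_between(events, first_substr, second_substr):
    """Seconds from the first event containing first_substr to the first containing second_substr."""
    start = next(t for t, msg in events if first_substr in msg)
    end = next(t for t, msg in events if second_substr in msg)
    return (end - start).total_seconds()

LOG = """\
>>>> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
>>>> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
>>>> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
>>>> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to
>>>> shutdown within 1800 seconds
>>>> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9
"""

events = parse_boslog(LOG.splitlines())
elapsed = seconds_between(events, "fs:vol exited", "fs:file exited")
print(f"fileserver was killed {elapsed:.0f} seconds after the volserver exited")
```

Run against the excerpt, this reports a little over the 1800 seconds that bos says it waited before sending signal 9.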
> Thanks for the replies.
> I can't at all fathom that our issue is one of existing
> client connections and callback break completion (timing out).
>> Also, in this specific case, it may not be just that shutting down
>> volumes took too long. 1.4.11 has known problems that can cause this
>> (e.g. the host list gets a loop in it, and something spins forever
>> trying to traverse the whole list).
> That's this, I think?:
> - Fixes to avoid issues cleaning up deleted hosts in
> the fileserver (126454)
> Let's assume this issue is what caused our problem. I'm sort
> of at a loss as to how to approach OpenAFS versions. On one
> hand, expectations of more effort to make it clear in the
> release notes what items could cause something like unclean
> server shutdowns (kind of a big deal, IMO) are not really
> justifiable. It's open source, etc. On the other hand,
> it's not acceptable to blindly upgrade to the latest stable
> release every time it comes out. I understand that the most
> obvious take-away is just, "You got bit. Move on.", but
> if anything can improve on our end, I'd like to do that.
> I welcome any suggestions for how others are approaching this.
> Jeff Blaine