Patricia O'Reilly oreilly@qualcomm.com
Tue, 01 Feb 2011 12:04:08 -0800

From what you have described, it sounds to me like you need the patch Andrew referenced earlier, which lets you configure the -offline-timeout and -offline-shutdown-timeout options on your fileservers. We have had similar problems at our site and will be releasing that patch into production shortly.


Jeff Blaine wrote:
>>>> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
>>>> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
>>>> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
>>>> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to
>>>> shutdown within 1800 seconds
>>>> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9
> Thanks for the replies.
> I can't at all fathom that our issue is one of existing
> client connections and callback break completion (timing out).
>> Also, in this specific case, it may not be just that shutting down
>> volumes took too long. 1.4.11 has known problems that can cause this
>> (e.g. the host list gets a loop in it, and something spins forever
>> trying to traverse the whole list).
> That's this, I think?:
>     - Fixes to avoid issues cleaning up deleted hosts in
>       the fileserver (126454)
> Let's assume this issue is what caused our problem.  I'm sort
> of at a loss as to how to approach OpenAFS versions.  On one
> hand, expectations of more effort to make it clear in the
> release notes what items could cause something like unclean
> server shutdowns (kind of a big deal, IMO) are not really
> justifiable.  It's open source, etc.  On the other hand,
> it's not acceptable to blindly upgrade to the latest stable
> release every time it comes out.  I understand that the most
> obvious take-away is just, "You got bit.  Move on.", but
> if anything can improve on our end, I'd like to do that.
> I welcome any suggestions for how others are approaching this.
> Jeff Blaine
