[OpenAFS] Re: Need volume state / fileserver / salvage knowledge

Jeff Blaine jblaine@kickflop.net
Tue, 01 Feb 2011 14:39:18 -0500


>>> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
>>> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
>>> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
>>> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown within 1800 seconds
>>> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9

Thanks for the replies.

I can't fathom that our issue is one of existing client
connections and callback breaks timing out.

 > Also, in this specific case, it may not be just that shutting down
 > volumes took too long. 1.4.11 has known problems that can cause this
 > (e.g. the host list gets a loop in it, and something spins forever
 > trying to traverse the whole list).

That's this item, I think:

     - Fixes to avoid issues cleaning up deleted hosts in
       the fileserver (126454)

Let's assume this issue is what caused our problem.  I'm sort
of at a loss as to how to approach OpenAFS versions.  On one
hand, I can't really justify expecting more effort in the
release notes to make clear which items could cause something
like unclean server shutdowns (kind of a big deal, IMO).  It's
open source, etc.  On the other hand, it's not acceptable to
blindly upgrade to the latest stable release every time one
comes out.  I understand that the most obvious take-away is
just "You got bit.  Move on.", but if anything can improve on
our end, I'd like to do that.

I welcome any suggestions for how others are approaching this.

Jeff Blaine