[OpenAFS-devel] Re: idle dead timeout processing in clients

omalleys@msu.edu omalleys@msu.edu
Fri, 09 Dec 2011 09:02:08 -0500


Quoting Russ Allbery <rra@stanford.edu>:

> Andrew Deason <adeason@sinenomine.net> writes:
>
>> And, well, "visible" in a different sense. If it takes 20 minutes for a
>> read() to return, it's not visible in the sense that the application
>> needs a code path to deal with it; AFS isn't "down" but arguably just
>> "slow". If it takes 5 seconds for read() to return, but it returns -1
>> with ETIMEDOUT, for some environments that's worse / more visible. I've
>> had someone seem completely baffled when they were told that not
>> everyone runs AFS with hardmount turned on; that not only is that
>> behavior optional, but defaults to 'off'.
>
> Yeah, I suppose it depends on the application.  If your two-week compute
> job stalls for a half-hour, you might not notice.
>
> We mostly use AFS for serving web pages, and if it takes more than twenty
> seconds, you may as well just give up and return an error message, since
> you're already past the point of recovery anyway.

The "original" timeout patch/hack was put in place simply so that a web
server wouldn't lock up to the point where you had to reboot it whenever
AFS was offline for any reason: maintenance, an AFS server crash, a
router crash, etc.  A 20-minute timeout was fine, since that is still
faster than running all over campus to reboot wedged servers.  For us,
the cause was usually scheduled AFS maintenance, an AFS crash, or a
router crash.  Typically much bigger issues.

It wasn't really meant to fix issues with people unplugging network
cables or switches, putting up firewalls, screwing around with the
routing, etc.  Those are, in theory, known, localized issues.

For that you need something more robust and better thought out than
the original patch, not just an extension of it.

I was told by the person who wrote it that it probably wasn't in the
right spot, and probably needed to be better thought out and more
robust.  But it solved the major issue that we were both having.