[OpenAFS-devel] Re: idle dead timeout processing in clients

omalleys@msu.edu
Fri, 09 Dec 2011 13:27:42 -0500


Quoting omalleys@msu.edu:

> Quoting Russ Allbery <rra@stanford.edu>:
>
>> Andrew Deason <adeason@sinenomine.net> writes:
>>
>>> And, well, "visible" in a different sense. If it takes 20 minutes for a
>>> read() to return, it's not visible in the sense that the application
>>> needs a code path to deal with it; AFS isn't "down" but arguably just
>>> "slow". If it takes 5 seconds for read() to return, but it returns -1
>>> with ETIMEDOUT, for some environments that's worse / more visible. I've
>>> seen someone completely baffled when they were told that not
>>> everyone runs AFS with hardmount turned on; that not only is that
>>> behavior optional, but that it defaults to 'off'.
>>
>> Yeah, I suppose it depends on the application.  If your two-week compute
>> job stalls for a half-hour, you might not notice.
>>
>> We mostly use AFS for serving web pages, and if it takes more than twenty
>> seconds, you may as well just give up and return an error message, since
>> you're already past the point of recovery anyway.
>
> The "original" timeout patch/hack was put in place simply so that if
> you had a web server, it wouldn't lock up to the point that you had
> to reboot it whenever AFS was offline for any reason: maintenance,
> an AFS server crash, a router crash, etc.  A 20-minute timeout was
> fine, since that is still faster than running all over campus to
> reboot wedged servers.  For us it was usually scheduled AFS
> maintenance, an AFS crash, or a router crash; typically much bigger
> issues.
>
> It wasn't really meant to fix issues with people unplugging network
> cables or switches, putting up firewalls, screwing around with the
> routing, etc.  Those are, in theory, known localized issues.
>
> For that you need something more robust and better thought out than
> the original patch, not just an extension of it.
>
> I was told by the person who wrote it that it probably wasn't in the
> right spot and probably needed to be better thought out and more
> robust.  But it solved the major issue that we were both having.
>

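To make Andrew's point above concrete, here is a minimal sketch (not
OpenAFS code; the path and buffer size are made up) of the error path
an application suddenly needs once soft mounts can make read() fail
with ETIMEDOUT:

    /* Sketch only: with AFS soft mounts ("hardmount" off), a read()
     * can fail with ETIMEDOUT instead of blocking until the
     * fileserver comes back, so callers need an error path they
     * would never hit on a local filesystem. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        int fd = open("/afs/example.edu/some/file", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n < 0) {
            if (errno == ETIMEDOUT) {
                /* Fileserver didn't answer in time: fail the request
                 * (e.g. return an error page) instead of hanging. */
                fprintf(stderr, "AFS read timed out: %s\n",
                        strerror(errno));
            } else {
                perror("read");
            }
            close(fd);
            return 1;
        }
        close(fd);
        return 0;
    }
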
I also just remembered: there is something about Windows XP's network
stack not being as robust (NT was worse); it has a tendency to drop
packets off the queue oldest-first, whereas Unix does just the
opposite.  If a Windows client makes a request to your Unix server,
and the Windows network stack then gets overloaded with requests or
the Unix server is already busy, Windows times out the oldest
request.  When the Unix server later tries to respond to that old
request, Windows disregards the reply as irrelevant because the
request is already off its queue.  This could be causing spikes in
the AFS processes.  (I can't remember whether this affects just TCP
or UDP as well.)
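
If kernel-side drops under load are part of the problem, one generic
mitigation (my assumption, not something the original patch did) is
to enlarge the UDP socket's receive buffer so fewer datagrams are
discarded before the server can drain them.  A minimal sketch; the
1 MB figure is illustrative:

    /* Sketch only: ask the kernel for a larger UDP receive buffer,
     * then read back the size actually granted (the kernel may clamp
     * or, on Linux, double the requested value). */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }
        int rcvbuf = 1 << 20;   /* request ~1 MB */
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                       &rcvbuf, sizeof(rcvbuf)) < 0)
            perror("setsockopt(SO_RCVBUF)");

        socklen_t len = sizeof(rcvbuf);
        if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                       &rcvbuf, &len) == 0)
            printf("effective receive buffer: %d bytes\n", rcvbuf);
        return 0;
    }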