[OpenAFS-devel] client-idledeadtime-support-20080430 causes connection time outs

Fri, 4 Sep 2009 11:53:20 +0200

I believe client-idledeadtime-support-20080430 is a bit unfair; at the very
least it introduced the following problem, possibly based on a wrong
assumption:

time 502.579567, pid 2195: Analyze RPC op 6 conn 0x4ae0b240 code 0xfffffffd user 0x0
time 502.579581, pid 2195: afs_Analyze out shouldRetry 0
time 502.579608, pid 2195: Returning code -3 from 21

In words: an RXAFS_RemoveFile call received RX_CALL_TIMEOUT (-3), there is no
alternative server, and hence the operation is not retried. It would have been
retried in the RX_CALL_DEAD (-1) case (and in plenty of other cases).
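
To make the trace easier to read, here is a minimal sketch of the behaviour
described above. It is not the afs_Analyze code: the function names are
hypothetical, only the two error values are taken from this message, and the
assumption that a timeout is retried when another server is available is
inferred from the description.

```c
#include <assert.h>
#include <stdint.h>

/* Error values as quoted in this message. */
#define RX_CALL_DEAD    (-1)
#define RX_CALL_TIMEOUT (-3)

/* The trace prints the code as an unsigned 32-bit value (0xfffffffd).
 * Map it back to the signed error code without relying on
 * implementation-defined narrowing. */
int32_t decode_trace_code(uint32_t raw)
{
    if (raw <= (uint32_t)INT32_MAX)
        return (int32_t)raw;
    return (int32_t)(raw - 0x80000000u) + INT32_MIN;
}

/* Hypothetical model of the retry decision as described in the text.
 * alternatives_left: number of other servers that could still be tried. */
int should_retry_as_described(int32_t code, int alternatives_left)
{
    assert(alternatives_left >= 0);
    if (code == RX_CALL_DEAD)
        return 1;                      /* retried */
    if (code == RX_CALL_TIMEOUT)
        return alternatives_left > 0;  /* assumed: only if another server exists */
    return 0;  /* other codes are outside this sketch */
}
```

With the values from the trace, decode_trace_code(0xfffffffd) gives -3, and
should_retry_as_described(-3, 0) gives 0, matching "shouldRetry 0".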

The comment in the code justifies that special treatment for RX_CALL_TIMEOUT
on the grounds that the call has timed out while the server was still
responding to other calls. From the RX code I fail to see how this is
necessarily correct: rxi_CheckCall can be called from the keepalive mechanism
and stop the call without any interaction from the server, so you can get
RX_CALL_TIMEOUT rather than e.g. RX_CALL_DEAD even if the server is
completely hosed.
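
The concern can be illustrated with a toy model. This is explicitly not the
rx implementation: the struct and field names, the thresholds, and the order
of the checks are all assumptions. It only shows that a check which runs from
a local timer and reads local timestamps can report a timeout while the
server has been silent the whole time.

```c
#include <assert.h>

#define RX_CALL_DEAD    (-1)
#define RX_CALL_TIMEOUT (-3)

/* Hypothetical per-call state: local timestamps in seconds. */
struct toy_call {
    long last_receive;  /* when a packet last arrived from the peer */
    long start_wait;    /* when the caller started waiting on the call */
};

/* Hypothetical thresholds in seconds; 0 disables a check. */
struct toy_limits {
    long dead_time;
    long idle_dead_time;
};

/* Runs from a local timer; no packet from the server is needed.
 * Returns 0 to keep the call alive, otherwise the error to abort with. */
int toy_check_call(const struct toy_call *c, const struct toy_limits *l,
                   long now)
{
    assert(c && l);
    if (l->dead_time && now - c->last_receive > l->dead_time)
        return RX_CALL_DEAD;
    if (l->idle_dead_time && now - c->start_wait > l->idle_dead_time)
        return RX_CALL_TIMEOUT;
    return 0;
}
```

In this model, with dead_time 60 and idle_dead_time 30, a server that went
silent at t=0 yields RX_CALL_TIMEOUT at t=31, before the dead-time threshold
is ever reached.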

In this particular case the server was stopped; the "hard mount" functionality
should have ensured that I/Os stay pending until the server was restarted.

Now, I don't know how to solve the yoyo up-down-up problem in case a server
times out calls selectively. When that happens, for single-server volumes the
call should be retried in most cases.

Would it be reasonable to blacklist only up to the last server and then take
the usual path, which ensures a fair retry?
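
As a rough sketch of that suggestion (not a patch against the cache manager;
all names here are hypothetical): on a timeout, a server is blacklisted only
while at least one other usable server remains, and the last usable server
is handed to the normal retry handling instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of handling a timed-out call. */
enum next_step { TRY_OTHER_SERVER, USUAL_RETRY_PATH };

/* blacklisted[i] != 0 means server i was skipped after an earlier timeout.
 * current is the index of the server whose call just timed out. */
enum next_step on_call_timeout(int *blacklisted, size_t n_servers,
                               size_t current)
{
    size_t i, usable_others = 0;

    assert(blacklisted && n_servers > 0 && current < n_servers);

    for (i = 0; i < n_servers; i++)
        if (i != current && !blacklisted[i])
            usable_others++;

    if (usable_others > 0) {
        blacklisted[current] = 1;  /* another server can take over */
        return TRY_OTHER_SERVER;
    }

    /* Last usable server: leave it unblacklisted and fall back to the
     * ordinary retry handling. */
    return USUAL_RETRY_PATH;
}
```

For a single-server volume this always returns USUAL_RETRY_PATH, so the
operation would be treated like the RX_CALL_DEAD case rather than failing
immediately.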

(Actually, I wonder whether you can ensure you never get RX_CALL_TIMEOUT if a 
server is simply restarted, which certainly deserves a retry).

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985       Fax: +41 22 767 7155