[OpenAFS-devel] Discarding tokens -- is this good?

Robert Banz banz@umbc.edu
Sat, 11 Nov 2006 11:29:30 -0500


I've been trying to debug a problem over the past couple days where  
some machines which have some long-running jobs writing to AFS seem  
to be losing their authentication...

These jobs are all running in PAGs, with a job renewing their tokens  
every so often.  I even have that same job logging the output of  
'tokens' to syslog to verify that there are tokens there, and they  
expire sometime in the way-future.

However, last night, on all of these machines, around the same time,  
they seem to have lost their tokens, and a "Tokens for AFS id... have  
expired" was logged to my syslog... Even though the daemon that was  
running in that process' PAG had logged a message just a few minutes  
earlier that the tokens were due to expire sometime the next day.   
This has been happening off and on for the past couple days, at  
around the same time on all of these machines, in the early morning  
hours...  (this also doesn't make sense, as our default token life is  
24 hours, so if perchance the process renewing tokens wasn't doing  
its thing, you'd expect the tokens to expire about the time it was  
last started, right?)

Anyhow, I must say I've seen something like this before, in an  
instance where an AFS fileserver was "slow" responding to requests  
for various reasons.  Calls that were in flight hung out in the  
fileserver's receive queue for just a little too long, and by the  
time the fileserver processed them they were too 'old'.  The  
fileserver then threw an RX error back to all of the clients  
involved, causing them to think their tokens had expired and discard  
them...  While this is a problem (no  
fileserver should be so backlogged that this happens), it's also  
quite a bother that a message from a single misbehaving fileserver to  
a client could cause the client to toss its tokens just on its good  
word...

I did some looking at the afs/afs_analyze.c code that deals with  
these states.  Would there be some interest in making this handling a  
little smarter?  Perhaps only discarding the tokens if they're  
*actually* expired, and instead allowing the call that returned the  
RX error to either fail or retry?  (Also imagining it'd be darn nice  
to log what fileserver tossed back the error, because right now I'm  
at a loss on that end as well!)
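
To make the idea concrete, here is a rough, standalone sketch of the
decision I have in mind.  It is not the actual afs_analyze.c code:
ERR_TOKEN_EXPIRED, struct token_info, and decide_on_expired_error()
are names made up for illustration, and the real cache manager keeps
this state in its own structures.  The point is only that the client
compares the server's claim against the token's own end time before
throwing anything away, and logs which server complained.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the "ticket expired" RX security error. */
enum { ERR_TOKEN_EXPIRED = 1 };

/* What the client should do with its token after a call returns. */
enum token_action {
    TOKEN_NO_CHANGE,      /* error is unrelated to token expiry */
    TOKEN_DISCARD,        /* token really is past its end time */
    TOKEN_KEEP_AND_RETRY  /* server's claim is contradicted locally */
};

/* Hypothetical, minimal view of a cached token. */
struct token_info {
    time_t end_time;      /* expiry time recorded with the token */
};

/*
 * Decide how to react when a fileserver reports an error.  The token
 * is discarded only if its own end time has passed; otherwise the
 * server is logged and the token is kept, so the caller can retry or
 * fail just that one call.
 */
enum token_action
decide_on_expired_error(int err, const struct token_info *tok,
                        time_t now, const char *server_name)
{
    assert(tok != NULL);
    assert(server_name != NULL);

    if (err != ERR_TOKEN_EXPIRED)
        return TOKEN_NO_CHANGE;

    if (tok->end_time <= now)
        return TOKEN_DISCARD;

    /* Record which server complained, since that is otherwise lost. */
    fprintf(stderr,
            "server %s reported expired tokens, but the local token "
            "is valid for %ld more seconds; keeping it\n",
            server_name, (long)(tok->end_time - now));
    return TOKEN_KEEP_AND_RETRY;
}
```

With something like this, a backlogged fileserver could still make an
individual call fail, but it could not make a client drop tokens that
the client itself can see are still good.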

-rob