[OpenAFS-devel] Discarding tokens -- is this good?
Robert Banz
banz@umbc.edu
Sat, 11 Nov 2006 11:29:30 -0500
I've been trying to debug a problem over the past couple days where
some machines which have some long-running jobs writing to AFS seem
to be losing their authentication...
These jobs are all running in PAGs, with a job renewing their tokens
every so often. I even have that same job logging the output of
'tokens' to syslog to verify that there are tokens there, and they
expire sometime in the way-future.
However, last night, on all of these machines, around the same time,
they seem to have lost their tokens, and a "Tokens for AFS id... have
expired" was logged to my syslog... Even though the daemon that was
running in that process' PAG had logged a message just a few minutes
earlier that the tokens were due to expire sometime the next day.
This has been happening off and on for the past couple days, at
around the same time on all of these machines, in the early morning
hours... (this also doesn't make sense, as our default token life is
24 hours, so if perchance the process renewing tokens wasn't doing
its thing, you'd expect the tokens to expire about the time it was
last started, right?)
Anyhow, I must say I've seen something like this before, in a case
where an AFS fileserver was "slow" responding to requests for various
reasons. Calls that were in flight sat in the fileserver's receive
queue for just a little too long, and by the time the fileserver
processed them they were too 'old'. The fileserver then threw an RX
error back to all of the clients involved, causing them to think
their tokens had expired and discard them. While this is a problem (no
fileserver should be so backlogged that this happens), it's also
quite a bother that a message from a single misbehaving fileserver to
a client could cause the client to toss its tokens just on its good
word...
I did some looking at the afs/afs_analyze.c code that deals with
these states. Would there be some interest in making this handling a
little smarter? Perhaps only discarding the tokens if they're
*actually* expired, and instead allowing the call that returned the
RX error to either fail or retry? (Also imagining it'd be darn nice
to log what fileserver tossed back the error, because right now I'm
at a loss on that end as well!)
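To make the idea concrete, here is a rough sketch of the kind of check I have in mind. It is not the actual afs_analyze.c logic: the names token_info, token_action and on_server_expired_error are made up for illustration, and the real cache manager structures look different. The point is only that the client compares the token's own expiry against its local clock before discarding, and logs which server sent the error.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the client's view of a token;
 * not the real OpenAFS structure. */
struct token_info {
    time_t expires;    /* absolute expiry time of the token */
};

/* Hypothetical outcome of handling a server's "expired" error. */
enum token_action {
    TOKEN_KEEP_AND_RETRY,   /* token still valid locally: keep it, fail or retry the call */
    TOKEN_DISCARD           /* token really is past its expiry: discard it */
};

/* Decide what to do when a fileserver reports that our token has expired.
 * The server's word alone is not enough: the token is discarded only if
 * the local clock agrees that it has expired. */
static enum token_action
on_server_expired_error(const struct token_info *tok, time_t now,
                        const char *server_name)
{
    assert(tok != NULL && server_name != NULL);

    /* Log which server sent the error so the admin can track it down. */
    fprintf(stderr,
            "server %s reported expired token; local expiry is in %.0f s\n",
            server_name, difftime(tok->expires, now));

    if (now >= tok->expires)
        return TOKEN_DISCARD;

    return TOKEN_KEEP_AND_RETRY;
}
```

A caller would pass in the current time, e.g. on_server_expired_error(&tok, time(NULL), "fs1.example.edu"), and only throw the token away on TOKEN_DISCARD. Whether the failed call should then be retried or simply failed is a separate policy question.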
-rob