[OpenAFS-devel] fs whereis info doesn't update when afsd is using different replicate...

aeneous@speakeasy.org aeneous@speakeasy.org
Thu, 10 May 2001 23:40:58 -0400


> Ah, I had thought the whereis results updated when servers were down.
> (i.e. preference dropped when the server was down) A bit of background -
> I'm trying to track down a problem whereby a number of my server-clients
> seem to not be recovering from server failure, even though they seem to
> be hanging on access to read-only replicated volumes.

The preference remains the same; a separate bit in the afs_server struct indicates the up/down state.  This way, once the server returns to service, the preferences are still as the user requested.
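To make the separation concrete, here is a minimal sketch in Python (not the actual cache manager code). The names Server, rank, is_down and pick_server are hypothetical, and I am assuming the usual convention that a lower rank means a more preferred server. The point is only that marking a server down never touches its rank, so the original ordering applies again as soon as the flag is cleared.

```python
from dataclasses import dataclass


@dataclass
class Server:
    # Hypothetical stand-in for the per-server record; not the real struct.
    name: str
    rank: int              # assumed: lower rank = more preferred
    is_down: bool = False  # separate up/down flag; never modifies rank


def pick_server(servers):
    """Return the most preferred server currently marked up, or None."""
    up = [s for s in servers if not s.is_down]
    return min(up, key=lambda s: s.rank) if up else None


near = Server("near", rank=10)
far = Server("far", rank=50)
replicas = [near, far]

first = pick_server(replicas).name           # both up: preferred one wins

near.is_down = True                          # outage: only the flag changes
during_outage = pick_server(replicas).name

near.is_down = False                         # recovery: old ranks still hold
after_recovery = pick_server(replicas).name
```

Because the rank is never rewritten during the outage, the client goes back to the preferred replica on its own once the server is marked up again.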

(Interestingly, this is one of the things that MS-Dfs got badly wrong.  When the Microsoft client fails over to a replica location, the original server is never used again (unless all the other replicas fail in turn).  So while you can implement some sort of preference scheme with MS-Dfs, eventually you will wind up stuck using the server which is in Tokyo.  It seems evident that nobody at Microsoft put more than a day's effort into the set of hacks they call Dfs.)

Nathan, the cache manager tries to make periodic GetTime RPC calls to "down" servers.  You can watch them easily with tcpdump, and when you do, you'll see that the call to a truly down server is repeatedly retransmitted and never receives a response.  But the call to an "up" server gets a response in the form of an RX error packet, where the error code is actually the time.  This seems kind of weird, and I have never heard anyone say whether it was intentional or not, but it works fine.

So there are four possible things happening:
1. the GetTime calls to the previously-down-but-now-up server are not returning at all.
2. the GetTime calls are returning a negative error code, which the cache manager interprets as "call timed out".
3. the GetTime calls are returning properly but the cache manager is failing to mark the server as "up".
4. the cache manager is failing to make GetTime calls at all.
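
One way to keep the four cases apart while watching a client is to write down, for each probe interval, what you saw on the wire and whether the client still treats the server as down, then map that onto the list above. The sketch below is only a bookkeeping aid in Python; the Probe enum, the diagnose function and its arguments are hypothetical names of mine, not anything in the cache manager, and the observations themselves have to come from tcpdump and from however you check the client's view of the server.

```python
from enum import Enum


class Probe(Enum):
    # What was observed on the wire for one GetTime probe (hypothetical labels).
    NO_REPLY = "no reply"        # call retransmitted, never answered
    NEGATIVE_CODE = "negative"   # reply carried a negative error code
    TIME_RETURNED = "time"       # reply carried the time as expected


def diagnose(probe_sent, outcome, marked_up):
    """Map one observed probe interval onto cases 1-4 above; 0 means healthy.

    probe_sent: True if a GetTime call to the server was seen at all.
    outcome:    a Probe value (ignored when probe_sent is False).
    marked_up:  True if the client now treats the server as up.
    """
    if not probe_sent:
        return 4  # no GetTime calls are being made
    if outcome is Probe.NO_REPLY:
        return 1  # calls are not returning at all
    if outcome is Probe.NEGATIVE_CODE:
        return 2  # negative code, read as "call timed out"
    if not marked_up:
        return 3  # proper reply, but server never marked up
    return 0      # proper reply and server marked up


# Example: probes go out and come back with the time, but the client
# still hangs on the volume, which points at case 3.
example = diagnose(True, Probe.TIME_RETURNED, marked_up=False)
```

The checks run in the same order a probe would travel: was it sent, did anything come back, what came back, and did the client act on it.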