[OpenAFS] sudden crash
Tue, 30 Jul 2002 10:00:18 -0500
On our internal builds, we typically reduce both the check servers
interval and the rx_deadtime setting to speed recovery. That doesn't
always help, though - there do appear to be a number of cases where
the client has a hard time seeing that the server has stopped
Nathan Neulinger EMail: email@example.com
University of Missouri - Rolla Phone: (573) 341-4841
Computing Services Fax: (573) 341-4216
> -----Original Message-----
> From: Nickolai Zeldovich [mailto:kolya@MIT.EDU]
> Sent: Tuesday, July 30, 2002 9:52 AM
> To: OpenAFSfirstname.lastname@example.org
> Subject: Re: [OpenAFS] sudden crash
> > what is the expected behavior if i'm reading a big
> > file from a replicated RO-Volume and the server
> > (I'm actually reading from) crashes?
> The client will time out on the read RPC and try another one of the
> read-only replicas.
> > How long will the cachemanager wait until it
> > decides to choose another server where a
> > RO-copy resides?
> AFS_RXDEADTIME, declared in src/afs/afs.h, which is 50 seconds.
> > Will the cachemanager be able to decide that
> > it's time to use the RW of that volume
> > (if no more replicas are available)?
> The cache manager will never fall back to RW volumes, but remember
> that you get a "free" replica of the RW volume on the same partition
> as it resides (as long as you add the replica on the same partition,
> it will not take up any additional disk space, being copy-on-write).
> > Will the cachemanager read the whole file again
> > or just the part not read yet?
> It will not re-fetch the chunks it already fetched successfully.
> > How often does the cachemanager check if the
> > crashed server is available again?
> The afs_CheckServerDaemon() thread tries to check each server
> every PROBE_INTERVAL seconds (180 by default); in practice it
> ends up being a little more than 180 seconds.
> > What if the crashed server was the DB-Server
> > currently used (another one is available)?
> > How long will the cachemanager try to contact
> > the crashed server until it decides to
> > choose the other one?
> Same as for file servers: AFS_RXDEADTIME, which is 50 seconds.
> -- kolya
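
To put the two timers from the thread side by side, here is a rough back-of-the-envelope sketch in Python. The two constants are the values quoted above (AFS_RXDEADTIME = 50, PROBE_INTERVAL = 180); the function names are made up for illustration, and the model assumes each dead server tried costs one full deadtime, which is a simplification and not something the thread confirms.

```python
# Values quoted in the thread; everything else here is illustrative.
AFS_RXDEADTIME = 50    # seconds before an RPC to a dead server times out
PROBE_INTERVAL = 180   # seconds between afs_CheckServerDaemon() probes


def worst_case_failover(dead_servers_tried, deadtime=AFS_RXDEADTIME):
    """Rough upper bound, in seconds, on how long a read stalls.

    ASSUMPTION: every dead server tried costs one full deadtime
    before the client moves on to the next replica.
    """
    if dead_servers_tried < 0:
        raise ValueError("dead_servers_tried must be non-negative")
    return dead_servers_tried * deadtime


def min_worst_case_recovery_notice(probe_interval=PROBE_INTERVAL):
    """Time, in seconds, a restarted server can stay marked down.

    ASSUMPTION: the server came back right after a probe. The thread
    notes the real interval is a little longer than the nominal one,
    so treat this as a floor on the worst case.
    """
    return probe_interval


if __name__ == "__main__":
    print("stock, one dead replica:", worst_case_failover(1), "s")
    print("stock, recovery noticed within about:",
          min_worst_case_recovery_notice(), "s")
    # Hypothetical lowered settings, in the spirit of the reply above;
    # these numbers are NOT taken from the thread.
    print("lowered, one dead replica:", worst_case_failover(1, deadtime=20), "s")
```

With the stock values, a read against a single crashed replica stalls for up to about 50 seconds, and a server that comes back may go unnoticed for roughly three minutes. Lowering either setting shrinks the corresponding number in proportion.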