[OpenAFS] sudden crash

Neulinger, Nathan nneul@umr.edu
Tue, 30 Jul 2002 10:00:18 -0500


On our internal builds, we typically reduce both the check-servers
interval and the rx_deadtime setting to speed recovery. That doesn't
always help, though - there do appear to be a number of cases where
the client has a hard time seeing that the server has stopped
responding...
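
For a rough feel of what lowering those two values buys, here is a
back-of-the-envelope sketch. The 50 and 180 second figures are the
defaults quoted below; the lowered values (20 and 60) are purely
illustrative and not the ones we actually build with. The function
names are made up for this example, and the model assumes that each
unresponsive server tried in a row costs one full dead time, which is
a simplification.

```python
def worst_case_read_stall(dead_servers_tried, deadtime=50):
    """Seconds a read may stall if it hits this many dead servers in a row.

    Simplified model (an assumption): each unresponsive server costs one
    full dead-time timeout before the client moves on to the next one.
    """
    return dead_servers_tried * deadtime


def nominal_recheck_delay(probe_interval=180):
    """Nominal seconds before a server marked down is probed again.

    Per the quoted message, the real delay ends up a little longer than this.
    """
    return probe_interval


if __name__ == "__main__":
    # (label, deadtime, probe interval); the second row is illustrative only.
    for label, deadtime, probe in [("defaults", 50, 180), ("lowered", 20, 60)]:
        stall = worst_case_read_stall(2, deadtime)
        recheck = nominal_recheck_delay(probe)
        print(f"{label}: stall with 2 dead replicas ~{stall}s, recheck ~{recheck}s")
```

Under this model the stall scales linearly with the dead time, so it
says nothing about the cases where the client is slow to notice the
outage in the first place.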

-- Nathan

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-4841
Computing Services                       Fax: (573) 341-4216


> -----Original Message-----
> From: Nickolai Zeldovich [mailto:kolya@MIT.EDU]
> Sent: Tuesday, July 30, 2002 9:52 AM
> To: OpenAFS-info@openafs.org
> Subject: Re: [OpenAFS] sudden crash
>
>
> > what is the expected behavior if i'm reading a big
> > file from a replicated RO-Volume and the server
> > (I'm actually reading from) crashes?
>
> The client will time out on the read RPC and try another one of the
> read-only replicas.
>
> > How long will the cachemanager wait until it
> > decides to choose another server where an
> > RO-copy resides?
>
> AFS_RXDEADTIME, declared in src/afs/afs.h, which is 50 seconds.
>
> > Will the cachemanager be able to decide that
> > it's time to use the RW of that volume
> > (if no more replicas are available)?
>
> The cache manager will never fall back to RW volumes, but remember
> that you can get a "free" replica of the RW volume on the partition
> where it resides (as long as you add the replica on the same partition,
> it will not take up any additional disk space, being copy-on-write).
>
> > Will the cachemanager read the whole file again
> > or just the part not read yet?
>
> It will not re-fetch the chunks it already fetched successfully.
>
> > How often does the cachemanager check if the
> > crashed server is available again?
>
> The afs_CheckServerDaemon() thread tries to check each server
> every PROBE_INTERVAL seconds (180 by default); in practice it
> ends up being a little more than 180 seconds.
>
> > What if the crashed server was the DB-Server
> > currently used (another one is available)?
> > How long will the cachemanager try to contact
> > the crashed server until it decides to
> > choose the other one?
>
> Same as for file servers; AFS_RXDEADTIME, which is 50 seconds
> typically.
>
> -- kolya
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
>
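
The failover behavior described in the quoted answers can be modeled
roughly as follows. This is only a toy sketch of the described
behavior, not the cache manager's actual code: the class, its methods,
and the ServerTimeout exception are all hypothetical names. It assumes
a timed-out server is marked down and skipped until a later probe
clears it, that only read-only sites are ever tried, and that chunks
already fetched are kept.

```python
from dataclasses import dataclass, field
from typing import Callable


class ServerTimeout(Exception):
    """Raised by the fetch callable when a server does not answer in time."""


@dataclass
class ReadOnlyFetcher:
    replicas: list                      # read-only sites only; the RW site is never listed
    fetch: Callable                     # fetch(server, chunk_index) -> bytes
    down: set = field(default_factory=set)
    cache: dict = field(default_factory=dict)

    def read_chunk(self, index):
        # Chunks fetched successfully earlier are not fetched again.
        if index in self.cache:
            return self.cache[index]
        for server in self.replicas:
            if server in self.down:
                continue                # skip servers already marked down
            try:
                data = self.fetch(server, index)
            except ServerTimeout:
                self.down.add(server)   # timed out: mark down, try the next replica
                continue
            self.cache[index] = data
            return data
        # No fallback to the RW volume: the read simply fails.
        raise IOError("no read-only replica is reachable")

    def probe(self, is_alive):
        # Periodic check: servers that answer again are cleared from the down set.
        self.down = {s for s in self.down if not is_alive(s)}
```

In this model a read that starts on a crashed server pays one timeout,
continues on the next replica, and later reads skip the crashed server
until a probe finds it alive again.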