[OpenAFS] sudden crash

Nickolai Zeldovich kolya@MIT.EDU
Tue, 30 Jul 2002 10:52:23 -0400


> what is the expected behavior if i'm reading a big
> file from a replicated RO-Volume and the server
> (I'm actually reading from) crashes?

The client will time out on the read RPC and try another one of the
read-only replicas.
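
Roughly, the failover looks like the sketch below. It is Python rather
than the cache manager's C, and the names ServerDown, read_from_replicas
and fetch are made up for illustration; only the 50-second dead time
comes from AFS_RXDEADTIME. The real code also ranks servers and tracks
their state, which is left out here.

```python
class ServerDown(Exception):
    """Raised when a read RPC gets no answer within the dead time."""


def read_from_replicas(replicas, fetch, dead_time=50):
    """Try each read-only replica in turn until one answers.

    replicas  -- ordered list of (hypothetical) server names holding RO copies
    fetch     -- hypothetical callable(server, timeout) that returns the data
                 or raises ServerDown when the RPC times out
    dead_time -- seconds to wait per server (50 mirrors AFS_RXDEADTIME)
    """
    for server in replicas:
        try:
            return server, fetch(server, dead_time)
        except ServerDown:
            continue  # this replica did not answer; move on to the next one
    # Only RO sites are in the list, so there is no fallback to the RW volume.
    raise ServerDown("no read-only replica reachable")
```

With replicas ["fs1", "fs2"] and fs1 down, the call returns the data
from fs2 after the fetch against fs1 has timed out.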

> How long will the cachemanager wait until he
> decides to choose another server where a
> RO-copy resides?

It waits for AFS_RXDEADTIME, which is declared in src/afs/afs.h and is
50 seconds.

> Will the cachemanager be able to decide that
> it's time to use the RW of that volume
> (if no more replicas are available)?

The cache manager will never fall back to the RW volume. Remember,
though, that you get a "free" replica on the partition where the RW
volume resides: a read-only replica added on that same partition is
copy-on-write, so it will not take up any additional disk space.

> Will the cachemanager read the whole file again
> or just the part not read yet?

It will not re-fetch the chunks it already fetched successfully.
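
As an illustration only, chunk-wise fetching can be sketched like this
in Python. The function read_file, the dict used as a cache and the
fetch_chunk callable are all hypothetical; the real cache manager keeps
its chunks in the disk or memory cache, not in a dict.

```python
def read_file(length, chunk_size, cache, fetch_chunk):
    """Return the file contents, fetching only chunks missing from the cache.

    length      -- file length in bytes
    chunk_size  -- chunk size in bytes
    cache       -- dict mapping chunk index -> bytes already fetched
    fetch_chunk -- hypothetical callable(index) returning that chunk's bytes
    """
    data = bytearray()
    chunk_count = (length + chunk_size - 1) // chunk_size  # round up
    for index in range(chunk_count):
        if index not in cache:
            # Only chunks that were never fetched successfully hit the server.
            cache[index] = fetch_chunk(index)
        data += cache[index]
    return bytes(data)
```

If chunks 0 and 1 were cached before the server died, a retry against
another replica asks only for chunk 2 onwards.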

> How often does the cachemanager check if the
> crashed server is available again?

The afs_CheckServerDaemon() thread tries to check each server
every PROBE_INTERVAL seconds (180 by default); in practice it
ends up being a little more than 180 seconds.
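
A toy version of such a periodic check, in Python, might look like the
following. The names probe_once, probe_loop and is_alive are invented
for the example and are not the functions afs_CheckServerDaemon()
actually calls; only the 180-second default is taken from
PROBE_INTERVAL.

```python
import time

PROBE_INTERVAL = 180  # seconds; the default mentioned above


def probe_once(servers, is_alive):
    """Probe every server once and return a {server: is_up} map."""
    return {server: bool(is_alive(server)) for server in servers}


def probe_loop(servers, is_alive, passes,
               interval=PROBE_INTERVAL, sleep=time.sleep):
    """Run a fixed number of probe passes, sleeping between them.

    is_alive -- hypothetical callable(server) returning True if it answers
    sleep    -- injectable so the loop can be exercised without waiting
    """
    status = {}
    for _ in range(passes):
        status = probe_once(servers, is_alive)
        sleep(interval)
    return status
```

A server that comes back is noticed on the next pass, so in this sketch
it can stay marked down for up to one interval after it has recovered.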

> What if the crashed server was the DB-Server
> currently used (another one is available)?
> How long will the cachemanager try to contact
> the crashed server until he decides to
> choose the other one?

Same as for file servers: AFS_RXDEADTIME, which is typically 50
seconds.

-- kolya