[OpenAFS] [Q] Some questions on AFS...

Paul Blackburn mpb@est.ibm.com
Fri, 25 Oct 2002 09:47:36 +0100


S.J.Chun wrote:

>Hi,
>
>If the master AFS server goes down (or crashes), what happens to users and the
>other servers? How can I set up a failover mechanism for the master AFS server?
>And how can I extend the lifetime of a token?
>
>Thanks in advance.
>
Hello,

If you have only one AFS server and it crashes then you have problems
(at least until it is rebooted).


What we do is to have three dedicated AFS database servers
and separate dedicated AFS fileservers.

AFS database servers

By having multiple AFS database servers you provide
highly available services for your AFS cell.

If one crashes then the others continue to provide service.
Clients may or may not notice a slight delay if one out
of three database servers is down. This is because when
a client needs to communicate with a DB server, it chooses
one at random from the CellServDB list. If the selected
DB server does not respond, then after a timeout delay
the client selects another.
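To make the client behaviour above concrete, here is a minimal Python sketch that models it: try the DB servers in random order and move on to the next one when a server does not answer. The server names and the `responds` callback are hypothetical stand-ins for the CellServDB entries and a real RPC with a timeout. This is not the actual AFS client code.

```python
import random
from typing import Callable, Sequence, Tuple


def pick_db_server(
    servers: Sequence[str],
    responds: Callable[[str], bool],
    rng: random.Random,
) -> Tuple[str, int]:
    """Return (chosen server, number of servers that timed out first)."""
    order = list(servers)
    rng.shuffle(order)  # pick servers in random order, as described above
    timeouts = 0
    for server in order:
        if responds(server):
            return server, timeouts
        # A real client would wait out an RPC timeout here before moving on.
        timeouts += 1
    raise ConnectionError("no database server responded")


# Hypothetical cell: three DB servers, one of them down.
db_servers = ["db1.example.com", "db2.example.com", "db3.example.com"]
down = {"db2.example.com"}

server, timeouts = pick_db_server(db_servers, lambda s: s not in down, random.Random())
print(f"talking to {server} after {timeouts} timeout(s)")
```

With one of three servers down, the client either picks a live server straight away or pays a single timeout first. That single timeout is the "slight delay" some clients may notice.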

There needs to be synchronization of data between the DB servers.
This is achieved using "Ubik".

Multiple DB servers have a voting system to decide
which one will be the "sync site" (or master DB server).

If this "sync site" db server fails, the remaining DB servers
vote between themselves to decide a new "sync site".

So, when you recover a failed DB server, it automagically
re-joins the "Ubik" synchronisation and "sync site" voting.
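A simplified way to see why losing one of three DB servers is harmless is a plain majority rule. In this model, a sync site can only be elected if it gets votes from more than half of the configured DB servers. This is an assumption for illustration only. Real Ubik has further details (vote timeouts, for example) that this sketch ignores.

```python
def has_quorum(total_servers: int, reachable_servers: int) -> bool:
    """Simplified rule: a sync site needs votes from a strict majority of DB servers."""
    return 2 * reachable_servers > total_servers


# A hypothetical cell with three DB servers, as described above.
for up in (3, 2, 1):
    print(f"{up} of 3 DB servers up -> sync site possible: {has_quorum(3, up)}")
# 3 of 3 DB servers up -> sync site possible: True
# 2 of 3 DB servers up -> sync site possible: True
# 1 of 3 DB servers up -> sync site possible: False
```

Under this model, the two surviving servers out of three still form a majority and can agree on a new sync site. This matches the behaviour described above.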

Another advantage of having dedicated DB servers
is performance: the processing load is distributed
over several machines. Also, by running only AFS DB
processes (e.g. no general user logins or other services such as web servers),
you provide optimum AFS service that won't be
degraded by non-AFS processing.

You can use relatively low-cost machines for DB servers.
They don't need much disk space.

Fileservers

You can improve the robustness of access to /afs/@cell/
by having several dedicated AFS fileservers and creating
replicated ReadOnly copies of your ReadWrite volumes.

Typically, you use this for data that is "read-mostly".
For example: the top level directories of your cell
like root.afs, root.cell, and (if you have one) root.othercells.

Once you have replicated the root.cell volume onto two
or more fileservers then access to /afs/@cell/ will still
work if one fileserver is down.
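Replicating a volume such as root.cell is normally done with the `vos addsite` command (to define a ReadOnly site on each fileserver) followed by `vos release` (to copy the ReadWrite data out to those sites). The Python sketch below only builds and prints those command lines rather than running them. The fileserver names and partitions are hypothetical, and the real commands must be run with AFS administrator privileges.

```python
from typing import List, Sequence, Tuple


def replication_commands(volume: str, sites: Sequence[Tuple[str, str]]) -> List[List[str]]:
    """Build 'vos addsite' for each (server, partition) site, then one 'vos release'."""
    cmds = [
        ["vos", "addsite", "-server", server, "-partition", partition, "-id", volume]
        for server, partition in sites
    ]
    # Releasing pushes the current ReadWrite contents to every ReadOnly site.
    cmds.append(["vos", "release", "-id", volume])
    return cmds


# Hypothetical fileservers and partitions.
sites = [("fs1.example.com", "/vicepa"), ("fs2.example.com", "/vicepa")]
for cmd in replication_commands("root.cell", sites):
    print(" ".join(cmd))
```

Generating the commands first makes it easy to review them before running them.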

It turns out that there is a lot of "read-mostly" data
which you can replicate: top-level directories, executables
and scripts, documentation, HTML pages and graphics.
The best way to replicate depends on your file content.

One point to remember: it is not effective to
keep replicated ReadOnly copies of dynamic data
(for example, personal home directories), because a ReadOnly
copy only changes when its volume is released again.

However, once you have multiple fileservers you
have reduced your dependency on a single machine,
and so the impact of a single machine failing
is much smaller.

One point about fileservers: unlike database servers,
they do not have a voting system.

One of the really neat things about AFS is that you can
move AFS volumes between fileservers without any impact
on your "live" users.

So, you could buy a new, high-specification fileserver,
add it to your AFS cell, and move all AFS volumes
off an old fileserver to the new one with no outage
of your AFS services to your "live" users.
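Draining an old fileserver comes down to one `vos move` per ReadWrite volume. The Python sketch below builds those command lines for review. The volume names, servers and partitions are hypothetical. In practice the volume list would come from something like `vos listvol -server` for the old machine. ReadOnly sites on the old server need separate handling, for example by adding sites on the new server with `vos addsite` and running `vos release`.

```python
from typing import List, Sequence


def drain_plan(
    volumes: Sequence[str],
    from_server: str,
    from_partition: str,
    to_server: str,
    to_partition: str,
) -> List[List[str]]:
    """Build one 'vos move' command per ReadWrite volume on the old server."""
    return [
        [
            "vos", "move", "-id", volume,
            "-fromserver", from_server, "-frompartition", from_partition,
            "-toserver", to_server, "-topartition", to_partition,
        ]
        for volume in volumes
    ]


# Hypothetical volume names, servers and partitions.
plan = drain_plan(["user.alice", "proj.web"], "oldfs.example.com", "/vicepa",
                  "newfs.example.com", "/vicepb")
for cmd in plan:
    print(" ".join(cmd))
```

Running the moves one at a time keeps the load on both fileservers modest while users carry on working.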

AFS gives you excellent ways to manage your file service.

So, the "failover" you asked about is _free_
if you build your AFS cell in a robust way.

> And how can I extend the lifetime of a token?

There are a few ways, depending on what you want to do.
AFS administrators can alter the maximum token lifetime using the kas command
(kas setfields with its -lifetime option).
If you want to run some AFS-authenticated task for a long time,
you could use an automatic re-authentication process (like reauth).
There are also ways to get AFS authentication for long-running batch jobs,
e.g. http://www.lam-mpi.org/software/psr/
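One practical point when raising token lifetimes: the token a user actually gets is capped by several limits. These include the lifetime the user requests and the maximum lifetimes recorded in the Authentication Database, for example for the user's own entry and for the afs service entry. The Python sketch below is a simplified model that assumes the effective lifetime is the smallest of those limits. Check the klog and kas setfields documentation for your release for the exact list. The numbers are hypothetical.

```python
from datetime import timedelta


def effective_token_lifetime(requested: timedelta, *limits: timedelta) -> timedelta:
    """Simplified model: the token lives no longer than the tightest limit."""
    return min((requested, *limits))


# Hypothetical values: the user asks for 100 hours, but their own
# Authentication Database entry only allows 25 hours.
requested = timedelta(hours=100)
user_entry_max = timedelta(hours=25)
afs_entry_max = timedelta(hours=100)

print(effective_token_lifetime(requested, user_entry_max, afs_entry_max))  # 1 day, 1:00:00
```

Under this model, raising only one limit has no effect while another limit is still lower. That is why a longer lifetime may need changes to more than one entry.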

I hope this helps.
--
cheers
paul http://acm.org/~mpb