[OpenAFS] OpenAFS in a production environment

Christopher D. Clausen cclausen@acm.org
Fri, 12 Aug 2005 19:55:27 -0500


Sean Kelly <smkelly@rooster.creighton.edu> wrote:
>
> There seem to be a lot of reports about OpenAFS causing many Linux
> kernel panics, having issues with newer or older kernel versions,
> random configurations, or just instability in general. Is this
> because OpenAFS really is a very sensitive program/service, or is it
> the standard problem of only hearing from those who are having a
> problem?

IMHO, these are Linux problems and not OpenAFS problems.  There are just
too many different kernel configurations and unusual filesystems to test
on.  For instance, several of the reported problems are likely due to
users who have their AFS cache partition on a non-ext2 filesystem, who
did not allocate the extra ~10% of space that is apparently needed, who
didn't dedicate a cache partition at all, or who have some similar
configuration issue.  You also have to remember that there are going to
be more problems on Linux because more people are trying it.  Large
installations generally figure this out in testing and roll out the
correct configuration to all labs and other supported machines.
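
For reference, a sane Linux disk-cache setup looks something like the
following (the device name and sizes are only illustrative, not our
actual config): the cache directory lives on a dedicated ext2 partition
that is comfortably larger than the size given in cacheinfo.

    # /usr/vice/etc/cacheinfo  (mountpoint:cachedirectory:size-in-1K-blocks)
    /afs:/usr/vice/cache:400000

    # /etc/fstab -- dedicated ext2 partition for the cache
    /dev/hda7  /usr/vice/cache  ext2  defaults  0 2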

I had similar problems when we set up the acm.uiuc.edu cell two years
ago.  Basically, we decided to simply NOT use Linux for the main AFS DB
servers and have been using Solaris (on both x86 and SPARC) ever since.

Our client x86 Linux machines use the -memcache option and I have not
had any reported problems (well, there was some weird issue with an
NVIDIA driver not loading, but a kernel recompile seems to have fixed
that).  I have had some reported problems on other Linux architectures,
such as Alpha and PowerPC.  I haven't yet tracked down whether the cause
is an application that just doesn't like AFS, or some kernel or AFS
client issue.  These problems are the risk of running Linux on non-x86
architectures.  Additionally, I have one machine currently testing the
Debian "sarge" release of the openafs module and client on the 2.6
kernel.  I also have some fileserver-only machines (saiph, bellatrix)
using the 2.4 kernel that appear to be working quite well.
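
On the Linux clients the startup essentially boils down to something
like this (the cache size is just an example, and where you set the
options differs per distro):

    # memory cache: no disk cache partition to get wrong
    afsd -memcache -blocks 65536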

> Does OpenAFS work with RHEL AS 4? RHEL AS 3? HP-UX and Solaris?

We're a Debian shop, so I cannot comment on RHEL.  I just acquired two
HP B2000 workstations that I intend to use to test OpenAFS under HP-UX
11i, but I have not yet had the time.  Solaris works great (we've used
versions 8, 9, and 10), aside from the problem I posted about the panics
caused by the wrong kernel module.  I will mention that you will most
definitely want a licensed copy of the Sun Studio compiler in order to
build OpenAFS.  I'm pretty sure that gcc still doesn't work, and it
appears that the Makefiles are hardcoded to use Sun's cc in certain
places.
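
For the curious, our Solaris builds boil down to roughly this (the
sysname and compiler path are for our setup; adjust for your release and
Sun Studio install location):

    $ CC=/opt/SUNWspro/bin/cc ./configure --with-afs-sysname=sun4x_59
    $ make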

> I'd be interested in hearing an in depth description of how people are
> using AFS both in commercial and educational environments today and
> their success and failure stories. I realize AFS has been around for
> a long time and do have the _Managing AFS_ book, but I'd be curious
> if things have changed in the post-TransArc era.

We are a student organization at the University of Illinois.  We have no
budget and operate almost entirely on donated hardware and sysadmin
time.  As such, the admins (all volunteers) basically decided that NFS
was terrible (especially on Linux) and that we would set up AFS.  AFS
works equally well on the primary OSes that we support: Mac OS, Linux,
and Windows; and quite excellently on Solaris.  NFS from Windows was
terrible back in 2002, and AFS was a very welcome improvement for the
Windows machines (myself being primarily a Windows admin).  And since we
are a student group, there are those who might attempt to sniff network
traffic, access other people's files, or do other evil things.  AFS
provided some layer of protection against this and built on the Kerberos
5 infrastructure that we already had set up.

AFS also works relatively well with the campus-provided Active
Directory (AD.UIUC.EDU).  The Windows boxes are all joined to it (the
Macs were too, until NIS and Kerberos support improved), and I have group
policies set to redirect Desktop and My Documents into the user's AFS
volume at login, provided that the user has manually synced their
password.  Alternately, users can use gssklog to obtain AFS tokens from
their AD tickets.  I already have that set up, but I would advise
against using gssklogd or similar ticket-to-token translators and
instead try to use pure Kerberos 5 for everything.  I have little
control over the Active Directory infrastructure, and as such we
maintain our own Kerberos 5 realm.  Ideally, I'd like to eventually
merge everything into the campus-provided environment, but there remain
some issues with extracting keytabs and easily changing service
principals in AD.
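
For reference, the pure Kerberos 5 path on a Unix client is basically
just the following (the realm and cell are ours; the username is mine):

    $ kinit cclausen@ACM.UIUC.EDU
    $ aklog -cell acm.uiuc.edu -k ACM.UIUC.EDU
    $ tokens

The gssklog route is similar, except that you kinit against AD.UIUC.EDU
and let gssklogd do the ticket-to-token translation.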

Most of our users are computer science students who can generally figure
things out on their own, so I don't have to provide much support.  Of
course, with a more general user base, support becomes a serious
concern.  I have set up an AFS-to-WebDAV service using programs
suggested on this list, but it's not up right now because too few people
were actually using it.  This is likely a good solution for most
educational institutions: it provides access for the general user while
still letting those who wish to dive in install OpenAFS on their own
computers, and it makes support easier, since technical people can check
the WebDAV logs to see whether a user is actually connecting, whether
authentication is working, and what errors are occurring.

My users can and do create their own groups to secure their own
directories.  I imagine that few users do this at most sites, but the
ability to do so is great.  The ability to make changes to AFS PTS
groups from almost any platform is a huge benefit.  I'm on Windows
machines all day, and the tools (command line at least; I haven't used
the GUI) for administering it work just as well as on Solaris, Linux, or
Mac OS.  The ability to change who has ownership and admin rights on the
groups themselves is also a benefit.  I used to have to manually add
users to NIS groups; now I can delegate that task back to the users.
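
As an example of the delegation (the user name and path here are made
up):

    $ pts creategroup cclausen:webteam
    $ pts adduser alice cclausen:webteam
    $ fs setacl /afs/acm.uiuc.edu/projects/web cclausen:webteam rlidwk
    $ pts chown cclausen:webteam alice

After the chown, alice owns the group (pts renames it alice:webteam) and
can manage its membership herself.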

The biggest benefit for us is that we can do server re-installs and
upgrades whenever we want and just leave a server down for a few days if
we don't feel like continuing the upgrade at any particular time.  There
is little or no performance degradation, and few if any users even
notice that a server is down.  Users do, however, notice when volumes
are being moved between servers.  Being on only 100 Mbit networks limits
the available bandwidth, and not having a large number of servers means
that there is disk I/O contention on both the source and destination
servers whenever a vos move is taking place.  Again, this is a solvable
problem, either by adding more servers and load-balancing better or by
doing some more in-depth performance tuning on both the client and
server side.
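
A typical move looks like this (the volume name is hypothetical; saiph
and bellatrix are two of our fileservers):

    $ vos move home.someuser saiph /vicepa bellatrix /vicepa

The volume stays online for the duration of the move, which is why the
slowdown from I/O contention is usually the only thing users see.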

I just reinstalled one of our main AFS servers last night.  (I'm daring
and decided to roll 1.3.87 into production.  You likely do not want to
do this without more testing than I did.)  It took several hours, as
there was some weird md problem that caused the machine to kernel panic
repeatedly.  During this time our cell remained up, users continued to
work, and I continued to stream music right out of AFS.  Normally, most
places our size have a single server with everything on it, and when
it's down, it's down.  But this brings me to some of the disadvantages
of AFS.

AFS requires a fairly large number of machines to function properly.
Most sites (or at least this is what I have seen) have at least two
Kerberos KDCs and three main AFS DB servers.  Most sites have additional
fileservers as well, and probably a dedicated web and/or email server.
We used to have all of the services we provide on three Linux machines.
We now have 5 Solaris machines (AFS DB servers and KDCs), 3 Linux
servers (DB, web, email), and an AIX machine handling backups to TSM
(expensive if you don't already have it set up like UIUC does, but it
gives you file-level backups and restores).  Depending on your
environment, you might not want to support such a large number of
machines.
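
Client-side, those three DB servers are just the entries in CellServDB
for the cell; the addresses and hostnames below are placeholders, not
our real ones:

    >acm.uiuc.edu        #ACM@UIUC
    192.0.2.10           #afsdb1.acm.uiuc.edu
    192.0.2.11           #afsdb2.acm.uiuc.edu
    192.0.2.12           #afsdb3.acm.uiuc.edu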

I attempted to have email delivered into AFS.  By almost all accounts,
this is a bad idea and you'll be sorry.  I generally agree that it is a
huge pain and that you'll likely end up maintaining local modifications
to whatever email servers you use in order to get them to work
correctly.  This won't stop me from trying, though.  Mail currently
delivers fine using exim, but I can't find an IMAP server that works
with AFS the way I want it to.

There is more info about ACM and our particular setup at:
https://www-s.acm.uiuc.edu/wiki/space/admin and http://www.acm.uiuc.edu/
The websites, as well as the wiki, run directly out of AFS from
/afs/acm.uiuc.edu/common/wwwroot.  This has worked quite well for us.
We can keep users off of the webserver and let them edit content from
any machine, and the RO replicas add a layer of review before changes to
the main page are released.
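
The release step is what provides that review point.  Roughly (the
server and partition names are placeholders):

    $ vos addsite fs1.acm.uiuc.edu /vicepa www    # one-time: add an RO site
    $ vos release www                             # push reviewed changes to the ROs

Edits land in the RW volume, someone looks them over, and nothing hits
the public site until the vos release.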

Feel free to contact me off-list if you would like to hear more.  And I
spend a good portion of my time in the #openafs IRC channel, as
mentioned on this list.  I can also rant about UIUC-provided
infrastructure that is almost useless to the slightly-above-average user
because of various limitations.

-----

Bah!

I finish writing this and of course our webserver freaks out:
[root@wilbur:/afs/acm/common/wwwroot/docs]# cat sigs.shtml
cat: sigs.shtml: Input/output error

Apparently the www replicas got hosed somehow.  I guess the webserver
heard mention of itself and needed some attention...  A simple

    fs rmm wwwroot; fs mkm wwwroot www -rw; vos release www

fixes the problem for now; I can track down what actually happened, or
simply nuke and recreate the volume replicas, at a later time.  A very
odd occurrence, though.  This is the first weird thing I can think of
that has happened on the webserver in the few years we've had AFS up and
running.

<<CDC
Christopher D. Clausen
ACM@UIUC SysAdmin