[OpenAFS] DB servers separate from fileservers

Mon, 7 Aug 2006 16:29:38 -0500

Esther Filderman <mizmoose@gmail.com> wrote:
> On 8/7/06, John Hascall <john@iastate.edu> wrote:
>
>>
>>   1) Stability.  The uptime of our DB servers is years,
>>      we can only dream of that for our fileservers.
>
> I'm currently running a mix.  My primary KDC and "lowest IP" DB server
> is a non-fileserver machine.   The other two boxes do both.
>
> In addition to uptime, we also have the added stability of being able
> to take down the KDC without interrupting volume access.  This is very
> very nice.

Umm, am I missing something?  One of the major reasons I use AFS is the 
"vos move" command.  And it was my understanding that AFS can handle 
server outages without breaking.  Do you all have different experiences? 
If AFS can't handle a server outage (especially a planned one) there is 
no point in using it.

I patch and reboot all of our AFS servers about once a month to ensure 
that they have the latest operating system patches.  I usually also 
upgrade to the latest 1.4.x release (just installed 1.4.2b3 on a system 
today.)

>>   2) Restart speed.  Waiting for a pile of disk to
>>      fsck to get your DB servers up and running again
>>      is suboptimal.
>
> Again, having one machine as a DB-non-fileserver helps this greatly.
>
> We also run with --fast-restart compiled in. This is a pushme-pullyou.
> Basically all fast-restart does is skip the salvaging.  Now we have
> volumes crapping themselves here and there.  [Thank you, Fortran, you
> %*%()#.  Ahem.]

I also run with fast-restart.  Have not had any reported problems with 
volumes crapping out.  And I generally vos move everything off of a 
fileserver before planned restarts, so there is nothing there for the 
salvager to keep offline.
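
For anyone who wants to script the drain step, here is a minimal sketch that only builds the "vos move" command lines and does not run them. The function name, server names and volume names are made up for illustration; the volume list is assumed to come from somewhere else (for example, "vos listvol" output), and authentication options are left out.

```python
def build_vos_move_commands(volumes, from_server, from_partition,
                            to_server, to_partition):
    """Return one `vos move` argument list per volume.

    Nothing is executed here; the caller decides whether to print the
    commands for review or hand them to subprocess.run().
    """
    commands = []
    for volume in volumes:
        commands.append([
            "vos", "move",
            "-id", volume,
            "-fromserver", from_server,
            "-frompartition", from_partition,
            "-toserver", to_server,
            "-topartition", to_partition,
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical volumes and hosts, purely for illustration.
    for argv in build_vos_move_commands(
            ["user.alice", "proj.web"],
            "fs1.example.edu", "a",
            "fs2.example.edu", "a"):
        print(" ".join(argv))
```

Printing the commands first makes it easy to eyeball the plan before anything is moved.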

> We're starting a routine of monthly salvages for each server to try to
> combat this.

Do salvages touch the volumes themselves, or is it just a partition-level 
thing?  I.e. if I vos move volumes off of the partitions and mkfs them 
monthly, do I still need to worry about salvaging periodically?

>>   3) Load. A busy fileserver on the same machine as your
>>      DB server can slow your whole cell.
>
> Cannot argue with this.

Luckily, load isn't an issue for us yet, but I do see that as a valid 
point for some cells.

>>   4) Simplicity.  When something is amiss with a machine,
>>      the fewer things a machine is doing, the fewer things
>>      to check and the less likely it is the result of
>>      some weird interaction.
>
> This is also why I advocate turning off everything else possible on an
> AFS server.  No AFS client.  Turn off everything you can.    Outside
> of AFS's own ports we have ntp and scp/ssh allowed in & out and that's
> about it.

Oh yes.  I don't run anything else on my AFS servers or KDCs.  I'd hate 
to see a flaw in OpenAFS compromise a KDC and thus I keep them separate. 
Although our (currently non-existent) DR plans might have a KDC and AFS 
server on the same machine, possibly in a Solaris zone.

>> Reasons for joining them would be (in my mind):
>>
>>   1) Cost.  Fewer machines == Less cost
>>      (however, you can easily run the DB servers
>>       low-cost, even hand-me-down boxes).
>
> My current DB-non-fileserver box was plucked out of the garbage.  I'm
> serious.

All of our AFS servers were donated to us from various places.

>>   2) Space, power, cooling.  Either you have these or you don't.
>>
>>   3) You got a really small cell, so it doesn't matter.
>
> Arguably I have, well, a mid-sized cell.  I'm supporting a fairly
> small number of frequently active users [maybe 250 on a good day],
> maybe 2000 total real users.  I don't think I've cracked 1T in used
> space yet.  A sizeable chunk of my volumes are stuffed with research
> databases and videos.
>
> Yet I find that the more servers you have the more stable you are.
> The more machines you have the less one machine's impact is felt.
>
> My cell used to be three machines, all DB & fileservers together,
> about 300G in use.  When one machine went down 1/3 of the cell was
> inaccessible. TOTAL MESS.
>
> Now I have 5 machines.  Not as good as I'd like, but still muuuuch
> more stable.

Yes, I've noticed that things are more stable now that we have 5 servers 
instead of 3.  But I think that is actually due to improvements in the 
AFS code, not b/c of the number of machines.

<<CDC
-- 
Christopher D. Clausen
ACM@UIUC SysAdmin