[OpenAFS] DB servers separate from fileservers
Christopher D. Clausen
cclausen@acm.org
Mon, 7 Aug 2006 16:29:38 -0500
Esther Filderman <mizmoose@gmail.com> wrote:
> On 8/7/06, John Hascall <john@iastate.edu> wrote:
>
>>
>> 1) Stability. The uptime of our DB servers is years,
>> we can only dream of that for our fileservers.
>
> I'm currently running a mix. My primary KDC and "lowest IP" DB server
> is a non-fileserver machine. The other two boxes do both.
>
> In addition to uptime, we also have the added stability of being able
> to take down the KDC without interrupting volume access. This is very
> very nice.
Umm, am I missing something? One of the major reasons I use AFS is the
"vos move" command. And it was my understanding that AFS can handle
server outages without breaking. Do you all have different experiences?
If AFS can't handle a server outage (especially a planned one) there is
no point in using it.
I patch and reboot all of our AFS servers about once a month to ensure
that they have the latest operating system patches. I usually also
upgrade to the latest 1.4.x release (just installed 1.4.2b3 on a system
today.)
>> 2) Restart speed. Waiting for a pile of disk to
>> fsck to get your DB servers up and running again
>> is suboptimal.
>
> Again, having one machine as a DB-non-fileserver helps this greatly.
>
> We also run with --fast-restart compiled in. This is a pushme-pullyou.
> Basically all fast-restart does is skip the salvaging. Now we have
> volumes crapping themselves here and there. [Thank you, Fortran, you
> %*%()#. Ahem.]
I also run with fast-restart. Have not had any reported problems with
volumes crapping out. And I generally vos move everything off of a
fileserver before planned restarts, so there is nothing there for the
salvager to keep offline.
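
For anyone who wants to script that drain step, here is a rough sketch of
the idea. It only builds the "vos move" command lines for a list of volumes
and prints them; it does not run anything. The volume names, server names,
and the build_drain_commands helper are made up for illustration, and you
would get the real volume list from something like "vos listvol" on the
server being emptied.

```python
# Sketch only: builds "vos move" argument lists, does not execute them.
# All server, partition, and volume names used here are hypothetical.

def build_drain_commands(volumes, from_server, from_part, to_server, to_part):
    """Return one `vos move` argv list per volume name."""
    commands = []
    for vol in volumes:
        commands.append([
            "vos", "move",
            "-id", vol,
            "-fromserver", from_server,
            "-frompartition", from_part,
            "-toserver", to_server,
            "-topartition", to_part,
        ])
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for argv in build_drain_commands(
        ["user.alice", "proj.web"],
        "fs1.example.org", "a",
        "fs2.example.org", "a",
    ):
        print(" ".join(argv))
```

Printing rather than executing is deliberate: you can eyeball the list
before touching a production fileserver.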
> We're starting a routine of monthly salvages for each server to try to
> combat this.
Do salvages touch the volumes themselves, or is it just a partition-level
thing? I.e. if I vos move volumes off of the partitions and mkfs them
monthly, do I still need to worry about salvaging periodically?
>> 3) Load. A busy fileserver on the same machine as your
>> DB server can slow your whole cell.
>
> Cannot argue with this.
Luckily, load isn't an issue for us yet, but I do see that as a valid
point for some cells.
>> 4) Simplicity. When something is amiss with a machine,
>> the less things a machine is doing, the less things
>> to check and the less likely it is the result of
>> some weird interaction.
>
> This is also why I advocate turning off everything else possible on an
> AFS server. No AFS client. Turn off everything you can. Outside
> of AFS's own ports we have ntp and scp/ssh allowed in & out and that's
> about it.
Oh yes. I don't run anything else on my AFS servers or KDCs. I'd hate
to see a flaw in openafs compromise a KDC and thus I keep them separate.
Although our (currently non-existent) DR plans might have a KDC and AFS
server on the same machine, possibly in a Solaris zone.
>> Reasons for joining them would be (in my mind):
>>
>> 1) Cost. Fewer machines == Less cost
>> (however, you can easily run the DB servers
>> low-cost, even hand-me-down boxes).
>
> My current DB-non-fileserver box was plucked out of the garbage. I'm
> serious.
All of our AFS servers were donated to us from various places.
>> 2) Space, power, cooling. Either you have these or you don't.
>>
>> 3) You got a really small cell, so it doesn't matter.
>
> Arguably I have, well, a mid-sized cell. I'm supporting a fairly
> small number of frequently active users [maybe 250 on a good day],
> maybe 2000 total real users. I don't think I've cracked 1T in used
> space yet. A sizeable chunk of my volumes are stuffed with research
> databases and videos.
>
> Yet I find that the more servers you have the more stable you are.
> The more machines you have the less one machine's impact is felt.
>
> My cell used to be three machines, all DB & fileservers together,
> about 300G in use. When one machine went down 1/3 of the cell was
> inaccessible. TOTAL MESS.
>
> Now I have 5 machines. Not as good as I'd like, but still muuuuch
> more stable.
Yes, I've noticed that things are more stable now that we have 5 servers
instead of 3. But I think that is actually due to improvements in the
AFS code, not b/c of the number of machines.
<<CDC
--
Christopher D. Clausen
ACM@UIUC SysAdmin