[OpenAFS-devel] Re: I watched when crush of one afs-server have as effect crash other afs-server from the same cell.

Mike Polek mike@pictage.com
Fri, 13 May 2005 10:36:32 -0700


We used to experience this kind of thing. I think this was
back around OAFS 1.2.10 or so. One fileserver would spontaneously
go offline, usually due to heavy load. Suddenly, the whole
cell would go down. (At least it appeared that way. Might just
be that all the clients got stuck trying to communicate with
the one server.)

Our solution was to write a little monitor program that runs
afsmonitor to poll the fileserver and make sure it gets
a timely response. If it looks like the fileserver has
gone offline, it uses iptables to block the port, so that
the rest of the cell thinks the machine has gone away,
which the cell can deal with. The monitor program also
maintains a status file which Nagios can poll, so we
get an alert when a server has a problem, and we can
respond to it immediately.
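
For anyone who wants to try the same approach, here is a rough
sketch of such a monitor in Python. It is not our actual program.
The probe command is left as a parameter (any command that exits 0
when the fileserver answers in time would do), and the port
(7000/udp, the usual fileserver port), the status file format and
all function names are assumptions made for illustration.

```python
import subprocess
import time

# Assumption: the fileserver listens on the usual AFS port, 7000/udp.
FILESERVER_PORT = 7000


def probe_ok(cmd, timeout_s):
    """Return True if cmd exits with status 0 within timeout_s seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # No timely answer, or the probe could not be started at all.
        return False
    return result.returncode == 0


def block_rule(port=FILESERVER_PORT):
    """Build an iptables command that drops inbound UDP traffic to port."""
    return ["iptables", "-I", "INPUT", "-p", "udp",
            "--dport", str(port), "-j", "DROP"]


def write_status(path, ok):
    """Write a one-line status file (hypothetical format) for Nagios."""
    with open(path, "w") as f:
        f.write("%s %d\n" % ("OK" if ok else "CRITICAL", int(time.time())))


def check_once(probe_cmd, timeout_s, status_path, run=subprocess.call):
    """Probe once; on failure block the port; always record the status."""
    ok = probe_ok(probe_cmd, timeout_s)
    if not ok:
        run(block_rule())  # needs root when run is subprocess.call
    write_status(status_path, ok)
    return ok
```

Something like check_once(probe_cmd, 30, "/var/run/fs-status") could
then be called from a loop or from cron; the 30-second timeout and
the path are placeholders. A real version would also need to avoid
inserting the same rule on every failed check, and to remove the
rule once the fileserver has been restarted.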

The RH 7.3 OAFS 1.2.13 servers seem very stable.

The RH 9 OAFS 1.2.13 servers still periodically go offline
spontaneously, but the behavior is a little different. They now
return a -1 sequence number to afsmon, so they appear
as though somebody did a bos stop, rather than just
hanging. We have to go in and do a bos restart to get
things going again.

Once the current OAFS line stabilizes and becomes
1.4.X, we plan to move to FC 3 and OAFS 1.4.X, and
I'm hoping the issue will go away. :-)

Take care,
Mike Polek
Pictage, Inc.

Vitaly <cvv@email.zp.ua> wrote:
> 
 > Message: 5
 > Date: Fri, 13 May 2005 10:27:55 +0300 (EEST)
 > From: Vitaly <cvv@email.zp.ua>
 > To: openafs-devel@openafs.org
 > Subject: [OpenAFS-devel] I watched when crush of one afs-server have as
 > effect crash other afs-server from the same cell.
 >
 > I have a two-node AFS cell.
 > I use AFS 1.2.13 on Linux 2.4.xx.
 > Yesterday I observed the following: after one server crashed because of
 > a hardware problem, the other server crashed completely at the same time,
 > without any message in the logs, on the screen, etc.

 > This node is both a server and a client. After the crash the client worked
 > normally, but I could not find a single server service running. All had died.

 > Is anyone interested in this situation?