[OpenAFS] fileserver on etch may crash because ulimit -s 8192

Thu, 4 Oct 2007 01:11:22 +0100

On Wed, Oct 03, 2007 at 10:58:07AM -0700, Russ Allbery wrote:
> Jose Calhariz <jose.calhariz@tagus.ist.utl.pt> writes:
>
> > When server bar started with problems I initiated to move online the
> > volumes from server bar to server foo.  Before the finish of the move I
> > had to stop server bar to do a fsck on /vicepa that failed.
>
> Why did you have to stop the server?  Why did you have to do an
> fsck?

I got an error message from reiserfs on Thursday night.  The corruption
then grew, and more and more volumes went offline, so I stopped the
server to run an fsck.  That is when the fsck failed.  I didn't stop
the fileserver on Friday because it was during production hours.  That
may have been my fatal mistake.

Just to be clear, you are asking here about the bar server, the one on
which I lost the /vicepa partition.  It is the foo server that didn't
restart, maybe because of ulimit -s 8192.
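
To see which stack limit a process actually inherits, something like the
following could be used.  This is only a sketch: it relies on Python's
standard `resource` module, the helper name `stack_limit_kb` is made up
for illustration, and it reports the limit of the process that runs it,
not of an already running fileserver.

```python
import resource


def stack_limit_kb():
    """Return the soft stack limit in kB, or None when it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft == resource.RLIM_INFINITY:
        return None
    # getrlimit reports bytes; "ulimit -s" in bash reports 1024-byte units.
    return soft // 1024


limit = stack_limit_kb()
print("stack limit:", "unlimited" if limit is None else f"{limit} kB")
```

Run from the same shell that starts the fileserver, this should agree
with what "ulimit -s" prints there.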

>
> > On the morning of Sunday I found the two fileservers down, my extra 3
> > DB/Mail servers with problems because the mail server had started to
> > many process, and last but not least my backup server couldn't start
> > because of the afs client.  It was stopping on the launch of afsd.
>
> You probably want to be using -dynroot on your AFS clients, although if
> your VLDB servers were down, that may not be enough; I don't remember.
> Anyway, that would just let afsd start; it wouldn't help with not being
> able to see your cell.

Some clients didn't have problems.  I checked another similar client
and it worked.  My VLDB servers are also the file servers for root.cell
and root.afs.  Maybe the clients were contacting different RO replicas
of root.cell and root.afs: one of them was still good, the other was
unresponsive.

I may be wrong, but I need to use my root.afs: I need a link in /afs
as a shortcut for my cell name.  So I can't use -dynroot on some
clients.  Correct me if I am wrong.

>
> > On foo server I didn't find any good message error message on
> > /var/log/openafs behind the Salvage that finished with success.
>
> The information about why the file server died should be in the last lines
> of FileLog.old unless it died with a signal, such as a segfault.  If it
> did the latter, that information will be in the BosLog.  What does BosLog
> say?  Do you have a core file?  If it died with a signal (like BUS or
> SEGV), you should have a core file; check /var/log/openafs.  If you don't
> have a core file, investigate whether you accidentally started the file
> server with core dumps disabled.

I am going from memory, as I didn't save the log files.  I saw exit
messages with various numbers: 0, 1 and maybe 15.  There was no core
file.  How do I enable core files?
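
As a general note, not specific to OpenAFS: on Linux the core file size
is a per-process resource limit that child processes inherit, so it has
to be raised in whatever process starts the server.  A minimal sketch
with Python's standard `resource` module follows; the helper name
`enable_core_dumps` is hypothetical, and how the Debian init script
sets its limits is not covered here.

```python
import resource


def enable_core_dumps():
    """Raise the soft core-size limit to the hard limit; return the new soft limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)[0]


new_soft = enable_core_dumps()
if new_soft == 0:
    print("hard limit is 0, so core dumps stay disabled")
else:
    print("core dumps enabled for this process and its children")
```

The change applies only to the calling process and the children it
starts afterwards.  If the hard limit is 0, the soft limit cannot be
raised this way.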

>
> > Whenever I had done /etc/init.d/openafs-fileserver stop and start the
> > foo server went into Salvage and in the end I couldn't get a "vos
> > listvol foo".
>
> Why couldn't you?  What error message did you get?  What did the salvage
> log say?  What did the FileLog say after salvaging finished?
>

vos listvol returned a communication error, which I think is normal
when the fileserver is down.

The salvage log was what I would expect after an unclean shutdown of
the machine.  From the log files I only remember process exit messages,
but not the exit number or the exact program.  It could have been the
VL program crashing instead of the fileserver.  I am going to restart
the fileserver to check whether this condition still exists.

> > Restarting the 3 extra DB/Mail servers solved problems with the backup
> > server.
>=20
> If your VLDB servers were all down, that may be why your file server
> couldn't start.

I believe the VLDB servers started to have problems after my foo server
went down: that server took down all the remaining Maildirs used for
mail delivery.  So the mail servers running on the last 3 VLDB machines
caused a DoS on those machines.  The other 2 VLDB servers were on the
stopped fileservers.
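
To make the numbers concrete, here is a small sketch of the majority
arithmetic for 5 database servers.  It assumes a plain strict-majority
rule and ignores any tie-breaking the database servers apply, which
should not matter for an odd number of servers; the function name
`has_quorum` is made up for illustration.

```python
def has_quorum(total_servers, reachable_servers):
    """True when strictly more than half of the database servers are reachable."""
    return 2 * reachable_servers > total_servers


# With 5 servers configured, show which counts of healthy servers suffice.
for up in range(6):
    state = "quorum" if has_quorum(5, up) else "no quorum"
    print(f"{up} of 5 reachable: {state}")
```

Under that assumption, 2 stopped servers leave exactly 3, so losing any
one of the 3 overloaded machines would be enough to lose quorum.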

>
> > After this I tried a hint I found in the Internet, someone with the same
> > problem like I had with foo server, said ulimit -s 8192 was not enough
> > and would bug report to Debian.  So I have done an "ulimit -s unlimited"
> > on shell and started one more time the fileserver.  This time after a
> > successful salvage I had the volumes online.
>
> You sure that wasn't because you fixed your VLDB server?

I want to test it again.  I couldn't debug the situation properly: too
many problems at the same time and too few hours of sleep :-(

>
> > I didn't found any bug report on BTS or on the changelog about this
> > issue.  So I am asking here.  As more people could had this same issue
> > on other Linux distributions or Unix.
>
> > Maybe my problem was the 3 extra DB server with problems, as I didn't
> > had enough DB servers for quorum, I had maybe 1 or 2 DB servers out of
> > 5.
>
> This seems more likely.  At least so far, I'm not seeing much indication
> that the stack size limitation was related to your problem or its
> resolution, and as mentioned, we're running ten servers with around 15,000
> AFS clients on Debian etch without needing to change any process limits
> from the defaults.
>

I too have another cell with more volumes per fileserver, without
problems.

I am going to restart the foo server.  What do I need to check if it
fails, besides the log files in /var/log/openafs and the existence of
a core file?

--
P.S. The sig below is from my random sig-generator, which strangely
often seems to pick signatures which are appropriate to the message at
hand!
--

A man is not necessarily intelligent because he has good ideas, just as
he is not a good general because he has many soldiers

--Nicolas Chamfort
