[OpenAFS] fileserver on etch may crash because ulimit -s 8192

Jose Calhariz jose.calhariz@tagus.ist.utl.pt
Thu, 4 Oct 2007 03:19:42 +0100

On Wed, Oct 03, 2007 at 05:43:06PM -0700, Russ Allbery wrote:
> Jose Calhariz <jose.calhariz@tagus.ist.utl.pt> writes:
> > I had an error message from reiserfs on Thursday night.  But the
> > corruption grew, and more and more volumes were going offline.
> > So I stopped it to do an fsck.  That's when the fsck failed.  I didn't
> > stop the fileserver on Friday because it was production hours.  Maybe
> > that was my fatal mistake.
> Ah, okay.
> I definitely recommend against using ReiserFS for any production purposes
> (completely apart from whether you use AFS or not).

I don't know what happened.  I have only two leads: one I/O error
message from reiserfs at the very beginning, and, after the loss,
some strange behavior from the hardware RAID5.  I need to
investigate further.

Most importantly, I learned that I don't know enough about reiserfs
internals, so I really don't understand the error messages from
reiserfsck.  I will move to ext3, which I know very well, or to XFS,
since I have a local expert who can help in case of trouble with XFS.

I remember seeing an online presentation from an AFS workshop where XFS
was considered better than ext3 for /vicep partitions.

> > I may be wrong, but I need to use my root.afs.  I need a link in /afs as
> > a shortcut for my cellname.  So I can't use -dynroot on some clients.
> > Correct me if I am wrong.
> This is what the CellAlias configuration file is for.  It's hard to tell
> exactly why the client didn't work; it doesn't sound like you have much
> information about what failed or what could have been happening.

Thank you.  I didn't know about that file.
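
For readers who have not met that file either, here is a minimal sketch
of what a CellAlias entry could look like.  The cell name "example.org"
and the alias "example" are hypothetical, and the one-pair-per-line
format (real cell name, then alias) is stated from memory of the
CellAlias man page, so check it against your OpenAFS version.  The
script writes to a scratch directory by default, not to the live client
configuration directory (which I assume is /etc/openafs on Debian).

```shell
#!/bin/sh
# Sketch only: write a CellAlias file with one alias.
# "example.org" and "example" are hypothetical placeholders.
# Assumed format: "<real cell name> <alias>", one pair per line.

# Use a scratch directory unless CONF_DIR is set explicitly, so the
# live client configuration is never touched by accident.
conf_dir="${CONF_DIR:-$(mktemp -d)}"

cat > "$conf_dir/CellAlias" <<'EOF'
example.org example
EOF

echo "wrote $conf_dir/CellAlias"
```

With a client running with -dynroot, the intent is that /afs/example
then appears as a shortcut for /afs/example.org once the client has
re-read its configuration.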

> > I am speaking from memory, as I didn't save the log files.  I saw
> > messages of exit with various numbers: 0, 1 and maybe 15.  No core file.
> > How do I enable core files?
> Make sure that you don't have core limit size limited when you start the
> file server and they should happen automatically if the file server
> actually fails.

OK, I have "ulimit -c 0" by default.  I haven't depended on core files
for so many years that I forgot about it.  These days I am a sysadmin,
not a programmer.  I only program in bash, and I install gdb for other
people to use, not for myself :-)
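
A minimal sketch of the check described above, assuming the limit is
set in whatever shell or init script starts the file server (where
exactly that happens on a given system is an assumption to verify).
The helper name core_limit_status is made up for this example.

```shell
#!/bin/sh
# Sketch only: report whether core files are suppressed, then try to
# lift the limit for this shell and the processes it starts.

core_limit_status() {
    # $1: a value as printed by `ulimit -c` (a number or "unlimited").
    # Prints "disabled" when the limit is 0, "enabled" otherwise.
    case "$1" in
        0) echo "disabled" ;;
        *) echo "enabled" ;;
    esac
}

echo "core files: $(core_limit_status "$(ulimit -c)")"

# Raising the limit fails if the hard limit is lower; report that
# instead of aborting.
ulimit -c unlimited 2>/dev/null || echo "could not raise core limit" >&2
```

The limit is inherited by child processes, so it only helps if it is
raised before the server process is started.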

> But if you don't have any exit status other than 0, 1,
> and 15, the file server isn't failing.  Which again raises the question of
> what the problem actually is.
> If the file server is not exiting with any status other than those three,
> I'm 99% certain that the stack limit is not an issue for you.  What I
> would expect, were it to run into a stack limit, would be a bus error or
> segfault.

I have restarted my fileserver.  No problem this time with "ulimit -s
8192", so I think you are right.  My last 3 VLDB servers were also in
trouble that day and were causing more problems everywhere.  The
salvage was taking 40 minutes, so I had time to solve the other
problems before I put all my effort into the last one: the failing
file server.
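
To tell a normal exit from a crash when a process is started from a
shell, one can look at the status the shell reports.  The sketch below
relies on the usual shell convention that a status above 128 means
"killed by signal (status - 128)"; the function name describe_status is
made up for this example, and statuses recorded by other tools (such as
a server's own log) may be encoded differently.

```shell
#!/bin/sh
# Sketch only: interpret a status as reported by the shell in $?.
# Convention assumed: status > 128 means death by signal (status - 128).

describe_status() {
    # $1: numeric exit status taken from $?
    if [ "$1" -gt 128 ]; then
        echo "killed by signal $(( $1 - 128 ))"
    else
        echo "exited with status $1"
    fi
}

describe_status 0
describe_status 1
describe_status 139
```

On Linux, signal 11 is SIGSEGV, so a status of 139 would point to a
segfault, which is the kind of failure one would expect from hitting
the stack limit.  Plain statuses of 0 or 1 are ordinary exits.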

Thank you for your help on this issue.

P.S. [En_US] The sig below is from my random sig-generator, which strangely
often seems to pick signatures which are appropriate to the message!

P.S. [Pt_Pt] A assinatura em baixo é do gerador aleatório de
assinaturas, que estranhamente, escolhe com frequência assinaturas que
parecem apropriadas ao email!

The advantage of being a millionaire is being able to say what you want,
to whom you want, and how you want

--Prince Johannes von Thurn und
