[OpenAFS] fileserver on etch may crash because ulimit -s 8192

Russ Allbery rra@stanford.edu
Wed, 03 Oct 2007 17:43:06 -0700


Jose Calhariz <jose.calhariz@tagus.ist.utl.pt> writes:

> I had an error message from reiserfs on Thursday night.  But the
> corruption got worse, and more and more volumes were going offline.
> So I stopped it to do an fsck.  That's when the fsck failed.  I didn't
> stop the fileserver on Friday because it was production hours.  Maybe
> that was my fatal mistake.

Ah, okay.

I definitely recommend against using ReiserFS for any production purposes
(completely apart from whether you use AFS or not).

> I can be wrong, but I need to use my root.afs.  I need a link on /afs as
> a shortcut for my cellname.  So I can't use -dynroot on some clients.
> Correct me if I am wrong.

This is what the CellAlias configuration file is for.  It's hard to tell
exactly why the client didn't work; it doesn't sound like you have much
information about what failed or what could have been happening.
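
For illustration, a CellAlias entry is a single line giving the real cell
name followed by the alias.  The cell name example.org and the alias
example below are placeholders, not your cell, and I'm assuming the file
sits in the client configuration directory alongside CellServDB:

```
example.org    example
```

With an entry like that and -dynroot enabled, the client should present
/afs/example as a shortcut for /afs/example.org once it has been
restarted.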

> I am going from memory, as I didn't save the log files.  I saw
> messages of exit with various numbers, 0, 1 and maybe 15.  No core file;
> how do I enable core files?

Make sure that the core file size limit isn't restricted when you start
the file server, and core files should be produced automatically if the
file server actually crashes.  But if you don't have any exit status other
than 0, 1, and 15, the file server isn't failing.  Which again raises the
question of what the problem actually is.
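
As a rough illustration of the core limit point: the usual way to lift the
limit is "ulimit -c unlimited" in the shell or init script that starts the
server.  The same idea can be sketched in Python with the standard
resource module.  The function name raise_core_limit is made up for this
example, and it assumes a Unix system.  The limit is inherited by child
processes, so it has to be raised in whatever process ends up starting the
file server.

```python
import resource


def raise_core_limit():
    """Raise the soft core-file size limit to the hard limit.

    Processes started afterwards by this one inherit the new limit.
    Returns the (soft, hard) pair now in effect.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit only as far as the
    # hard limit, so use the hard limit as the new soft value.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    soft, hard = raise_core_limit()
    if soft == resource.RLIM_INFINITY:
        print("core file size limit: unlimited")
    elif soft == 0:
        print("core files are disabled (hard limit is 0)")
    else:
        print(f"core file size limit: {soft} bytes")
```

If this reports that core files are disabled, the hard limit itself is
zero and has to be raised by root before the server is started.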

If the file server is not exiting with any status other than those three,
I'm 99% certain that the stack limit is not an issue for you.  What I
would expect, were it to run into a stack limit, would be a bus error or
segfault.
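
To make the distinction between an ordinary exit and a crash concrete,
here is a small sketch in Python.  It is not anything OpenAFS provides;
describe_returncode is a hypothetical helper, and the child processes are
stand-ins for a server.  It relies on the documented behavior of the
subprocess module on Unix, where a child killed by signal N is reported
with return code -N.

```python
import signal
import subprocess
import sys


def describe_returncode(returncode):
    """Explain a subprocess return code: normal exit or death by signal."""
    if returncode >= 0:
        return f"exited normally with status {returncode}"
    # subprocess reports death by signal N as the negative value -N.
    sig = signal.Signals(-returncode)
    return f"killed by signal {sig.name}"


if __name__ == "__main__":
    # A child that exits with status 1: an ordinary exit, not a crash.
    normal = subprocess.run([sys.executable, "-c", "import sys; sys.exit(1)"])
    print(describe_returncode(normal.returncode))

    # A child that sends itself SIGSEGV, standing in for a real crash.
    crashed = subprocess.run(
        [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
    )
    print(describe_returncode(crashed.returncode))
```

A process that hit its stack limit would show up as the second kind,
killed by SIGSEGV or SIGBUS, rather than as an exit with a small status
number.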

> vos listvol returned a communication error, which I think is normal
> when the fileserver is down.

When the volserver is down, yes.  But once the salvage is finished, the
volserver should be started again.

> I want to test it again.  I couldn't do a proper debug of the situation.
> Too many problems at the same time and too few hours of sleep :-(

Yeah, I understand there.  It's hard in those situations to know what
debugging information to get.

> I am going to do a restart of the foo server.  What do I need to check
> if it fails, besides the log files in /var/log/openafs and the existence
> of a core file?

If you have all the log files and can put them somewhere where others can
look at them, that would be very helpful.  Be more cautious about the core
file, since it will contain your cell key.  But it sounds like the file
server may not be failing.

The log files will probably be the most useful here.

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>