[OpenAFS] fileserver on etch may crash because ulimit -s 8192

Russ Allbery rra@stanford.edu
Wed, 03 Oct 2007 10:58:07 -0700

Jose Calhariz <jose.calhariz@tagus.ist.utl.pt> writes:

> When server bar started having problems, I began moving the volumes
> online from server bar to server foo.  Before the move finished, I had
> to stop server bar to do an fsck on /vicepa, which failed.

Why did you have to stop the server?  Why did you have to do an fsck?

> On Sunday morning I found the two fileservers down, my 3 extra DB/Mail
> servers with problems because the mail server had started too many
> processes, and last but not least my backup server couldn't start
> because of the AFS client.  It was stopping on the launch of afsd.

You probably want to be using -dynroot on your AFS clients, although if
your VLDB servers were down, that may not be enough; I don't remember.
Anyway, that would just let afsd start; it wouldn't help with not being
able to see your cell.
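On Debian, -dynroot is normally enabled through the client configuration rather than by editing the init script. A sketch, assuming the variable names used by Debian's openafs-client packaging in /etc/openafs/afs.conf.client (check your installed file; the exact variables may differ by release):

```shell
# /etc/openafs/afs.conf.client (assumed Debian openafs-client layout)
AFS_CLIENT=true
AFS_AFSDB=true
AFS_DYNROOT=true     # build /afs from CellServDB so afsd can start without reaching the cell
AFS_FAKESTAT=true
```

With dynroot, afsd synthesizes the /afs root locally, which is why it lets the client start even when no fileserver answers; it still can't make your cell's volumes reachable.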

> On the foo server I didn't find any good error message in
> /var/log/openafs beyond the salvage, which finished successfully.

The information about why the file server died should be in the last lines
of FileLog.old unless it died with a signal, such as a segfault.  If it
did the latter, that information will be in the BosLog.  What does BosLog
say?  Do you have a core file?  If it died with a signal (like BUS or
SEGV), you should have a core file; check /var/log/openafs.  If you don't
have a core file, investigate whether you accidentally started the file
server with core dumps disabled.
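A quick way to check both possibilities from the shell that starts the server (a sketch; the core-file location follows the /var/log/openafs convention above):

```shell
# Look for a core file left behind by the file server
ls -l /var/log/openafs/core* 2>/dev/null || true

# Check whether core dumps were disabled when bosserver was started
ulimit -c            # prints 0 if core dumps are disabled
ulimit -c unlimited  # enable them (up to the hard limit) before restarting bosserver
```

Note that the limit that matters is the one in effect in the process that launched bosserver, since child daemons inherit it.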

> Whenever I did /etc/init.d/openafs-fileserver stop and start, the foo
> server went into salvage, and in the end I couldn't get a "vos listvol
> foo".

Why couldn't you?  What error message did you get?  What did the salvage
log say?  What did the FileLog say after salvaging finished?

> Restarting the 3 extra DB/Mail servers solved problems with the backup
> server.

If your VLDB servers were all down, that may be why your file server
couldn't start.

> After this I tried a hint I found on the Internet: someone with the
> same problem I had with the foo server said that ulimit -s 8192 was not
> enough and that he would file a bug report with Debian.  So I did an
> "ulimit -s unlimited" in the shell and started the fileserver one more
> time.  This time, after a successful salvage, I had the volumes online.

You sure that wasn't because you fixed your VLDB server?
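If you do want to rule the stack limit in or out, it's easy to inspect and change in the shell that launches the server (a sketch; 8192 KB is the etch default from the subject line):

```shell
ulimit -s                  # soft stack limit in KB; 8192 on a default etch install
ulimit -s unlimited        # raise it (within the hard limit), as the hint suggested
# /etc/init.d/openafs-fileserver restart    # the daemon inherits the shell's limits
```

Running the restart from a shell with the changed limit is the whole trick; the limit is per-process and inherited, not system-wide, so a setting made in one login session doesn't affect a server started from another.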

> I didn't find any bug report in the BTS or in the changelog about this
> issue, so I am asking here, as more people could have this same issue
> on other Linux distributions or Unixes.

> Maybe my problem was the 3 extra DB servers with problems: as I didn't
> have enough DB servers for quorum, I had maybe 1 or 2 DB servers out of
> 5.

This seems more likely.  At least so far, I'm not seeing much indication
that the stack size limitation was related to your problem or its
resolution, and as mentioned, we're running ten servers with around 15,000
AFS clients on Debian etch without needing to change any process limits
from the defaults.
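For reference, Ubik needs a strict majority of the database servers up before it can elect a sync site, so the arithmetic for a five-server cell looks like this (a sketch; the db1.example.com hostname is hypothetical):

```shell
# Majority needed for Ubik quorum with N database servers
n=5
quorum=$(( n / 2 + 1 ))
echo "need $quorum of $n DB servers up"   # need 3 of 5 DB servers up

# With only 1 or 2 of 5 up, no sync site can be elected and VLDB writes
# fail.  udebug shows each server's view of the election (vlserver port 7003):
# udebug db1.example.com 7003 | grep -i sync
```

That would explain both the vos failures and the file server trouble without any stack limit being involved.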

Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>