[OpenAFS] Can't get this going on Coraid CLN22 (Debian).

Tony Shadwick tshadwick@oss-solutions.com
Thu, 29 Mar 2007 20:00:49 -0500


I won't call it "fixed", but with much help from the guys in #openafs, 
we did get things working.

The problem appears to be one of the ulimit settings:

nas1:~# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
max nice                        (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) unlimited
max rt priority                 (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

The stack size is set to 8192 kbytes.  Once we changed that to unlimited 
(ulimit -s unlimited), things started working.

Ed, if you see this...any thoughts on what might cause this?

I've been told to file a bug report with openafs-bugs, and another with 
Debian against the package, since the /etc/init.d/openafs-fileserver 
script has to be modified to run ulimit -s unlimited at each startup; 
the setting only applies per session.  Speculation as to the cause is 
welcome.
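
For anyone hitting the same thing, here is a rough sketch of the kind of 
change I mean.  It is not taken from the packaged Debian script: the 
function name raise_stack_limit is made up for illustration, and exactly 
where it belongs in the init script is an assumption on my part.  The 
idea is only that the limit has to be raised in the same shell that 
launches the server, before the server is started.

```shell
#!/bin/sh
# Hypothetical helper for an init script (name is illustrative only).
# ulimit is a shell builtin: the new soft limit applies to this shell
# and is inherited by any process the shell starts afterwards.
raise_stack_limit() {
    if ulimit -s unlimited 2>/dev/null; then
        echo "stack limit now: $(ulimit -s)"
    else
        # Raising the soft limit fails if the hard limit is lower.
        echo "could not raise stack limit; still $(ulimit -s)" >&2
        return 1
    fi
}

# Call this before whatever command the script uses to launch the server.
raise_stack_limit || echo "continuing with the existing stack limit" >&2
```

Running ulimit -s by hand in a login shell does not help a server that 
is started later from init, which is why the change has to live in the 
startup script itself.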

Please don't think of this as a small thing.  I've spent well over 40 
hours, with the help of several people, tracking this down!

Tony Shadwick
OSS Solutions

Tony Shadwick wrote:
> I've been bouncing in and out of #OpenAFS for the last week trying to 
> get this working, and I've been working with Coraid support and all to 
> no avail.  It appears something is up with pthreads, but Coraid support 
> ran a test and pthreads work in the kernel.  Rather than copy and paste 
> the whole long deal, here's the page I have on my site with all of the 
> info:
> 
> http://www.numbski.com/hacks/coraid/openafs-on-cln22.html
> 
> In that log you'll see I've tried using both afs-newcell and the script 
> found at Debian World.
> 
> Here are the logs without and with fileserver -d 99 turned on (I know, 
> bad loglevel, didn't know until afterwards though):
> 
> nas1:/var/log/openafs# cat /var/log/openafs/FileLog
> Thu Mar 29 13:52:06 2007 File server starting
> Thu Mar 29 13:52:06 2007 afs_krb_get_lrealm failed, using oss-solutions.com.
> Thu Mar 29 13:52:06 2007 Set thread id 14 for FSYNC_sync
> Thu Mar 29 13:52:06 2007 Partition /vicepa: attaching volumes
> Thu Mar 29 13:52:06 2007 Partition /vicepa: attached 0 volumes; 0 volumes not attached
> Thu Mar 29 13:52:06 2007
> : Assertion failed! file ../viced/viced.c, line 1956.
> 
> 
> and with logging turned up:
> 
> nas1:/var/log/openafs# cat FileLog
> Thu Mar 29 14:03:02 2007 File server starting
> Thu Mar 29 14:03:02 2007 afs_krb_get_lrealm failed, using oss-solutions.com.
> Thu Mar 29 14:03:02 2007 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
> Thu Mar 29 14:03:02 2007 Set thread id 14 for FSYNC_sync
> Thu Mar 29 14:03:02 2007 Partition /vicepa: attaching volumes
> Thu Mar 29 14:03:02 2007 Partition /vicepa: attached 0 volumes; 0 volumes not attached
> Thu Mar 29 14:03:02 2007 Starting pthreads
> Thu Mar 29 14:03:02 2007 Starting five minute check process
> Thu Mar 29 14:03:02 2007 Set thread id 15 for 'FiveMinuteCheckLWP'
> Thu Mar 29 14:03:02 2007
> : Assertion failed! file ../viced/viced.c, line 1958.
> 
> The code in question:
> 
> 1954    assert(pthread_create
> 1955           (&serverPid, &tattr, (void *)FiveMinuteCheckLWP,
> 1956            &fiveminutes) == 0);
> 1957    assert(pthread_create
> 1958           (&serverPid, &tattr, (void *)HostCheckLWP, &fiveminutes) == 0);
> 1959    assert(pthread_create
> 1960           (&serverPid, &tattr, (void *)FsyncCheckLWP, &fiveminutes) == 0);
> 1961 #else /* AFS_PTHREAD_ENV */
> 1962    ViceLog(5, ("Starting LWP\n"));
> 1963    assert(LWP_CreateProcess
> 1964           (FiveMinuteCheckLWP, stack * 1024, LWP_MAX_PRIORITY - 2,
> 1965            (void *)&fiveminutes, "FiveMinuteChecks",
> 1966            &serverPid) == LWP_SUCCESS);
> 
> Totally lost, frustrated and confused.  Any devs wish to take pity on me 
> and help?  This is an AMD64 box running Debian.
> 
> Tony Shadwick
> OSS Solutions
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info