[OpenAFS] AFS 1.2.8 fileserver Failing in GetClient()

Douglas E. Engert deengert@anl.gov
Mon, 31 Mar 2003 14:19:10 -0600


After looking at the AFS bug tracker, this looks like an old problem,
1257, which was resolved on 2/3/3.

But the resolution appears to have only added some extra error messages, not
solved the problem. The comments indicate that backing off to 1.2.6 did not solve
the problem, but a patch from lha@stacken.kth.se might have. It is not clear what
the patch is, or whether it is in the current source.

Any ideas on getting around this problem?
Is there any parameter to the fileserver which might help?
  

"Douglas E. Engert" wrote:
> 
> Two separate AFS servers are having the same problem of the fileserver process
> dumping core. They are running AFS 1.2.8 on Solaris 8 (SunOS 5.8). AFS 1.2.8 was
> installed during the week. (It is not clear if the problems started when 1.2.8
> was installed, but they might have.) The third server is not having problems,
> but does not have many volumes.
> 
> Any ideas?
> 
> The core dump of the fileserver shows this trace:
> 
> (gdb) where
> #0  0xff0d9764 in __sigprocmask () from /usr/lib/libthread.so.1
> #1  0xff0ce978 in _resetsig () from /usr/lib/libthread.so.1
> #2  0xff0ce118 in _sigon () from /usr/lib/libthread.so.1
> #3  0xff0d1158 in _thrp_kill () from /usr/lib/libthread.so.1
> #4  0xff14b9dc in raise () from /usr/lib/libc.so.1
> #5  0xff1358fc in abort () from /usr/lib/libc.so.1
> #6  0x0004a008 in AssertionFailed ()
> #7  0x0003f4f4 in GetClient ()
> #8  0x0003858c in GetVolumePackage ()
> #9  0x0002feec in SAFSS_StoreStatus ()
> #10 0x000300fc in SRXAFS_StoreStatus ()
> #11 0x0005e74c in _RXAFS_StoreStatus ()
> #12 0x000630c0 in RXAFS_ExecuteRequest ()
> #13 0x000768d0 in rxi_ServerProc ()
> #14 0x000741e4 in rx_ServerProc ()
> #15 0x00073bb0 in server_entry ()
> 
> This appears to be an assertion failure in the GetClient() routine when it is
> called from GetVolumePackage().
> 
> The BosLog shows this, for example:
> 
> Mon Mar 31 09:25:01 2003: fs:file exited on signal 6 (core dumped)
> Mon Mar 31 09:25:01 2003: fs:vol exited on signal 15
> Mon Mar 31 09:27:39 2003: fs:salv exited with code 0
> 
> and the FileLog shows (not sure if these are related):
> 
> Mon Mar 31 09:00:01 2003 *** Vid=32766, sid=fa117a18, tcon=5a5e08, Tcon=59d708 ***
> Mon Mar 31 09:05:00 2003 *** Vid=32766, sid=f9e25928, tcon=5a8418, Tcon=5aa4d8 ***
> Mon Mar 31 09:05:49 2003 *** Vid=32766, sid=fa117a20, tcon=5a75e8, Tcon=5aaa60 ***
> Mon Mar 31 09:10:00 2003 *** Vid=32766, sid=fa117a3c, tcon=5aa5a0, Tcon=5a5e08 ***
> Mon Mar 31 09:14:26 2003 *** Vid=32766, sid=f9e25944, tcon=5941d8, Tcon=5aba90 ***
> Mon Mar 31 09:16:58 2003 *** Vid=32766, sid=f9e25950, tcon=58c540, Tcon=58faf8 ***
> Mon Mar 31 09:24:26 2003 *** Vid=32766, sid=f9e25944, tcon=5aba90, Tcon=594b40 ***
> 
> A grep of the BosLogs on the two machines shows some regularity to the
> failures, on one of them at least, indicating some timer might be involved.
> (It's always a multiple of 5 minutes, with the dump a second or two after.)
> 
> BosLog.old:Sun Mar 30 04:50:02 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 10:05:01 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 10:40:02 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 11:30:02 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 12:25:01 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 15:30:02 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 15:55:02 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 16:50:01 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 17:25:02 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 19:40:01 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 20:05:01 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 21:40:01 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 22:10:02 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 23:20:01 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Mon Mar 31 01:40:01 2003: fs:file exited on signal 6 (core dumped)
> BosLog:Mon Mar 31 09:25:01 2003: fs:file exited on signal 6 (core dumped)
> BosLog:Mon Mar 31 10:05:01 2003: fs:file exited on signal 6 (core dumped)
> 
> The other machine is not as regular:
> 
> BosLog.old:Sun Mar 30 12:44:56 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 13:20:58 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 14:32:06 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 16:43:22 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 17:58:31 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 18:55:10 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 22:06:32 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Sun Mar 30 23:30:12 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Mon Mar 31 01:01:23 2003: fs:file exited on signal 6 (core dumped)
> BosLog.old:Mon Mar 31 02:03:02 2003: fs:file exited on signal 6 (core dumped)
> BosLog:Mon Mar 31 04:52:52 2003: fs:file exited on signal 6 (core dumped)
> BosLog:Mon Mar 31 06:04:37 2003: fs:file exited on signal 6 (core dumped)
> BosLog:Mon Mar 31 08:20:52 2003: fs:file exited on signal 6 (core dumped)
> BosLog:Mon Mar 31 09:42:03 2003: fs:file exited on signal 6 (core dumped)
> 
> --
> 
>  Douglas E. Engert  <DEEngert@anl.gov>
>  Argonne National Laboratory
>  9700 South Cass Avenue
>  Argonne, Illinois  60439
>  (630) 252-5444
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info

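The timing observation in the quoted message (crashes on one machine landing a
second or two after a multiple of 5 minutes) can be checked mechanically rather
than by eye. What follows is only a rough sketch, not part of AFS: the names
crash_times, offset_from_boundary, all_near_boundary, FIRST_HOST and SECOND_HOST
are made up for illustration, the 5-second slack is an arbitrary assumption, and
the sample lines are copied from the grep output quoted above.

```python
import re
from datetime import datetime

# Timestamp of a "fs:file exited on signal 6" line as it appears in the
# grep output, e.g. "BosLog.old:Sun Mar 30 04:50:02 2003: fs:file exited ..."
STAMP = re.compile(
    r"(\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}): fs:file exited on signal 6"
)


def crash_times(lines):
    """Return the datetime of every fileserver abort found in the lines."""
    times = []
    for line in lines:
        match = STAMP.search(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y"))
    return times


def offset_from_boundary(ts, period=300):
    """Seconds elapsed since the last multiple of `period` seconds in the hour."""
    return (ts.minute * 60 + ts.second) % period


def all_near_boundary(lines, period=300, slack=5):
    """True if every crash falls within `slack` seconds after a boundary."""
    times = crash_times(lines)
    return bool(times) and all(
        offset_from_boundary(t, period) <= slack for t in times
    )


# Hypothetical variable names; the log lines are from the quoted message.
FIRST_HOST = [
    "BosLog.old:Sun Mar 30 04:50:02 2003: fs:file exited on signal 6 (core dumped)",
    "BosLog.old:Sun Mar 30 10:05:01 2003: fs:file exited on signal 6 (core dumped)",
    "BosLog:Mon Mar 31 09:25:01 2003: fs:file exited on signal 6 (core dumped)",
]
SECOND_HOST = [
    "BosLog.old:Sun Mar 30 12:44:56 2003: fs:file exited on signal 6 (core dumped)",
    "BosLog.old:Sun Mar 30 13:20:58 2003: fs:file exited on signal 6 (core dumped)",
    "BosLog.old:Sun Mar 30 14:32:06 2003: fs:file exited on signal 6 (core dumped)",
]

if __name__ == "__main__":
    for name, lines in (("first", FIRST_HOST), ("second", SECOND_HOST)):
        offsets = [offset_from_boundary(t) for t in crash_times(lines)]
        print(name, offsets, all_near_boundary(lines))
```

If the first machine's crashes all sit just past a 5-minute boundary while the
second machine's do not, that would be consistent with something periodic (a
cron job or a client on a timer, perhaps) triggering the failing call on one
server only, but the script by itself cannot say what that something is.
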
-- 

 Douglas E. Engert  <DEEngert@anl.gov>
 Argonne National Laboratory
 9700 South Cass Avenue
 Argonne, Illinois  60439 
 (630) 252-5444