[OpenAFS] Fileserver: frequent crashes

Erwin Broschinski broschi@id.ethz.ch
Fri, 15 Oct 2004 14:55:08 +0200 (MEST)


Hi

We run all but one of our fileservers on Solaris 8 with OpenAFS 1.2.11. The
software was downloaded from the openafs.org website.
For the past few weeks we have been experiencing frequent fileserver crashes,
after months of running without any trouble.

I have taken backtraces of the fileserver cores on two different Solaris
machines and found them to be almost identical. Here is one:

(gdb) thread apply all where

Thread 16 (process 586978    ):
#0  0xff19f474 in ?? ()
#1  0xff0c93ac in ?? ()
#2  0xff0c81b4 in ?? ()
#3  0xff0c8078 in ?? ()
#4  0x0007748c in rx_NewService ()
#5  0x00076c24 in rxi_DestroyConnectionNoLock ()
#6  0x000744f8 in rxi_InitPeerParams ()
#7  0x00073e58 in rx_GetIFInfo ()

Thread 15 (process 521442    ):
#0  0xff19f474 in ?? ()
#1  0xff0c93ac in ?? ()
#2  0xff0c81b4 in ?? ()
#3  0xff0c8078 in ?? ()
#4  0x0007748c in rx_NewService ()
#5  0x00076c24 in rxi_DestroyConnectionNoLock ()
#6  0x000744f8 in rxi_InitPeerParams ()
#7  0x00073e58 in rx_GetIFInfo ()

Thread 14 (process 455906    ):
#0  0xff19f474 in ?? ()
#1  0xff0c93ac in ?? ()
#2  0xff0c81b4 in ?? ()
#3  0xff0c8078 in ?? ()
#4  0x0007748c in rx_NewService ()
#5  0x00076c24 in rxi_DestroyConnectionNoLock ()
#6  0x000744f8 in rxi_InitPeerParams ()
#7  0x00073e58 in rx_GetIFInfo ()

Thread 13 (process 390370    ):
#0  0xff0d9200 in ?? ()

Thread 12 (process 324834    ):
#0  0xff19f474 in ?? ()
#1  0xff0c93ac in ?? ()
#2  0xff0c81b4 in ?? ()
#3  0xff0c8078 in ?? ()
#4  0x0007748c in rx_NewService ()
#5  0x00076c24 in rxi_DestroyConnectionNoLock ()
#6  0x000744f8 in rxi_InitPeerParams ()
#7  0x00073e58 in rx_GetIFInfo ()

Thread 11 (process 259298    ):
#0  0xff19f474 in ?? ()
#1  0xff0c93ac in ?? ()
#2  0xff0c81b4 in ?? ()
#3  0xff0c8078 in ?? ()
#4  0x0007748c in rx_NewService ()
#5  0x00076c24 in rxi_DestroyConnectionNoLock ()
#6  0x000744f8 in rxi_InitPeerParams ()
#7  0x00073e58 in rx_GetIFInfo ()

Thread 10 (process 193762    ):
#0  0xff19edc4 in ?? ()

Thread 9 (process 128226    ):
#0  0xff19c2b4 in ?? ()
#1  0x000859f8 in rxi_AllocDataBuf ()
#2  0x000742ac in rx_GetIFInfo ()
#3  0x0007450c in rxi_InitPeerParams ()
#4  0x00073e58 in rx_GetIFInfo ()

Thread 8 (process 1111266    ):
#0  0xff19c968 in ?? ()
#1  0xff0ca360 in ?? ()

Thread 7 (process 1045730    ):
#0  0xff19f474 in ?? ()
#1  0xff0c93ac in ?? ()
#2  0xff0c81b4 in ?? ()
#3  0xff0c8078 in ?? ()
#4  0x0007748c in rx_NewService ()
#5  0x00076c24 in rxi_DestroyConnectionNoLock ()
#6  0x000743d0 in rxi_InitPeerParams ()

Thread 6 (process 980194    ):
#0  0xff19f474 in ?? ()
#1  0xff0c93ac in ?? ()
#2  0xff0c81b4 in ?? ()
#3  0xff0c8078 in ?? ()
#4  0x0007748c in rx_NewService ()
#5  0x00076c24 in rxi_DestroyConnectionNoLock ()
#6  0x000744f8 in rxi_InitPeerParams ()
#7  0x00073e58 in rx_GetIFInfo ()

Thread 5 (process 914658    ):
#0  0xff19f474 in ?? ()
#1  0xff0c93ac in ?? ()
#2  0xff0c81b4 in ?? ()
#3  0xff0c8078 in ?? ()
#4  0x0007748c in rx_NewService ()
#5  0x00076c24 in rxi_DestroyConnectionNoLock ()
#6  0x000744f8 in rxi_InitPeerParams ()
#7  0x00073e58 in rx_GetIFInfo ()

Thread 4 (process 849122    ):
#0  0xff19d600 in ?? ()
#1  0xff0daa30 in ?? ()

Thread 3 (process 783586    ):
#0  0xff19f474 in ?? ()
#1  0xff0c93ac in ?? ()
#2  0xff0c81b4 in ?? ()
#3  0xff0c8078 in ?? ()
#4  0x0007748c in rx_NewService ()
#5  0x00076c24 in rxi_DestroyConnectionNoLock ()
#6  0x000744f8 in rxi_InitPeerParams ()
#7  0x00073e58 in rx_GetIFInfo ()

Thread 2 (process 718050    ):
#0  0xff19f474 in ?? ()
#1  0xff0c93ac in ?? ()
#2  0xff0c81b4 in ?? ()
#3  0xff0c8078 in ?? ()
#4  0x0007748c in rx_NewService ()
#5  0x00076c24 in rxi_DestroyConnectionNoLock ()
#6  0x000744f8 in rxi_InitPeerParams ()
#7  0x00073e58 in rx_GetIFInfo ()

Thread 1 (process 652514    ):
#0  0xff142bbc in ?? ()
#1  0xff142b74 in ?? ()
#2  0x0007858c in rx_Finalize ()
#3  0x0008ea68 in rxkad_CheckResponse ()
#4  0x00075bbc in rx_Init ()
#5  0x00080480 in rxi_ChallengeEvent ()
#6  0x00099be4 in _RXSTATS_ClearProcessRPCStats ()
#7  0x00074058 in rx_GetIFInfo ()
(gdb) 
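
Since most of the threads above show the same stack, it can help to collapse
the listing before comparing cores from different machines. The following is a
minimal sketch in Python, assuming the gdb output format shown above; the
function name group_threads is hypothetical and not part of gdb or OpenAFS.

```python
import re
from collections import defaultdict

# Matches headers such as "Thread 16 (process 586978    ):"
THREAD_RE = re.compile(r"^Thread (\d+) \(")
# Matches frames such as "#4  0x0007748c in rx_NewService ()"
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-f]+) in (\S+) \(\)")


def group_threads(backtrace_text):
    """Group thread numbers by identical stacks.

    Returns a dict mapping each distinct stack, expressed as a tuple of
    (address, symbol) pairs, to the list of thread numbers that share it.
    """
    per_thread = {}
    current = None
    for line in backtrace_text.splitlines():
        line = line.strip()
        header = THREAD_RE.match(line)
        if header:
            current = int(header.group(1))
            per_thread[current] = []
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            per_thread[current].append((frame.group(2), frame.group(3)))

    groups = defaultdict(list)
    for thread, frames in per_thread.items():
        groups[tuple(frames)].append(thread)
    return dict(groups)
```

Feeding it the saved text of "thread apply all where" from each core gives one
entry per distinct stack, so two cores can be compared group by group rather
than thread by thread.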

We have some *very* frequently accessed volumes. They contain Windows software
for the student labs, e.g.:

#>vos exa ntsw-MiKTeX
ntsw-MiKTeX                       537114642 RW     625551 K  On-line
    nethzafs-004.ethz.ch /vicepa 
    RWrite  537114642 ROnly          0 Backup          0 
    MaxQuota    1000000 K 
    Creation    Wed Oct  6 10:39:00 2004
    Last Update Wed Oct  6 17:20:10 2004
    1630676 accesses in the past day (i.e., vnode references)
    ^^^^^^^
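
To spot other heavily used volumes, the access count can be pulled out of saved
"vos examine" output. This is a rough sketch in Python that assumes the line
format shown above; the function name accesses_past_day is hypothetical.

```python
import re
from typing import Optional

# Matches lines such as
# "    1630676 accesses in the past day (i.e., vnode references)"
ACCESS_RE = re.compile(r"^\s*(\d+) accesses in the past day", re.MULTILINE)


def accesses_past_day(vos_examine_output: str) -> Optional[int]:
    """Return the daily access count from 'vos examine' text, or None if absent."""
    match = ACCESS_RE.search(vos_examine_output)
    return int(match.group(1)) if match else None
```

Running this over the output for every volume on a server and sorting by the
result would show which volumes carry most of the load.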

The clients in the student labs run OpenAFS 1.3.71.

I have moved this volume away from the server that crashed this morning
to a server that otherwise only handles replicas. If that server crashes, only
this one volume will be inaccessible for a while.

In any case, frequent access to a volume should not crash a fileserver, should it?

Is there anything else I can do?


Erwin
                                                         ''`'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~O-O~~~~~~~
Erwin Broschinski               Tel:    +41 1 632 4281
Swiss Fed. Inst. of Technology  Fax:    +41 1 632 1022 
ETH Zentrum CLU B2              E-Mail: broschi@id.ethz.ch
8092 Zurich                     PGP-key:  
Switzerland                     www.tik.ee.ethz.ch/~pgp/Search.html
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

"Ceterum censeo, 'Parvam Mollim' esse delendam."  (nach Cicero)