[OpenAFS] Fileserver: frequent crashes
Erwin Broschinski
broschi@id.ethz.ch
Fri, 15 Oct 2004 14:55:08 +0200 (MEST)
Hi
We run all but one of our fileservers on Solaris 8 with OpenAFS 1.2.11. The
software comes from the openafs.org website.
For the past few weeks we have been seeing frequent fileserver crashes, after
months of trouble-free operation.
I have taken backtraces from the fileserver cores on two different Solaris
machines and found them to be (almost) identical. Here is one:
(gdb) thread apply all where
Thread 16 (process 586978 ):
#0 0xff19f474 in ?? ()
#1 0xff0c93ac in ?? ()
#2 0xff0c81b4 in ?? ()
#3 0xff0c8078 in ?? ()
#4 0x0007748c in rx_NewService ()
#5 0x00076c24 in rxi_DestroyConnectionNoLock ()
#6 0x000744f8 in rxi_InitPeerParams ()
#7 0x00073e58 in rx_GetIFInfo ()
Thread 15 (process 521442 ):
#0 0xff19f474 in ?? ()
#1 0xff0c93ac in ?? ()
#2 0xff0c81b4 in ?? ()
#3 0xff0c8078 in ?? ()
#4 0x0007748c in rx_NewService ()
#5 0x00076c24 in rxi_DestroyConnectionNoLock ()
#6 0x000744f8 in rxi_InitPeerParams ()
#7 0x00073e58 in rx_GetIFInfo ()
Thread 14 (process 455906 ):
#0 0xff19f474 in ?? ()
#1 0xff0c93ac in ?? ()
#2 0xff0c81b4 in ?? ()
#3 0xff0c8078 in ?? ()
#4 0x0007748c in rx_NewService ()
#5 0x00076c24 in rxi_DestroyConnectionNoLock ()
#6 0x000744f8 in rxi_InitPeerParams ()
#7 0x00073e58 in rx_GetIFInfo ()
Thread 13 (process 390370 ):
#0 0xff0d9200 in ?? ()
Thread 12 (process 324834 ):
#0 0xff19f474 in ?? ()
#1 0xff0c93ac in ?? ()
#2 0xff0c81b4 in ?? ()
#3 0xff0c8078 in ?? ()
#4 0x0007748c in rx_NewService ()
#5 0x00076c24 in rxi_DestroyConnectionNoLock ()
#6 0x000744f8 in rxi_InitPeerParams ()
#7 0x00073e58 in rx_GetIFInfo ()
Thread 11 (process 259298 ):
#0 0xff19f474 in ?? ()
#1 0xff0c93ac in ?? ()
#2 0xff0c81b4 in ?? ()
#3 0xff0c8078 in ?? ()
#4 0x0007748c in rx_NewService ()
#5 0x00076c24 in rxi_DestroyConnectionNoLock ()
#6 0x000744f8 in rxi_InitPeerParams ()
#7 0x00073e58 in rx_GetIFInfo ()
Thread 10 (process 193762 ):
#0 0xff19edc4 in ?? ()
Thread 9 (process 128226 ):
#0 0xff19c2b4 in ?? ()
#1 0x000859f8 in rxi_AllocDataBuf ()
#2 0x000742ac in rx_GetIFInfo ()
#3 0x0007450c in rxi_InitPeerParams ()
#4 0x00073e58 in rx_GetIFInfo ()
Thread 8 (process 1111266 ):
#0 0xff19c968 in ?? ()
#1 0xff0ca360 in ?? ()
Thread 7 (process 1045730 ):
#0 0xff19f474 in ?? ()
#1 0xff0c93ac in ?? ()
#2 0xff0c81b4 in ?? ()
#3 0xff0c8078 in ?? ()
#4 0x0007748c in rx_NewService ()
#5 0x00076c24 in rxi_DestroyConnectionNoLock ()
#6 0x000743d0 in rxi_InitPeerParams ()
Thread 6 (process 980194 ):
#0 0xff19f474 in ?? ()
#1 0xff0c93ac in ?? ()
#2 0xff0c81b4 in ?? ()
#3 0xff0c8078 in ?? ()
#4 0x0007748c in rx_NewService ()
#5 0x00076c24 in rxi_DestroyConnectionNoLock ()
#6 0x000744f8 in rxi_InitPeerParams ()
#7 0x00073e58 in rx_GetIFInfo ()
Thread 5 (process 914658 ):
#0 0xff19f474 in ?? ()
#1 0xff0c93ac in ?? ()
#2 0xff0c81b4 in ?? ()
#3 0xff0c8078 in ?? ()
#4 0x0007748c in rx_NewService ()
#5 0x00076c24 in rxi_DestroyConnectionNoLock ()
#6 0x000744f8 in rxi_InitPeerParams ()
#7 0x00073e58 in rx_GetIFInfo ()
Thread 4 (process 849122 ):
#0 0xff19d600 in ?? ()
#1 0xff0daa30 in ?? ()
Thread 3 (process 783586 ):
#0 0xff19f474 in ?? ()
#1 0xff0c93ac in ?? ()
#2 0xff0c81b4 in ?? ()
#3 0xff0c8078 in ?? ()
#4 0x0007748c in rx_NewService ()
#5 0x00076c24 in rxi_DestroyConnectionNoLock ()
#6 0x000744f8 in rxi_InitPeerParams ()
#7 0x00073e58 in rx_GetIFInfo ()
Thread 2 (process 718050 ):
#0 0xff19f474 in ?? ()
#1 0xff0c93ac in ?? ()
#2 0xff0c81b4 in ?? ()
#3 0xff0c8078 in ?? ()
#4 0x0007748c in rx_NewService ()
#5 0x00076c24 in rxi_DestroyConnectionNoLock ()
#6 0x000744f8 in rxi_InitPeerParams ()
#7 0x00073e58 in rx_GetIFInfo ()
Thread 1 (process 652514 ):
#0 0xff142bbc in ?? ()
#1 0xff142b74 in ?? ()
#2 0x0007858c in rx_Finalize ()
#3 0x0008ea68 in rxkad_CheckResponse ()
#4 0x00075bbc in rx_Init ()
#5 0x00080480 in rxi_ChallengeEvent ()
#6 0x00099be4 in _RXSTATS_ClearProcessRPCStats ()
#7 0x00074058 in rx_GetIFInfo ()
(gdb)
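Since most threads in a dump like this show the same stack, it can help to group them before comparing cores from different machines. The following is only a minimal sketch, assuming the output of `thread apply all where` has been saved as text. The function name `group_threads` is hypothetical, and it keys on frame addresses rather than symbol names, on the assumption that the names gdb resolves here may not be reliable.

```python
import re
from collections import defaultdict

# "Thread 16 (process 586978 ):" -> thread id 16
THREAD_RE = re.compile(r"^Thread (\d+) ")
# "#4 0x0007748c in rx_NewService ()" -> address 0x0007748c
FRAME_RE = re.compile(r"^#\d+\s+(0x[0-9a-fA-F]+) in ")


def group_threads(gdb_output):
    """Return {tuple of frame addresses: [thread ids]} for a gdb backtrace dump."""
    groups = defaultdict(list)
    thread_id, frames = None, []
    for raw in gdb_output.splitlines():
        line = raw.strip()
        started = THREAD_RE.match(line)
        if started:
            # A new thread header closes the previous thread's stack.
            if thread_id is not None:
                groups[tuple(frames)].append(thread_id)
            thread_id, frames = int(started.group(1)), []
            continue
        frame = FRAME_RE.match(line)
        if frame and thread_id is not None:
            frames.append(frame.group(1))
    # Record the last thread in the dump.
    if thread_id is not None:
        groups[tuple(frames)].append(thread_id)
    return dict(groups)
```

Applied to the backtrace above, this should put threads 2, 3, 5, 6, 11, 12, 14, 15 and 16 into one group, leaving thread 1 and the few short stacks as the ones worth a closer look.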
We have some *very* frequently accessed volumes. They contain Windows software
for the student labs, e.g.:
#>vos exa ntsw-MiKTeX
ntsw-MiKTeX 537114642 RW 625551 K On-line
nethzafs-004.ethz.ch /vicepa
RWrite 537114642 ROnly 0 Backup 0
MaxQuota 1000000 K
Creation Wed Oct 6 10:39:00 2004
Last Update Wed Oct 6 17:20:10 2004
1630676 accesses in the past day (i.e., vnode references)
^^^^^^^
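To put the access count above in perspective, it can be turned into an average rate. This is a rough sketch, assuming the `vos examine` output has been captured as text. The helper names `accesses_past_day` and `per_second` are hypothetical, and the average says nothing about peak load during lab hours.

```python
import re

# Matches the counter line, e.g. "1630676 accesses in the past day (i.e., vnode references)"
ACCESS_RE = re.compile(r"^\s*(\d+) accesses in the past day")


def accesses_past_day(vos_output):
    """Return the daily access counter from `vos examine` text, or None if absent."""
    for line in vos_output.splitlines():
        match = ACCESS_RE.match(line)
        if match:
            return int(match.group(1))
    return None


def per_second(daily_accesses):
    """Average accesses per second over a 24-hour day."""
    return daily_accesses / 86400
```

For the volume shown, 1630676 accesses in a day works out to roughly 19 vnode references per second on average.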
The clients in the student labs run OpenAFS 1.3.71.
I have moved this volume away from the server that crashed this morning
to a server that only handles replicas. If that server crashes, only this one
volume will be inaccessible for a while.
In any case, frequent access to a volume should not crash a fileserver, should it?
Is there anything else I can do?
Erwin
''`'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~O-O~~~~~~~
Erwin Broschinski Tel: +41 1 632 4281
Swiss Fed. Inst. of Technology Fax: +41 1 632 1022
ETH Zentrum CLU B2 E-Mail: broschi@id.ethz.ch
8092 Zurich PGP-key:
Switzerland www.tik.ee.ethz.ch/~pgp/Search.html
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"Ceterum censeo, 'Parvam Mollim' esse delendam." (after Cicero)