[OpenAFS-devel] issues with crashing fileserver...
Neulinger, Nathan
nneul@umr.edu
Wed, 17 Apr 2002 18:16:18 -0500
I've got a situation with crashing fileservers that is of some
concern. (The crash itself is bad, but what happens afterwards is more
important at the moment.) Basically, the fileserver gets a SEGV at some
point and only partially terminates. Bos thinks it has gone away, so it
tries to restart it. Problem is - a good hunk of the server is still there.
Do y'all think there is any reasonable way to have the server forcibly
terminate itself (all threads/procs/etc.) if a part of it goes away?
Basically I wind up seeing this in the logs:
Apr 17 17:31:07 afs8 bosserver[539]: fs:file exited on signal 11
Apr 17 17:31:07 afs8 bosserver[539]: fs:vol exited on signal 15
Apr 17 17:31:07 afs8 bosserver[539]: fs:salv exited on signal 13
Apr 17 17:32:38 afs8 bosserver[539]: fs:file exited on signal 13
Apr 17 17:32:38 afs8 bosserver[539]: fs:vol exited on signal 15
Apr 17 17:32:38 afs8 bosserver[539]: fs:salv exited on signal 13
Apr 17 17:34:08 afs8 bosserver[539]: fs:file exited on signal 13
Apr 17 17:34:08 afs8 bosserver[539]: fs:vol exited on signal 15
Apr 17 17:34:08 afs8 bosserver[539]: fs:salv exited on signal 13
Apr 17 17:35:38 afs8 bosserver[539]: fs:file exited on signal 13
Apr 17 17:35:38 afs8 bosserver[539]: fs:vol exited on signal 15
Apr 17 17:35:38 afs8 bosserver[539]: fs:salv exited on signal 13
Apr 17 17:37:08 afs8 bosserver[539]: fs:file exited on signal 13
Apr 17 17:37:08 afs8 bosserver[539]: fs:vol exited on signal 15
Apr 17 17:37:08 afs8 bosserver[539]: fs:salv exited on signal 13
Apr 17 17:38:38 afs8 bosserver[539]: fs:file exited on signal 13
Apr 17 17:38:38 afs8 bosserver[539]: fs:vol exited on signal 15
Apr 17 17:38:38 afs8 bosserver[539]: fs:salv exited on signal 13
Apr 17 17:40:08 afs8 bosserver[539]: fs:file exited on signal 13
Apr 17 17:40:08 afs8 bosserver[539]: fs:vol exited on signal 15
Apr 17 17:40:08 afs8 bosserver[539]: fs:salv exited on signal 13
Notice - file dies once with a SEGV (signal 11), and after that bos is
unable to get the server sane again.
Unfortunately, this is on Linux, and there are no core dumps since it's
a threaded process.
-- Nathan
------------------------------------------------------------
Nathan Neulinger EMail: nneul@umr.edu
University of Missouri - Rolla Phone: (573) 341-4841
Computing Services Fax: (573) 341-4216