[OpenAFS-devel] issues with crashing fileserver...

Neulinger, Nathan nneul@umr.edu
Wed, 17 Apr 2002 18:16:18 -0500


I've got a situation with crashing fileservers that concerns me. (The
crash itself is bad, but what happens afterward is more important at
the moment.) Basically, the fileserver gets a SEGV at some point and
only partially terminates. Bos thinks it has gone away, so it tries
to restart it. Problem is, a good chunk of the server is still there.

Do y'all think there is any reasonable way to have the server forcibly
terminate itself (all threads/procs/etc.) if a part of it goes away?

Basically I wind up seeing this in the logs:

Apr 17 17:31:07 afs8 bosserver[539]: fs:file exited on signal 11
Apr 17 17:31:07 afs8 bosserver[539]: fs:vol exited on signal 15
Apr 17 17:31:07 afs8 bosserver[539]: fs:salv exited on signal 13
Apr 17 17:32:38 afs8 bosserver[539]: fs:file exited on signal 13
Apr 17 17:32:38 afs8 bosserver[539]: fs:vol exited on signal 15
Apr 17 17:32:38 afs8 bosserver[539]: fs:salv exited on signal 13
Apr 17 17:34:08 afs8 bosserver[539]: fs:file exited on signal 13
Apr 17 17:34:08 afs8 bosserver[539]: fs:vol exited on signal 15
Apr 17 17:34:08 afs8 bosserver[539]: fs:salv exited on signal 13
Apr 17 17:35:38 afs8 bosserver[539]: fs:file exited on signal 13
Apr 17 17:35:38 afs8 bosserver[539]: fs:vol exited on signal 15
Apr 17 17:35:38 afs8 bosserver[539]: fs:salv exited on signal 13
Apr 17 17:37:08 afs8 bosserver[539]: fs:file exited on signal 13
Apr 17 17:37:08 afs8 bosserver[539]: fs:vol exited on signal 15
Apr 17 17:37:08 afs8 bosserver[539]: fs:salv exited on signal 13
Apr 17 17:38:38 afs8 bosserver[539]: fs:file exited on signal 13
Apr 17 17:38:38 afs8 bosserver[539]: fs:vol exited on signal 15
Apr 17 17:38:38 afs8 bosserver[539]: fs:salv exited on signal 13
Apr 17 17:40:08 afs8 bosserver[539]: fs:file exited on signal 13
Apr 17 17:40:08 afs8 bosserver[539]: fs:vol exited on signal 15
Apr 17 17:40:08 afs8 bosserver[539]: fs:salv exited on signal 13

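Notice - file dies once with a SEGV (signal 11), then bos is unable to
get the server sane again: every restart attempt, roughly 90 seconds
apart, ends with file and salv exiting on signal 13 and vol on signal 15.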

Unfortunately, this is on Linux, and there are no core dumps since it's
a threaded process.

-- Nathan

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-4841
Computing Services                       Fax: (573) 341-4216