[OpenAFS] Bos server failure

Steve Devine sdevine@msu.edu
Tue, 16 Nov 2004 09:46:26 -0500


Jeffrey Hutzelman wrote:

>
>
> On Monday, November 08, 2004 09:49:32 -0500 Steve Devine 
> <sdevine@msu.edu> wrote:
>
>> All,
>> Odd problem here.
>> I have a file server where the bos server has stopped. It still seems to
>> be serving files ok. vlserver is not running either. The last time this
>> happened I restarted bosserver and this started a chain reaction of
>> Salvaging that never actually stopped. The best solution is a reboot but
>> if the server will hold off till tonight I would like to wait until
>> then .. classes are in full swing right now.
>>
>> So here is my question .. can I remove SALVAGE.fs and start bosserver
>> thereby avoiding the salvage routine? Or can I kill the fileserver
>> processes one by one and then restart bosserver?
>>
>> Or is this an invitation to more headaches?
>
>
> Don't do that.  If the bosserver is dead and you try to start a new 
> one, then it will not know about the other servers that are still 
> running, and will attempt to start new ones.  This is why you were 
> getting the "endless salvage" - the fileserver would start up, find 
> out that its port is unavailable, and exit.
>
> The safest course of action is to manually kill off any of the 
> remaining servers that normally run under the bosserver's control, 
> then restart the bosserver (possibly after a reboot).  As long as the 
> other services are still working, it is safe to wait arbitrarily long 
> before doing this.
>
> Note that while most of the servers will shut down cleanly if sent 
> SIGTERM, the fileserver is special and needs to be sent SIGQUIT to 
> make it shut down.
>
> -- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
>   Sr. Research Systems Programmer
>   School of Computer Science - Research Computing Facility
>   Carnegie Mellon University - Pittsburgh, PA
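
For anyone finding this in the archive, the signal rule quoted above (SIGQUIT for the fileserver, SIGTERM for the rest) can be sketched roughly as follows. This is only an illustration: the function names and the name-to-PID mapping are hypothetical, and the PIDs have to be looked up by hand (for example with ps) since the dead bosserver can no longer report them.

```python
import os
import signal


def shutdown_signal(server_name):
    """Pick the signal for a server, following the advice quoted above."""
    # The fileserver is the special case and wants SIGQUIT.
    if server_name == "fileserver":
        return signal.SIGQUIT
    # Everything else is assumed to shut down cleanly on SIGTERM.
    return signal.SIGTERM


def stop_servers(pids_by_name, kill=os.kill):
    """Signal each server in the (name -> pid) mapping; return what was sent.

    `kill` defaults to os.kill and is a parameter only so the logic can be
    dry-run without touching real processes.
    """
    sent = []
    for name, pid in pids_by_name.items():
        sig = shutdown_signal(name)
        kill(pid, sig)
        sent.append((name, pid, sig))
    return sent
```

Passing a do-nothing function as `kill` gives a dry run that just lists which signal each process would receive. The bosserver itself should only be restarted once everything it normally controls has exited.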

Jeffrey,

Thanks for the reply; that is exactly what I did. I waited till the
next morning and rebooted the servers at 5 am. They all had to run the
salvager, but they did come back up. I also upgraded them to version
1.2.13, so now we are in 'wait and see' mode. I also set the startup
script to allow core files, so if they do fail we will have a little
more to work with.
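
The startup script change itself is not shown here. In a shell init script this is usually a `ulimit -c unlimited` line before the server is launched. As a rough sketch of the same idea in Python, assuming a hypothetical wrapper that raises the limit and then starts the server, the core-file limit can be lifted like this (resource limits are inherited by child processes):

```python
import resource


def allow_core_dumps():
    """Raise the soft core-file size limit to the hard limit.

    Any process started afterwards from this one inherits the new limit.
    Returns the (soft, hard) pair now in effect.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

If the hard limit is already 0, this leaves core files disabled, and the limit would have to be raised by root earlier in the boot sequence.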
/sd

-- 
Steve Devine
Storage Systems
Academic Computing & Network Services
Michigan State University

301 Computer Center
East Lansing, MI 48824-1042
1-517-355-4500  (x242)

Baseball is ninety percent mental; the other half is physical.
- Yogi Berra