[OpenAFS] file-server: salvaging

Klaas Hagemann kerberos@northsailor.de
Mon, 27 Jan 2003 15:19:47 +0100


Nathan Neulinger wrote:
> This problem is caused when the fileserver fails for whatever reason,
> and due to pthreads, is unable to completely exit. Since it can't
> completely exit, bosserver can't start a new one. 
> 
> Go in and killall -KILL fileserver. That will clear it up. Doesn't solve
> the problem, but will get you back running without a reboot.

Thanks.
I had got that far already, but the problem still occurs from time to
time, and because we need high availability I have to fix the underlying
cause.
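
To check for such leftovers before killing anything, something like the
following could be used. It is only a sketch: the helper name
orphaned_fileservers is made up, and it assumes a procps-style ps and that
the process name is exactly "fileserver". It prints the PIDs of fileserver
processes whose parent is PID 1, i.e. those that no longer belong to the
bosserver.

```shell
# Hypothetical helper (not part of OpenAFS): reads "pid ppid command" lines
# on stdin and prints the PID of every fileserver whose parent is PID 1.
orphaned_fileservers() {
    awk '$2 == 1 && $3 == "fileserver" { print $1 }'
}

# Example on a live server; this only lists PIDs and kills nothing:
#   ps -e -o pid= -o ppid= -o comm= | orphaned_fileservers
```

Any PID it prints is a candidate for killing by hand, as suggested above.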

> 
> -- Nathan
> 
> On Mon, 2003-01-27 at 06:05, Klaas Hagemann wrote:
> 
>>Hartmut Reuter wrote:
>>
>>>If the fileservers have pid 1 as parent they are probably leftovers of
>>>a restart, and if this happened on Sunday I would guess from the regular
>>>restart on Sunday morning at 4:00 (try bos getrestart).
>>
>>It is not caused by the restart on Sunday morning. It happens from time
>>to time and I cannot reproduce it.
>>
>>I have posted some log files and I am preparing to collect some
>>debugging information. As far as I can see, the fileserver process
>>seems to hit a memory access violation (segmentation fault).
>>
>>It did not happen in my test environment, so I think it only happens
>>when more clients are accessing the AFS fileserver. So I would like to
>>know whether there are any kernel parameters that should be set.
>>
>>
>>Klaas
>>
>>
>>>If the old fileservers don't go away the newly started fileservers will
>>>give up after a time because of "bind failed". Then the new bosserver
>>>will restart the fileserver and, because the old one didn't shut down
>>>cleanly, it will start the salvager first.
>>>
>>>So make sure the old fileservers go away (if nothing else helps, kill
>>>them by hand). Perhaps you had better set restart to 'never' until you
>>>have solved the problem.
>>>
>>>Hartmut Reuter
>>>
>>>
>>>
>>>Klaas Hagemann wrote:
>>>
>>>
>>>>Derrick J Brashear wrote:
>>>>
>>>>
>>>>>On Fri, 24 Jan 2003, Klaas Hagemann wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Salvaging starts very often, and fileserver processes keep running
>>>>>>but no longer have the bosserver as ppid.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>Either the bosserver is dying and something is restarting it (doubt
>>>>>it) or, more likely, the "main" pthread is dying but the rest stay
>>>>>running. strace output, core.(pid) or logs might be helpful.
>>>>
>>>>
>>>>This Sunday the error occurred again on another file server.
>>>>All the fileserver processes have 1 as their parent pid and the volumes
>>>>are no longer accessible. I am not sure whether the bosserver was still
>>>>running or not, because my colleague restarted it.
>>>>
>>>>The AFS logs are empty because they were deleted on the new startup. I
>>>>will keep them next time.
>>>>
>>>>The file servers are running SuSE Linux 7.3. Are there any kernel
>>>>parameters that could be set? We had OpenAFS running in our test
>>>>environment without any problems, so I think this problem only
>>>>occurs when many clients access the file server.
>>>>
>>>>I will post log files when I get them, but any help or suggestions
>>>>are very welcome.
>>>>
>>>>Thanks
>>>>Klaas
>>>>
>>>>
>>>>>
>>>>>_______________________________________________
>>>>>OpenAFS-info mailing list
>>>>>OpenAFS-info@openafs.org
>>>>>https://lists.openafs.org/mailman/listinfo/openafs-info
>>>>>