[OpenAFS] file-server: salvaging
Klaas Hagemann
kerberos@northsailor.de
Mon, 27 Jan 2003 16:58:50 +0100
Neulinger, Nathan wrote:
> You might try running the LWP fileserver instead of the pthread one. It
> may help you out.
OK, that seems to be a good hint. The system has been running more stably
for the past few minutes. But I am still cautious about whether this fix
really solves my problem and whether I can put the system back into production.
Can you explain why this fix seems to help, or where I can find more
information about the difference between the LWP and the pthread versions
of the fileserver?
Does it have to do with the fact that in the pthread version the
number of open files for each fileserver process grows steadily to a limit
of currently 928 (lsof | grep 15042 | wc -l, where 15042 is the PID of the
fileserver) before the process hangs, while in the LWP version the number
of open files grows and shrinks again?
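
To watch this growth over time without grepping the full lsof output, a
minimal sketch along the following lines may help. It assumes Linux with
/proc mounted; the function names count_fds and watch_fds are made up for
illustration. The counts will typically be lower than the lsof figure,
because lsof also lists entries such as the working directory and
memory-mapped files, and "grep 15042" matches any line containing that
number, not only the PID column.

```shell
#!/bin/sh
# Hypothetical helpers (not part of OpenAFS): count the open file
# descriptors of a process by listing /proc/PID/fd (Linux only).

count_fds() {
    # $1 = PID; prints the number of entries in /proc/PID/fd.
    # Prints 0 if the directory cannot be read (no such PID, no permission).
    ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

watch_fds() {
    # $1 = PID, $2 = seconds between samples (default 5),
    # $3 = number of samples (default 12).
    # Prints one "HH:MM:SS count" line per sample.
    pid=$1
    interval=${2:-5}
    samples=${3:-12}
    i=0
    while [ "$i" -lt "$samples" ]; do
        printf '%s %s\n' "$(date +%H:%M:%S)" "$(count_fds "$pid")"
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, "watch_fds 15042 10 30" would sample the fileserver from above
every 10 seconds for about five minutes. A count that only ever rises in
the pthread build, but rises and falls in the LWP build, would match the
behaviour described here.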
Thanks in advance,
Klaas
>
> Build from source, and grab fileserver out of the viced/ directory
> instead of the tviced/ one which is installed into dest/ by default.
>
> -- Nathan
>
> ------------------------------------------------------------
> Nathan Neulinger EMail: nneul@umr.edu
> University of Missouri - Rolla Phone: (573) 341-4841
> Computing Services Fax: (573) 341-4216
>
>
>
>>-----Original Message-----
>>From: Klaas Hagemann [mailto:kerberos@northsailor.de]
>>Sent: Monday, January 27, 2003 8:20 AM
>>To: Neulinger, Nathan
>>Cc: Hartmut Reuter; openafs-info@openafs.org
>>Subject: Re: [OpenAFS] file-server: salvaging
>>
>>
>>Nathan Neulinger wrote:
>>
>>>This problem is caused when the fileserver fails for whatever reason,
>>>and due to pthreads, is unable to completely exit. Since it can't
>>>completely exit, bosserver can't start a new one.
>>>
>>>Go in and killall -KILL fileserver. That will clear it up. Doesn't solve
>>>the problem, but will get you back running without a reboot.
>>
>>Thanks.
>>I have got that far, but the problem still occurs from time to time,
>>and because we need high availability I have to solve the underlying
>>problem.
>>
>>
>>>-- Nathan
>>>
>>>On Mon, 2003-01-27 at 06:05, Klaas Hagemann wrote:
>>>
>>>
>>>>Hartmut Reuter wrote:
>>>>
>>>>
>>>>>If the fileservers have pid 1 as father they are probably leftovers of
>>>>>a restart, and if this happened on Sunday I would guess from the regular
>>>>>restart on Sunday morning at 4:00. (try bos getrestart).
>>>>
>>>>It is not caused by the restart on Sunday morning. It happens from time
>>>>to time and I cannot reproduce it.
>>>>
>>>>I have posted some log files and I am preparing to collect some
>>>>debugging information. But as far as I can see, it looks as if the
>>>>fileserver process causes a memory access violation (segmentation
>>>>fault).
>>>>
>>>>It did not happen in my testing environment, so I think it only happens
>>>>when more clients are accessing the AFS fileserver. So I would like to
>>>>know if there are any kernel parameters to be set?
>>>>
>>>>
>>>>Klaas
>>>>
>>>>
>>>>
>>>>>If the old fileservers don't go away the newly started fileservers will
>>>>>give up after a time because of "bind failed". Then the new bosserver
>>>>>will restart the fileserver and, because the old one didn't shut down
>>>>>cleanly, it will first start the salvager.
>>>>>
>>>>>So make sure the old fileservers go away (if nothing else helps kill
>>>>>them by hand). Perhaps you better set restart to 'never' until you have
>>>>>solved the problem.
>>>>>
>>>>>Hartmut Reuter
>>>>>
>>>>>
>>>>>
>>>>>Klaas Hagemann wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Derrick J Brashear wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>On Fri, 24 Jan 2003, Klaas Hagemann wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>salvaging starts very often and file server processes are staying
>>>>>>>>running but do not have the bosserver as ppid.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>either the bosserver is dying and something is restarting it (doubt it)
>>>>>>>or more likely the "main" pthread is dying but the rest stay running.
>>>>>>>strace output, core.(pid) or logs might be helpful.
>>>>>>
>>>>>>
>>>>>>This Sunday the error occurred again on another file server.
>>>>>>All the file server processes have "1" as ppid and the volumes are
>>>>>>not accessible any more. I am not sure whether the bosserver was still
>>>>>>running or not, because my colleague restarted it.
>>>>>>
>>>>>>The AFS logs are empty, because they were deleted on the new startup. I
>>>>>>will keep them the next time.
>>>>>>
>>>>>>The file servers are running on SuSE Linux 7.3. Are there any
>>>>>>kernel parameters which could be set? We had OpenAFS running in our
>>>>>>testing environment without any problems, so I think this problem only
>>>>>>occurs when many clients access the file server.
>>>>>>
>>>>>>I will post any log files when I get them, but any help or suggestions
>>>>>>are very welcome.
>>>>>>
>>>>>>Thanks
>>>>>>Klaas
>>>>>>
>>>>>>
>>>>>>
>>>>>>>_______________________________________________
>>>>>>>OpenAFS-info mailing list
>>>>>>>OpenAFS-info@openafs.org
>>>>>>>https://lists.openafs.org/mailman/listinfo/openafs-info
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>
>>
>>
>