[OpenAFS] file-server: salvaging

Klaas Hagemann kerberos@northsailor.de
Mon, 27 Jan 2003 16:58:50 +0100


Neulinger, Nathan wrote:
> You might try running the LWP fileserver instead of the pthread one. It
> may help you out. 

Ok, that seems to be a good hint. The system has been running more stably
for the past few minutes. But I am still a bit cautious about whether this
fix really solves my problem and whether I can put the system back into
production.

Can you explain why this fix seems to help, or where I can find more
information about the difference between the LWP and the pthread version
of fileserver?

Does it have to do with the fact that in the pthread version the
number of open files for each fileserver process grows steadily up to a
limit of currently 928 (lsof | grep 15042 | wc -l, where 15042 is the PID
of the fileserver) before the process hangs, while in the LWP version the
number of open files grows and shrinks again?
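
For watching the descriptor count over time, something like the following
might be more precise than lsof piped through grep, since lsof also lists
cwd, txt and mem entries and the grep matches the PID anywhere on a line.
This is only a rough sketch, assuming a Linux /proc filesystem; the function
names and the proc_root parameter are made up for illustration.

```python
import os

def count_open_fds(pid, proc_root="/proc"):
    """Return the number of open file descriptors of a process.

    Assumes a Linux-style /proc: every entry in /proc/<pid>/fd is one
    open descriptor. Reading another user's process needs root.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))

def never_shrinks(counts):
    """True if a series of sampled counts only ever grows or stays flat.

    A series that never shrinks matches the pattern described for the
    pthread fileserver; one that goes up and down matches the LWP one.
    """
    return all(later >= earlier for earlier, later in zip(counts, counts[1:]))

# Example use (15042 is the fileserver PID from the lsof call above):
#   samples = []
#   samples.append(count_open_fds(15042))   # repeat every minute or so
#   print(samples, never_shrinks(samples))
```

Sampling this every minute or so and keeping the numbers should show whether
the count really climbs toward a fixed limit before a hang.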

Thanks in advance,

Klaas

> 
> Build from source, and grab fileserver out of the viced/ directory
> instead of the tviced/ one which is installed into dest/ by default. 
> 
> -- Nathan
> 
> ------------------------------------------------------------
> Nathan Neulinger                       EMail:  nneul@umr.edu
> University of Missouri - Rolla         Phone: (573) 341-4841
> Computing Services                       Fax: (573) 341-4216
> 
> 
> 
>>-----Original Message-----
>>From: Klaas Hagemann [mailto:kerberos@northsailor.de] 
>>Sent: Monday, January 27, 2003 8:20 AM
>>To: Neulinger, Nathan
>>Cc: Hartmut Reuter; openafs-info@openafs.org
>>Subject: Re: [OpenAFS] file-server: salvaging
>>
>>
>>Nathan Neulinger wrote:
>>
>>>This problem is caused when the fileserver fails for whatever reason,
>>>and due to pthreads, is unable to completely exit. Since it can't
>>>completely exit, bosserver can't start a new one.
>>>
>>>Go in and killall -KILL fileserver. That will clear it up. Doesn't solve
>>>the problem, but will get you back running without a reboot.
>>
>>Thanks.
>>I have got that far, but the problem still occurs from time to time,
>>and because of our high-availability requirements I need to solve it.
>>
>>
>>>-- Nathan
>>>
>>>On Mon, 2003-01-27 at 06:05, Klaas Hagemann wrote:
>>>
>>>
>>>>Hartmut Reuter wrote:
>>>>
>>>>
>>>>>If the fileservers have pid 1 as parent they are probably leftovers of
>>>>>a restart, and if this happened on Sunday I would guess from the regular
>>>>>restart on Sunday morning at 4:00 (try bos getrestart).
>>>>
>>>>It is not caused by the restart on Sunday morning. It happens from time
>>>>to time and I cannot reproduce it.
>>>>
>>>>I have posted some log files and I am preparing to collect some
>>>>debugging information. But as far as I can see, it seems as if the
>>>>fileserver process produces a memory address violation (segmentation
>>>>fault).
>>>>
>>>>It did not happen in my testing environment, so I think it only happens
>>>>when more clients are accessing the AFS fileserver. So I would like to
>>>>know if there are any kernel parameters to be set?
>>>>
>>>>
>>>>Klaas
>>>>
>>>>
>>>>
>>>>>If the old fileservers don't go away, the newly started fileservers will
>>>>>give up after a time because of "bind failed". Then the new bosserver
>>>>>will restart the fileserver, and because the old one didn't shut down
>>>>>cleanly it will first start the salvager.
>>>>>
>>>>>So make sure the old fileservers go away (if nothing else helps, kill
>>>>>them by hand). Perhaps you had better set restart to 'never' until you
>>>>>have solved the problem.
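
To check for such leftovers before killing anything by hand, the process
table can be scanned for fileserver processes whose parent is pid 1. The
following is only a sketch, assuming a Linux /proc filesystem and that the
process name shows up as "fileserver"; the function names and the proc_root
parameter are hypothetical.

```python
import os

def parse_stat(text):
    """Return (comm, ppid) from the contents of a /proc/<pid>/stat file."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so locate it via the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    comm = text[left + 1:right]
    rest = text[right + 1:].split()
    # rest[0] is the process state, rest[1] the parent pid.
    return comm, int(rest[1])

def orphaned_pids(name="fileserver", proc_root="/proc"):
    """List pids of processes called `name` whose parent is pid 1."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, ppid = parse_stat(f.read())
        except OSError:
            continue  # process went away while scanning
        if comm == name and ppid == 1:
            found.append(int(entry))
    return sorted(found)
```

An empty list from orphaned_pids() would suggest that no old fileserver is
still hanging around; anything it returns is a candidate for the manual kill
mentioned above.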
>>>>>
>>>>>Hartmut Reuter
>>>>>
>>>>>
>>>>>
>>>>>Klaas Hagemann wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Derrick J Brashear wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>On Fri, 24 Jan 2003, Klaas Hagemann wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>salvaging starts very often and file server processes are staying
>>>>>>>>running but do not have the bosserver as ppid.
>>>>>>>
>>>>>>>either the bosserver is dying and something restarting it (doubt it) or
>>>>>>>more likely the "main" pthread is dying but the rest stay running.
>>>>>>>strace output, core.(pid) or logs might be helpful.
>>>>>>
>>>>>>
>>>>>>This Sunday this error occurred again on another file-server.
>>>>>>All the file-server processes have "1" as ppid and the volumes are
>>>>>>not accessible any more. I am not sure whether the bosserver was still
>>>>>>running or not, because my colleague restarted it.
>>>>>>
>>>>>>The AFS logs are empty, because they were deleted on the new startup. I
>>>>>>will keep them the next time.
>>>>>>
>>>>>>The file servers are running on SuSE Linux 7.3. Are there any
>>>>>>kernel parameters which could be set? We had OpenAFS running in our
>>>>>>testing environment without any problems, so I think this problem only
>>>>>>occurs when many clients access the file-server.
>>>>>>
>>>>>>I will post any log files when I get them, but any help or suggestions
>>>>>>are very welcome.
>>>>>>
>>>>>>Thanks
>>>>>>Klaas
>>>>>>
>>>>>>
>>>>>>
>>>>>>>_______________________________________________
>>>>>>>OpenAFS-info mailing list
>>>>>>>OpenAFS-info@openafs.org
>>>>>>>https://lists.openafs.org/mailman/listinfo/openafs-info
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>
>>
>>