[OpenAFS] file-server: salvaging

Neulinger, Nathan nneul@umr.edu
Mon, 27 Jan 2003 10:00:32 -0600


No idea, but that's an interesting observation. Maybe someone more
familiar with the differences between the two will have an idea.
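
One way to check the open-file observation more precisely: lsof also
lists entries that are not file descriptors (cwd, txt, mem), and grepping
for the PID matches any line that happens to contain that number, so the
lsof count can overstate the real descriptor count. Below is a minimal
sketch, assuming a Linux host with /proc mounted. The names
count_open_fds, sample_fds and only_grows are made up for illustration
and are not part of OpenAFS.

```python
import os
import time


def count_open_fds(pid, proc_root="/proc"):
    """Return the number of open file descriptors of pid, or None if unreadable."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    try:
        # Each entry in /proc/<pid>/fd is one open descriptor.
        return len(os.listdir(fd_dir))
    except OSError:
        # Process is gone, or we lack permission to look at it.
        return None


def sample_fds(pid, count=5, interval=1.0, proc_root="/proc"):
    """Collect up to `count` descriptor counts, `interval` seconds apart."""
    samples = []
    for _ in range(count):
        n = count_open_fds(pid, proc_root=proc_root)
        if n is None:
            break
        samples.append(n)
        time.sleep(interval)
    return samples


def only_grows(samples):
    """True if the samples never decrease and end higher than they started."""
    if len(samples) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_shrinks and samples[-1] > samples[0]
```

Something like only_grows(sample_fds(15042, count=60, interval=5.0)),
run against the pthread and then the LWP fileserver, would show whether
the pthread one really never releases descriptors. Reading /proc/<pid>/fd
of a process owned by another user needs root.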

-- Nathan

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-4841
Computing Services                       Fax: (573) 341-4216


> -----Original Message-----
> From: Klaas Hagemann [mailto:kerberos@northsailor.de]
> Sent: Monday, January 27, 2003 9:59 AM
> To: Neulinger, Nathan
> Cc: openafs-info@openafs.org
> Subject: Re: [OpenAFS] file-server: salvaging
>
>
> Neulinger, Nathan wrote:
> > You might try running the LWP fileserver instead of the pthread one. It
> > may help you out.
>
> OK, that seems to be a good hint. The system has been running more
> stably for the past few minutes. But I am still cautious about whether
> this fix really solves my problem so that I can put the system back
> into production.
>
> Can you explain why this fix seems to help, or where I can find more
> information about the difference between the LWP and the pthread
> versions of the fileserver?
>
> Does it have to do with the fact that in the pthread version the
> number of open files for each fileserver process grows steadily to a
> limit of currently 928 (lsof | grep 15042 (PID of fileserver) | wc -l)
> before the process hangs, while in the LWP version the number of open
> files grows and shrinks again?
>
> Thanks in advance,
>
> Klaas
>
> >
> > Build from source, and grab fileserver out of the viced/ directory
> > instead of the tviced/ one which is installed into dest/ by default.
> >
> > -- Nathan
> >
> >>-----Original Message-----
> >>From: Klaas Hagemann [mailto:kerberos@northsailor.de]
> >>Sent: Monday, January 27, 2003 8:20 AM
> >>To: Neulinger, Nathan
> >>Cc: Hartmut Reuter; openafs-info@openafs.org
> >>Subject: Re: [OpenAFS] file-server: salvaging
> >>
> >>
> >>Nathan Neulinger wrote:
> >>
> >>>This problem is caused when the fileserver fails for whatever reason,
> >>>and due to pthreads, is unable to completely exit. Since it can't
> >>>completely exit, bosserver can't start a new one.
> >>>
> >>>Go in and killall -KILL fileserver. That will clear it up. Doesn't solve
> >>>the problem, but will get you back running without a reboot.
> >>
> >>Thanks.
> >>I have gotten that far, but the problem still occurs from time to
> >>time, and because we need high availability I have to solve the
> >>underlying problem.
> >>
> >>
> >>>-- Nathan
> >>>
> >>>On Mon, 2003-01-27 at 06:05, Klaas Hagemann wrote:
> >>>
> >>>
> >>>>Hartmut Reuter wrote:
> >>>>
> >>>>
> >>>>>If the fileservers have pid 1 as their parent, they are probably leftovers
> >>>>>of a restart, and if this happened on Sunday I would guess it was the
> >>>>>regular restart on Sunday morning at 4:00 (try bos getrestart).
> >>>>
> >>>>It is not caused by the restart on Sunday morning. It happens from time
> >>>>to time and I cannot reproduce it.
> >>>>
> >>>>I have posted some log files and I am preparing to collect some
> >>>>debugging information. But as far as I can see, it seems that the
> >>>>fileserver process hits a memory access violation (segmentation
> >>>>fault).
> >>>>
> >>>>It did not happen in my testing environment, so I think it only happens
> >>>>when more clients are accessing the AFS fileserver. So I would like to
> >>>>know if there are any kernel parameters to be set?
> >>>>
> >>>>
> >>>>Klaas
> >>>>
> >>>>
> >>>>
> >>>>>If the old fileservers don't go away, the newly started fileservers will
> >>>>>give up after a time because of "bind failed". Then the new bosserver
> >>>>>will restart the fileserver, and because the old one didn't shut down
> >>>>>cleanly it will start the salvager first.
> >>>>>
> >>>>>So make sure the old fileservers go away (if nothing else helps, kill
> >>>>>them by hand). Perhaps you had better set restart to 'never' until you
> >>>>>have solved the problem.
> >>>>>
> >>>>>Hartmut Reuter
> >>>>>
> >>>>>
> >>>>>
> >>>>>Klaas Hagemann wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>>Derrick J Brashear wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>On Fri, 24 Jan 2003, Klaas Hagemann wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>salvaging starts very often and file server processes are staying
> >>>>>>>>running but do not have the bosserver as ppid.
> >>>>>>>
> >>>>>>>
> >>>>>>>either the bosserver is dying and something is restarting it (doubt it) or
> >>>>>>>more likely the "main" pthread is dying but the rest stay running.
> >>>>>>>strace output, core.(pid) or logs might be helpful.
> >>>>>>
> >>>>>>
> >>>>>>This Sunday the error occurred again on another file server.
> >>>>>>All the fileserver processes have "1" as ppid and the volumes are
> >>>>>>not accessible any more. I am not sure whether the bosserver was still
> >>>>>>running or not, because my colleague restarted it.
> >>>>>>
> >>>>>>The AFS logs are empty, because they were deleted on the new startup. I
> >>>>>>will keep them next time.
> >>>>>>
> >>>>>>The file servers are running on SuSE Linux 7.3. Are there any
> >>>>>>kernel parameters which could be set? We had OpenAFS running in our
> >>>>>>testing environment without any problems, so I think this problem only
> >>>>>>occurs when many clients access the file server.
> >>>>>>
> >>>>>>I will post any log files when I get them, but any help or suggestions
> >>>>>>are very welcome.
> >>>>>>
> >>>>>>Thanks
> >>>>>>Klaas
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>_______________________________________________
> >>>>>>>OpenAFS-info mailing list
> >>>>>>>OpenAFS-info@openafs.org
> >>>>>>>https://lists.openafs.org/mailman/listinfo/openafs-info
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>
> >>
> >>
> >
>