[OpenAFS] fileserver goes down overnight
david l goodrich
dlg@dsrw.org
Fri, 27 Mar 2009 10:14:18 -0500
On Tue, Mar 24, 2009 at 06:32:27PM -0500, david l goodrich wrote:
> On Tue, Mar 24, 2009 at 07:15:46PM -0400, Jason Edgecombe wrote:
> > david l goodrich wrote:
> >> On Tue, Mar 24, 2009 at 10:39:24AM -0700, Russ Allbery wrote:
> >>
> >>> david l goodrich <dlg@dsrw.org> writes:
> >>>
> >>>> The past two nights, I've had one of my AFS fileservers go "down".
> >>>>
> >>>> I say "down" and not down because it's not totally nonfunctional.
> >>>>
> >>>> It thinks it's running fine:
> >>>>
> >>>> sprawl# bos status localhost -localauth
> >>>> Instance fs, currently running normally.
> >>>> Auxiliary status is: file server running.
> >>>>
> >>> bos status -long is generally more useful. However:
> >>>
> >> Can do:
> >> sprawl# bos status localhost -localauth -long
> >> Instance fs, (type is fs) currently running normally.
> >> Auxiliary status is: file server running.
> >> Process last started at Mon Mar 23 17:33:57 2009 (3 proc starts)
> >> Last exit at Mon Mar 23 17:33:57 2009
> >> Command 1 is '/usr/pkg/libexec/openafs/fileserver'
> >> Command 2 is '/usr/pkg/libexec/openafs/volserver'
> >> Command 3 is '/usr/pkg/libexec/openafs/salvager'
> >>
> >> sprawl# ps auxw | grep /openafs/
> >> root 376 0.0 0.0 2316 4 ? DW 5:33PM 0:00.83 /usr/pkg/libexec/openafs/volserver
> >> root 727 0.0 0.0 8664 2384 ? IW<a 5:33PM 0:18.29 /usr/pkg/libexec/openafs/fileserver
> >> root 6739 0.0 0.0 240 4 ttyp0 R+ 12:42PM 0:00.00 grep /openafs/ (ksh)
> >> sprawl#
> >>
> >>
> >>>> but none of the clients (running 1.4.8 and 1.4.6) are able to
> >>>> connect to the volumes on the server, despite believing that
> >>>> dlg@chaos:~$ fs checkservers -fast -all
> >>>> All servers are running.
> >>>> dlg@chaos:~$ vos listvol sprawl
> >>>> Could not fetch the list of partitions from the server
> >>>> Possible communication failure
> >>>> Error in vos listvol command.
> >>>> Possible communication failure
> >>>>
> >>> I suspect your volserver either died or went unresponsive. What version
> >>> of OpenAFS is the fileserver? Is there anything incriminating in
> >>> VolserLog or FileLog?
> >>>
> >>
> >> I should have been more clear - sprawl is the fileserver, it is
> >> running 1.4.6. There doesn't seem to be anything incriminating
> >> in FileLog, but let me turn up debugging on the volserver process
> >> on sprawl.
> >>
> >> Turning on debugging (pkill -TSTP volserver) didn't do much of
> >> anything - VolserLog hasn't been updated since 17:34 yesterday.
> >>
> >> It's short:
> >> sprawl# cat VolserLog
> >> Mon Mar 23 17:33:57 2009 Unable to connect to file server; will retry at need
> >> Mon Mar 23 17:33:57 2009 Starting AFS Volserver 2.0 (/usr/pkg/libexec/openafs/volserver)
> >> sprawl#
> >>
> > Did you run kill -TSTP volserver and fileserver 5 times each? That turns
> > on the maximum amount of debugging.
>
> I think four. I'll go do a fifth after I send this.
>
> The server has spontaneously recovered (seriously. there's
> nothing in the logs) and /vicepa is now accessible locally.
>=20
> I'm suspecting some weird hardware glitch combined with a bug
> Derrick mentioned in 1.4.6 is the cause of this, but I am going
> to leave debugging turned on and see what happens overnight.
>
> Yes, I will post to the list with details.
So nothing untoward has happened since I tightened the screws on
the SCSI cable connecting the drive that houses /vicepa. I
suspect it was a flaky SCSI connection combined with the bug
Derrick mentioned in 1.4.6. Oh well, sorry to bother everyone.
--david
>
> Thanks everyone, this has been a real learning experience for me.
> --david