[OpenAFS] fileserver goes down overnight

david l goodrich dlg@dsrw.org
Fri, 27 Mar 2009 10:14:18 -0500



On Tue, Mar 24, 2009 at 06:32:27PM -0500, david l goodrich wrote:
> On Tue, Mar 24, 2009 at 07:15:46PM -0400, Jason Edgecombe wrote:
> > david l goodrich wrote:
> >> On Tue, Mar 24, 2009 at 10:39:24AM -0700, Russ Allbery wrote:
> >>
> >>> david l goodrich <dlg@dsrw.org> writes:
> >>>
> >>>
> >>>> The past two nights, I've had one of my AFS fileservers go "down".
> >>>>
> >>>> I say "down" and not down because it's not totally nonfunctional.
> >>>>
> >>>> It thinks it's running fine:
> >>>>
> >>>> sprawl# bos status localhost -localauth
> >>>> Instance fs, currently running normally.
> >>>>     Auxiliary status is: file server running.
> >>>>
> >>> bos status -long is generally more useful.  However:
> >>>
> >> Can do:
> >> sprawl# bos status localhost -localauth -long
> >> Instance fs, (type is fs) currently running normally.
> >>     Auxiliary status is: file server running.
> >>     Process last started at Mon Mar 23 17:33:57 2009 (3 proc starts)
> >>     Last exit at Mon Mar 23 17:33:57 2009
> >>     Command 1 is '/usr/pkg/libexec/openafs/fileserver'
> >>     Command 2 is '/usr/pkg/libexec/openafs/volserver'
> >>     Command 3 is '/usr/pkg/libexec/openafs/salvager'
> >>
> >> sprawl# ps auxw | grep /openafs/
> >> root   376  0.0  0.0 2316     4 ?       DW    5:33PM 0:00.83 /usr/pkg/libexec/openafs/volserver
> >> root   727  0.0  0.0 8664  2384 ?       IW<a  5:33PM 0:18.29 /usr/pkg/libexec/openafs/fileserver
> >> root  6739  0.0  0.0  240     4 ttyp0   R+   12:42PM 0:00.00 grep /openafs/ (ksh)
> >> sprawl#
> >>
> >>
> >>>> but none of the clients (running 1.4.8 and 1.4.6) are able to
> >>>> connect to the volumes on the server, despite believing that
> >>>> dlg@chaos:~$ fs checkservers -fast -all
> >>>> All servers are running.
> >>>> dlg@chaos:~$ vos listvol sprawl
> >>>> Could not fetch the list of partitions from the server
> >>>> Possible communication failure
> >>>> Error in vos listvol command.
> >>>> Possible communication failure
> >>>>
> >>> I suspect your volserver either died or went unresponsive.  What version
> >>> of OpenAFS is the fileserver?  Is there anything incriminating in
> >>> VolserLog or FileLog?
> >>>
> >>
> >> I should have been more clear - sprawl is the fileserver, it is
> >> running 1.4.6.  There doesn't seem to be anything incriminating
> >> in FileLog, but let me turn up debugging on the volserver process
> >> on sprawl.
> >>
> >> Turning on debugging (pkill -TSTP volserver) didn't do much of
> >> anything - VolserLog hasn't been updated since 17:34 yesterday.
> >>
> >> It's short:
> >> sprawl# cat VolserLog
> >> Mon Mar 23 17:33:57 2009 Unable to connect to file server; will retry at need
> >> Mon Mar 23 17:33:57 2009 Starting AFS Volserver 2.0 (/usr/pkg/libexec/openafs/volserver)
> >> sprawl#
> >>
> > Did you run kill -TSTP volserver and fileserver 5 times each? That turns
> > on the maximum amount of debugging.
>
> I think four.  I'll go do a fifth after I send this.
>
> The server has spontaneously recovered (seriously.  there's
> nothing in the logs) and /vicepa is now accessible locally.
>
> I'm suspecting some weird hardware glitch combined with a bug
> Derrick mentioned in 1.4.6 is the cause of this, but I am going
> to leave debugging turned on and see what happens overnight.
>
> Yes, I will post to the list with details.

So nothing untoward has happened since I tightened the screws on
the SCSI cable connecting the drive that houses /vicepa.  I
suspect it was a flaky SCSI connection combined with the bug
Derrick mentioned in 1.4.6.  Oh well, sorry to bother everyone.
  --david

>
> Thanks everyone, this has been a real learning experience for me.
>   --david
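
For anyone repeating the debugging step discussed above (sending SIGTSTP
to volserver and fileserver several times), it could be scripted roughly
as follows.  This is only a sketch: raise_debug, PKILL and PAUSE are
names made up for illustration and are not part of OpenAFS.  The only
things taken from the thread are the pkill -TSTP command and the claim
that five signals reach the maximum debug level.

```shell
#!/bin/sh
# Sketch only: raise_debug, PKILL and PAUSE are hypothetical names.
# Sends SIGTSTP to every process whose name exactly matches $1,
# $2 times (default 5, the count suggested in the thread).
raise_debug() {
    name=$1
    count=${2:-5}
    i=0
    while [ "$i" -lt "$count" ]; do
        # PKILL can be overridden (e.g. PKILL=echo) for a dry run.
        ${PKILL:-pkill} -TSTP -x "$name" || return 1
        i=$((i + 1))
        # Brief pause between signals; PAUSE=0 disables it.
        sleep "${PAUSE:-1}"
    done
}
```

Run as root on the fileserver machine, this would be called as
"raise_debug volserver 5" and then "raise_debug fileserver 5".  Setting
PKILL=echo first prints the commands instead of sending any signals.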


