[OpenAFS] fileserver goes down overnight

david l goodrich dlg@dsrw.org
Tue, 24 Mar 2009 12:47:19 -0500


On Tue, Mar 24, 2009 at 10:39:24AM -0700, Russ Allbery wrote:
> david l goodrich <dlg@dsrw.org> writes:
>
> > The past two nights, I've had one of my AFS fileservers go "down".
> >
> > I say "down" and not down because it's not totally nonfunctional.
> >
> > It thinks it's running fine:
> >
> > sprawl# bos status localhost -localauth
> > Instance fs, currently running normally.
> >     Auxiliary status is: file server running.
>
> bos status -long is generally more useful.  However:
Can do:
sprawl# bos status localhost -localauth -long
Instance fs, (type is fs) currently running normally.
    Auxiliary status is: file server running.
    Process last started at Mon Mar 23 17:33:57 2009 (3 proc starts)
    Last exit at Mon Mar 23 17:33:57 2009
    Command 1 is '/usr/pkg/libexec/openafs/fileserver'
    Command 2 is '/usr/pkg/libexec/openafs/volserver'
    Command 3 is '/usr/pkg/libexec/openafs/salvager'

sprawl# ps auxw | grep /openafs/
root   376  0.0  0.0 2316     4 ?       DW    5:33PM 0:00.83 /usr/pkg/libexec/openafs/volserver
root   727  0.0  0.0 8664  2384 ?       IW<a  5:33PM 0:18.29 /usr/pkg/libexec/openafs/fileserver
root  6739  0.0  0.0  240     4 ttyp0   R+   12:42PM 0:00.00 grep /openafs/ (ksh)
sprawl#

>
> > but none of the clients (running 1.4.8 and 1.4.6) are able to
> > connect to the volumes on the server, despite believing that
> > dlg@chaos:~$ fs checkservers -fast -all
> > All servers are running.
> > dlg@chaos:~$ vos listvol sprawl
> > Could not fetch the list of partitions from the server
> > Possible communication failure
> > Error in vos listvol command.
> > Possible communication failure
>
> I suspect your volserver either died or went unresponsive.  What version
> of OpenAFS is the fileserver?  Is there anything incriminating in
> VolserLog or FileLog?

I should have been clearer: sprawl is the fileserver, and it is
running 1.4.6.  There doesn't seem to be anything incriminating
in FileLog, but let me turn up debugging on the volserver process
on sprawl.

Turning on debugging (pkill -TSTP volserver) didn't do much of
anything - VolserLog hasn't been updated since 17:34 yesterday.

It's short:
sprawl# cat VolserLog
Mon Mar 23 17:33:57 2009 Unable to connect to file server; will retry at need
Mon Mar 23 17:33:57 2009 Starting AFS Volserver 2.0 (/usr/pkg/libexec/openafs/volserver)
sprawl#
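
A log that has stopped receiving writes, as VolserLog has here, can be
spotted by checking its modification time.  This is only a minimal
sketch, assuming a find(1) that supports -mmin; the log_is_stale helper
and the 60-minute threshold are hypothetical, and the default file name
is simply the VolserLog in the current directory, as in the cat above.

```shell
#!/bin/sh
# Hypothetical helper, not part of OpenAFS.
# Usage: log_is_stale FILE MINUTES
# Succeeds (exit 0) only if FILE exists and was last modified more
# than MINUTES minutes ago.
log_is_stale() {
    file=$1
    minutes=$2
    # find prints the path only when the file is older than the limit;
    # a missing file produces no output, so the test below fails.
    [ -n "$(find "$file" -mmin +"$minutes" 2>/dev/null)" ]
}

# Example run: the file name and threshold are assumptions.
LOG=${1:-VolserLog}
if log_is_stale "$LOG" 60; then
    echo "stale: $LOG has not been written to for over an hour"
else
    echo "fresh or missing: $LOG"
fi
```

A stale log only shows that the process has stopped writing; it does
not say why, so it narrows the question rather than answering it.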

Thanks!
  --david
