[OpenAFS] fileserver goes down overnight
david l goodrich
dlg@dsrw.org
Tue, 24 Mar 2009 18:32:27 -0500
On Tue, Mar 24, 2009 at 07:15:46PM -0400, Jason Edgecombe wrote:
> david l goodrich wrote:
>> On Tue, Mar 24, 2009 at 10:39:24AM -0700, Russ Allbery wrote:
>>
>>> david l goodrich <dlg@dsrw.org> writes:
>>>
>>>
>>>> The past two nights, I've had one of my AFS fileservers go "down".
>>>>
>>>> I say "down" and not down because it's not totally nonfunctional.
>>>>
>>>> It thinks it's running fine:
>>>>
>>>> sprawl# bos status localhost -localauth
>>>> Instance fs, currently running normally.
>>>> Auxiliary status is: file server running.
>>>>
>>> bos status -long is generally more useful. However:
>>>
>> Can do:
>> sprawl# bos status localhost -localauth -long
>> Instance fs, (type is fs) currently running normally.
>> Auxiliary status is: file server running.
>> Process last started at Mon Mar 23 17:33:57 2009 (3 proc starts)
>> Last exit at Mon Mar 23 17:33:57 2009
>> Command 1 is '/usr/pkg/libexec/openafs/fileserver'
>> Command 2 is '/usr/pkg/libexec/openafs/volserver'
>> Command 3 is '/usr/pkg/libexec/openafs/salvager'
>>
>> sprawl# ps auxw | grep /openafs/
>> root 376 0.0 0.0 2316 4 ? DW 5:33PM 0:00.83 /usr/pkg/libexec/openafs/volserver
>> root 727 0.0 0.0 8664 2384 ? IW<a 5:33PM 0:18.29 /usr/pkg/libexec/openafs/fileserver
>> root 6739 0.0 0.0 240 4 ttyp0 R+ 12:42PM 0:00.00 grep /openafs/ (ksh)
>> sprawl#
>>
>>
>>>> but none of the clients (running 1.4.8 and 1.4.6) are able to
>>>> connect to the volumes on the server, despite believing that
>>>> dlg@chaos:~$ fs checkservers -fast -all
>>>> All servers are running.
>>>> dlg@chaos:~$ vos listvol sprawl
>>>> Could not fetch the list of partitions from the server
>>>> Possible communication failure
>>>> Error in vos listvol command.
>>>> Possible communication failure
>>>>
>>> I suspect your volserver either died or went unresponsive. What version
>>> of OpenAFS is the fileserver? Is there anything incriminating in
>>> VolserLog or FileLog?
>>>
>>
>> I should have been more clear - sprawl is the fileserver, it is
>> running 1.4.6. There doesn't seem to be anything incriminating
>> in FileLog, but let me turn up debugging on the volserver process
>> on sprawl.
>>
>> Turning on debugging (pkill -TSTP volserver) didn't do much of
>> anything - VolserLog hasn't been updated since 17:34 yesterday.
>>
>> It's short:
>> sprawl# cat VolserLog
>> Mon Mar 23 17:33:57 2009 Unable to connect to file server; will retry at need
>> Mon Mar 23 17:33:57 2009 Starting AFS Volserver 2.0 (/usr/pkg/libexec/openafs/volserver)
>> sprawl#
>>
> Did you run kill -TSTP volserver and fileserver 5 times each? That turns
> on the maximum amount of debugging.
I think four. I'll go do a fifth after I send this.
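For anyone following along, here is a rough sketch of sending the
repeated signals in one go rather than counting by hand. It assumes
pkill is available and that the process names are volserver and
fileserver, as in the ps output above. bump_debug, SIGNAL_CMD, and PAUSE
are names made up for this sketch; they are not part of OpenAFS.

```shell
#!/bin/sh
# Sketch only: send SIGTSTP to a server process several times in a row.
# SIGNAL_CMD and PAUSE are hypothetical knobs (handy for a dry run with
# SIGNAL_CMD=echo); they are not OpenAFS settings.
SIGNAL_CMD="${SIGNAL_CMD:-pkill}"
PAUSE="${PAUSE:-1}"

bump_debug() {
    # $1 = process name, $2 = number of SIGTSTP signals to send
    name="$1"
    count="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        $SIGNAL_CMD -TSTP "$name"
        sleep "$PAUSE"   # leave a gap so each signal is handled separately
        i=$((i + 1))
    done
}

# Example (run as root on the fileserver):
#   bump_debug volserver 5
#   bump_debug fileserver 5
```

Setting SIGNAL_CMD=echo first prints the commands instead of sending any
signals, which is a cheap way to check the loop before touching a live
server.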
The server has spontaneously recovered (seriously, there's
nothing in the logs) and /vicepa is now accessible locally.
I suspect the cause is some weird hardware glitch combined with
a bug in 1.4.6 that Derrick mentioned, but I am going to leave
debugging turned on and see what happens overnight.
Yes, I will post to the list with details.
Thanks everyone, this has been a real learning experience for me.
--david