[OpenAFS] fileserver goes down overnight

david l goodrich dlg@dsrw.org
Tue, 24 Mar 2009 18:32:27 -0500


On Tue, Mar 24, 2009 at 07:15:46PM -0400, Jason Edgecombe wrote:
> david l goodrich wrote:
>> On Tue, Mar 24, 2009 at 10:39:24AM -0700, Russ Allbery wrote:
>>
>>> david l goodrich <dlg@dsrw.org> writes:
>>>
>>>
>>>> The past two nights, I've had one of my AFS fileservers go "down".
>>>>
>>>> I say "down" and not down because it's not totally nonfunctional.
>>>>
>>>> It thinks it's running fine:
>>>>
>>>> sprawl# bos status localhost -localauth
>>>> Instance fs, currently running normally.
>>>>     Auxiliary status is: file server running.
>>>>
>>> bos status -long is generally more useful.  However:
>>>
>> Can do:
>> sprawl# bos status localhost -localauth -long
>> Instance fs, (type is fs) currently running normally.
>>     Auxiliary status is: file server running.
>>     Process last started at Mon Mar 23 17:33:57 2009 (3 proc starts)
>>     Last exit at Mon Mar 23 17:33:57 2009
>>     Command 1 is '/usr/pkg/libexec/openafs/fileserver'
>>     Command 2 is '/usr/pkg/libexec/openafs/volserver'
>>     Command 3 is '/usr/pkg/libexec/openafs/salvager'
>>
>> sprawl# ps auxw | grep /openafs/
>> root   376  0.0  0.0 2316     4 ?       DW    5:33PM 0:00.83 /usr/pkg/libexec/openafs/volserver
>> root   727  0.0  0.0 8664  2384 ?       IW<a  5:33PM 0:18.29 /usr/pkg/libexec/openafs/fileserver
>> root  6739  0.0  0.0  240     4 ttyp0   R+   12:42PM 0:00.00 grep /openafs/ (ksh)
>> sprawl#
>>
>>
>>>> but none of the clients (running 1.4.8 and 1.4.6) are able to
>>>> connect to the volumes on the server, despite believing that
>>>> dlg@chaos:~$ fs checkservers -fast -all
>>>> All servers are running.
>>>> dlg@chaos:~$ vos listvol sprawl
>>>> Could not fetch the list of partitions from the server
>>>> Possible communication failure
>>>> Error in vos listvol command.
>>>> Possible communication failure
>>>>
>>> I suspect your volserver either died or went unresponsive.  What version
>>> of OpenAFS is the fileserver?  Is there anything incriminating in
>>> VolserLog or FileLog?
>>>
>>
>> I should have been more clear - sprawl is the fileserver, it is
>> running 1.4.6.  There doesn't seem to be anything incriminating
>> in FileLog, but let me turn up debugging on the volserver process
>> on sprawl.
>>
>> Turning on debugging (pkill -TSTP volserver) didn't do much of
>> anything - VolserLog hasn't been updated since 17:34 yesterday.
>>
>> It's short:
>> sprawl# cat VolserLog
>> Mon Mar 23 17:33:57 2009 Unable to connect to file server; will retry at need
>> Mon Mar 23 17:33:57 2009 Starting AFS Volserver 2.0 (/usr/pkg/libexec/openafs/volserver)
>> sprawl#
>>
> Did you run kill -TSTP volserver and fileserver 5 times each? That turns
> on the maximum amount of debugging.

I think four.  I'll go do a fifth after I send this.
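
For reference, a minimal sketch of sending the signal a fixed number
of times, so the count is not left to memory. It assumes, per the
quoted suggestion, that five SIGTSTPs reach the maximum debug level.
The function name raise_debug and the PKILL and PAUSE variables are
made up for illustration; only "pkill -TSTP <name>" comes from this
thread.

```shell
#!/bin/sh
# Send SIGTSTP to every process matching a name, a given number of times.
# raise_debug, PKILL and PAUSE are hypothetical names for this sketch.
# PKILL can be overridden (e.g. PKILL=echo) to dry-run the loop.
raise_debug() {
    name=$1
    count=${2:-5}      # five times, as suggested in the quoted reply
    i=0
    while [ "$i" -lt "$count" ]; do
        # Stop early if no matching process was signalled.
        ${PKILL:-pkill} -TSTP "$name" || return 1
        i=$((i + 1))
        # Short pause so each signal is handled before the next arrives.
        sleep "${PAUSE:-1}"
    done
}

# Usage, as root on the fileserver host:
#   raise_debug volserver 5
#   raise_debug fileserver 5
```

Running it with PKILL=echo prints the commands instead of signalling
anything, which is a cheap way to check the count before using it on a
live server.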

The server has spontaneously recovered (seriously, there's
nothing in the logs) and /vicepa is now accessible locally.

I suspect that some weird hardware glitch, combined with a bug in
1.4.6 that Derrick mentioned, is the cause of this, but I am going
to leave debugging turned on and see what happens overnight.
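
Since bos status and fs checkservers both reported success while vos
listvol failed, one way to pin down when the server stops answering
overnight is to log the result of each check with a timestamp. A rough
sketch follows; the probe function and the LOG path are hypothetical,
and it assumes each command exits nonzero on failure. The commands in
the usage comment are the ones already shown in this thread.

```shell
#!/bin/sh
# Run one check command and append a timestamped OK/FAIL line to a log.
# probe and LOG are hypothetical names for this sketch.
LOG=${LOG:-/tmp/afs-probe.log}

probe() {
    label=$1
    shift
    # Judge the check only by its exit status; discard its output.
    if "$@" >/dev/null 2>&1; then
        status=OK
    else
        status=FAIL
    fi
    printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$label" "$status" >> "$LOG"
    [ "$status" = OK ]
}

# Usage from a client, for example every few minutes from cron:
#   probe listvol vos listvol sprawl
#   probe checkservers fs checkservers -fast -all
```

Comparing the first FAIL timestamp for listvol against FileLog and
VolserLog on the server should narrow down what was happening at the
moment it went quiet.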

Yes, I will post to the list with details.

Thanks everyone, this has been a real learning experience for me.
  --david
