[OpenAFS] fileserver goes down overnight

Jason Edgecombe jason@rampaginggeek.com
Tue, 24 Mar 2009 19:15:46 -0400


david l goodrich wrote:
> On Tue, Mar 24, 2009 at 10:39:24AM -0700, Russ Allbery wrote:
>   
>> david l goodrich <dlg@dsrw.org> writes:
>>
>>     
>>> The past two nights, I've had one of my AFS fileservers go "down".
>>>
>>> I say "down" and not down because it's not totally nonfunctional.
>>>
>>> It thinks it's running fine:
>>>
>>> sprawl# bos status localhost -localauth
>>> Instance fs, currently running normally.
>>>     Auxiliary status is: file server running.
>>>       
>> bos status -long is generally more useful.  However:
>>     
> Can do:
> sprawl# bos status localhost -localauth -long
> Instance fs, (type is fs) currently running normally.
>     Auxiliary status is: file server running.
>     Process last started at Mon Mar 23 17:33:57 2009 (3 proc starts)
>     Last exit at Mon Mar 23 17:33:57 2009
>     Command 1 is '/usr/pkg/libexec/openafs/fileserver'
>     Command 2 is '/usr/pkg/libexec/openafs/volserver'
>     Command 3 is '/usr/pkg/libexec/openafs/salvager'
>
> sprawl# ps auxw | grep /openafs/
> root   376  0.0  0.0 2316     4 ?       DW    5:33PM 0:00.83 /usr/pkg/libexec/openafs/volserver
> root   727  0.0  0.0 8664  2384 ?       IW<a  5:33PM 0:18.29 /usr/pkg/libexec/openafs/fileserver
> root  6739  0.0  0.0  240     4 ttyp0   R+   12:42PM 0:00.00 grep /openafs/ (ksh)
> sprawl#
>
>   
>>> but none of the clients (running 1.4.8 and 1.4.6) are able to
>>> connect to the volumes on the server, despite believing that
>>> all servers are up:
>>> dlg@chaos:~$ fs checkservers -fast -all
>>> All servers are running.
>>> dlg@chaos:~$ vos listvol sprawl
>>> Could not fetch the list of partitions from the server
>>> Possible communication failure
>>> Error in vos listvol command.
>>> Possible communication failure
>>>       
>> I suspect your volserver either died or went unresponsive.  What version
>> of OpenAFS is the fileserver?  Is there anything incriminating in
>> VolserLog or FileLog?
>>     
>
> I should have been more clear - sprawl is the fileserver, it is
> running 1.4.6.  There doesn't seem to be anything incriminating
> in FileLog, but let me turn up debugging on the volserver process
> on sprawl.
>
> Turning on debugging (pkill -TSTP volserver) didn't do much of
> anything - VolserLog hasn't been updated since 17:34 yesterday.
>
> It's short:
> sprawl# cat VolserLog
> Mon Mar 23 17:33:57 2009 Unable to connect to file server; will retry at need
> Mon Mar 23 17:33:57 2009 Starting AFS Volserver 2.0 (/usr/pkg/libexec/openafs/volserver)
> sprawl#
>   
Did you send TSTP (e.g. pkill -TSTP) to both volserver and fileserver
5 times each? Each signal raises the debug level, so that turns on the
maximum amount of debugging.
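
Something along these lines would do it. This is only a rough sketch:
bump_debug is a made-up helper name, and the PKILL and BUMP_DELAY
variables are my own additions so the loop can be dry-run with
PKILL=echo before it touches a live server. The count of 5 is just the
number mentioned above.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenAFS): send SIGTSTP to every
# process matching a name, a given number of times.
#   $1 = process name to signal (e.g. volserver)
#   $2 = how many times to send the signal (defaults to 5)
# PKILL and BUMP_DELAY are assumptions added for dry runs and pacing.
bump_debug() {
    name=$1
    count=${2:-5}
    i=0
    while [ "$i" -lt "$count" ]; do
        # Stop early if the signal could not be delivered
        # (for example, no matching process is running).
        ${PKILL:-pkill} -TSTP "$name" || return 1
        i=$((i + 1))
        # Short pause so each signal is handled before the next one.
        sleep "${BUMP_DELAY:-1}"
    done
}
```

Run "bump_debug volserver 5" and "bump_debug fileserver 5" as root on
sprawl, then check whether VolserLog and FileLog start getting new
entries.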

Thanks,
Jason