[OpenAFS] Fileserver process hung on startup

John Morris openafs@butchwax.com
29 Mar 2004 20:41:07 -0600


# ping -c 1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 : 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.099 ms

--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% loss, time 0ms
rtt min/avg/max/mdev = 0.099/0.099/0.099/0.000 ms
# ifconfig lo
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:109901 errors:0 dropped:0 overruns:0 frame:0
          TX packets:109901 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:10104040 (9.6 Mb)  TX bytes:10104040 (9.6 Mb)

#

Sure is.

Judging by the fileserver's complete unresponsiveness, and the fact that
it's not even making system calls, my guess is that something is
hanging, maybe inside a system call.  Is there any documentation on the
'-d #' argument to the fileserver executable's command line?  What debug
level could I give it so that it would give me a clue about what it's
doing?  Where else can I look for debugging information besides the
sources I listed in my first email (included below as a reminder)?

Thanks again!

	John



On Mon, 2004-03-29 at 17:08, Derrick J Brashear wrote:
> On Mon, 29 Mar 2004, John Morris wrote:
> 
> > Oops, 1.2.11.  :)
> >
> > I tried this, but no effect, the fileserver is still not listening on
> > 2040.  What should this have changed?
> 
> is the loopback interface up?
> 
> ifconfig lo
> 
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info


Hi!  See what y'all can do with this.

OpenAFS 1.2.11, custom SMP kernel 2.4.23.

Three-fileserver cell; one fileserver, kug, suddenly stopped serving
files; clients see 'connection timed out'.

AFS server processes seem to be running normally as reported by bos
status.

  # bos status kug -long -local
  Instance ptserver, (type is simple) currently running normally.
      Process last started at Sun Mar 28 16:51:58 2004 (2 proc starts)
      Last exit at Sun Mar 28 16:51:55 2004
      Command 1 is '/usr/afs/bin/ptserver'

  Instance vlserver, (type is simple) currently running normally.
      Process last started at Sun Mar 28 16:51:58 2004 (2 proc starts)
      Last exit at Sun Mar 28 16:51:55 2004
      Command 1 is '/usr/afs/bin/vlserver'

  Instance fs, (type is fs) currently running normally.
      Auxiliary status is: file server running.
      Process last started at Mon Mar 29 01:29:36 2004 (11 proc starts)
      Last exit at Mon Mar 29 01:29:36 2004
      Last error exit at Mon Mar 29 01:29:36 2004, by vol, by exiting with code 1
      Command 1 is '/usr/afs/bin/fileserver'
      Command 2 is '/usr/afs/bin/volserver'
      Command 3 is '/usr/afs/bin/salvager'
  #

Nothing is listening on port 2040:

  # netstat -tl | grep 2040
  # 
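
A caveat on the check above, offered as a sketch rather than a
diagnosis: without -n, netstat prints a service name in place of the
port number whenever /etc/services has a matching entry, so a grep for
2040 can come up empty even when something is listening.  The helper
below reads `netstat -tln` style output on stdin and looks for a TCP
listener on the given port.  The name has_listener is made up for this
example (it is not part of OpenAFS or net-tools), and the parsing
assumes the usual net-tools column layout and that the port is TCP, as
the -t in the original command implies.

```shell
# Succeed if the netstat listing on stdin shows a TCP socket in the
# LISTEN state on the given port.  Hypothetical helper for illustration.
has_listener() {
    port="$1"
    awk -v p="$port" '
        # Column 4 is the local address (addr:port); the last column is the state.
        $1 ~ /^tcp/ && $NF == "LISTEN" {
            n = split($4, parts, ":")
            if (parts[n] == p) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage (numeric output, so no service-name substitution):
#   netstat -tln | has_listener 2040 && echo listening || echo "not listening"
```

If this also reports nothing for 2040, the original conclusion stands.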

Because 2040 is not open, I get these errors:

   FSYNC_clientInit temporary failure (will retry): Connection refused

Any fs commands on kug's filesystems hang for a long time before timing
out.

strace on the fileserver process shows it in a seemingly hung state,
i.e. no system calls until the process is killed.
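
A silent strace can mean two different things: the process is blocked
inside a single call that never returns, or it is spinning in user
space.  The state column from ps helps tell the two apart.  A minimal
sketch, assuming procps-style state letters; describe_state is a
hypothetical helper name, and the pid lookup in the usage line is left
to the reader:

```shell
# Map the first letter of a ps STAT value to a rough interpretation.
# Hypothetical helper for illustration; trailing flags such as "s" or
# "<" after the first letter are ignored.
describe_state() {
    case "$1" in
        R*) echo "running (possibly spinning in user space)" ;;
        D*) echo "uninterruptible sleep (blocked in the kernel, often on I/O)" ;;
        S*) echo "interruptible sleep (usually waiting inside a system call)" ;;
        T*) echo "stopped or being traced" ;;
        Z*) echo "zombie" ;;
        *)  echo "unrecognized state: $1" ;;
    esac
}

# Usage, with $pid set to the fileserver's process id:
#   describe_state "$(ps -o stat= -p "$pid")"
```

A state of R with no system calls points at a user-space loop, where a
debugger backtrace is the natural next step; D points at the kernel or
the underlying storage instead.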

Haven't noticed anything else funny about /vicepa; salvages complete
with no errors.

Volume DB is frozen as long as fileserver process is running; once
fileserver is killed, voldb comes back online.

lsof shows that kug's fileserver process has much the same files open
as a normally running fileserver process on another server, except for
localhost:2040 and, of course, the /vicepa files.
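
For anyone repeating the lsof comparison, here is one way it could be
done mechanically.  This is a sketch only: open_names is a made-up
helper, the file names kug.names and good.names are placeholders, and
it assumes lsof's field output (-F n), where each open file's name
appears on a line starting with "n".

```shell
# Reduce `lsof -F n` output on stdin to a sorted, de-duplicated list of
# open file names.  Hypothetical helper for illustration.
open_names() {
    # Keep only the "n" (name) lines, strip the prefix, and sort in the
    # C locale so that comm sees a consistent ordering.
    sed -n 's/^n//p' | LC_ALL=C sort -u
}

# Usage: run on each server, with $pid set to that fileserver's pid,
# then compare the two lists on one machine:
#   lsof -p "$pid" -F n | open_names > kug.names
#   LC_ALL=C comm -3 kug.names good.names
```

comm -3 prints only the names unique to one side, so per-host details
such as partition files will show up alongside any genuinely missing
socket and need to be read with that in mind.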

Restarts and reboots don't help.

That's all I can think of.  Any ideas?  Thanks for any suggestions!  My
home directory is on this fileserver, so help will be appreciated extra!

        John