[OpenAFS] ptserver processes hanging

Dr A V Le Blanc <LeBlanc@mcc.ac.uk>
Mon, 28 Oct 2002 16:12:15 +0000


I've now got the problem that the ptserver process is hanging at times
on two of my DB servers; the process won't die even after kill -9,
but remains in the process table for several hours, and the other
machines are unable to form a quorum, so no changes can be made
to the pt database.

The two problem servers are SGI Origins with 180 MHz IP27 processors,
running IRIX 6.5 and openafs 1.2.7.  The third server, which is not
showing this problem, is an i386 Linux box with a 1.8 GHz processor,
running Debian woody and with openafs 1.2.7 as well.  The only way
I know to solve the problem is to reboot the server with the
hanging ptserver, which I'm usually reluctant to do, since
the salvaging at boot time usually takes about 40 minutes on
these machines, even after a clean shutdown.  (After a power failure,
salvaging sometimes takes about 4 hours.)

The SGI machines are rock and ice, and the Linux one is snow.
Currently ice's ptserver is hung.  Below are the results of udebug
to the two running ptservers.  Note that each server is voting for
itself as lowest host, which is why no quorum results.  The funny
times for last vote and last beacon also seem to be parts of the
problem.

     -- Owen
     LeBlanc@mcc.ac.uk

'udebug rock 7002 -long' returns:

Host's addresses are: 130.88.203.11 
Host's 130.88.203.11 time is Mon Oct 28 16:08:31 2002
Local time is Mon Oct 28 16:08:31 2002 (time differential 0 secs)
Last yes vote for 130.88.203.11 was 9 secs ago (not sync site); 
Last vote started 9 secs ago (at Mon Oct 28 16:08:22 2002)
Local db version is 1035451571.4
I am not sync site
Lowest host 130.88.203.11 was set 5 secs ago
Sync host 0.0.0.0 was set 32323 secs ago
Sync site's db version is 1035451571.4
0 locked pages, 0 of them for write
Last time a new db version was labelled was:
	 369740 secs ago (at Thu Oct 24 10:26:11 2002)

Server (130.88.203.12): (db 1035451571.4)
    last vote rcvd 32369 secs ago (at Mon Oct 28 07:09:02 2002),
    last beacon sent 32338 secs ago (at Mon Oct 28 07:09:33 2002), last vote was yes
    dbcurrent=1, up=0 beaconSince=0

Server (130.88.203.13): (db 1035451571.4)
    last vote rcvd 32353 secs ago (at Mon Oct 28 07:09:18 2002),
    last beacon sent 10 secs ago (at Mon Oct 28 16:08:21 2002), last vote was yes
    dbcurrent=1, up=0 beaconSince=0

and 'udebug scree 7002 -long' returns:

Host's addresses are: 130.88.203.13 
Host's 130.88.203.13 time is Mon Oct 28 16:08:42 2002
Local time is Mon Oct 28 16:08:42 2002 (time differential 0 secs)
Last yes vote for 130.88.203.13 was 1 secs ago (not sync site); 
Last vote started 1 secs ago (at Mon Oct 28 16:08:41 2002)
Local db version is 1035451571.4
I am not sync site
Lowest host 130.88.203.13 was set 1 secs ago
Sync host 0.0.0.0 was set 32364 secs ago
Sync site's db version is 1035451571.4
0 locked pages, 0 of them for write

Server (130.88.203.12): (db 0.0)
    last vote rcvd 441604 secs ago (at Wed Oct 23 14:28:38 2002),
    last beacon sent 32268 secs ago (at Mon Oct 28 07:10:54 2002), last vote was no
    dbcurrent=0, up=0 beaconSince=0

Server (130.88.203.11): (db 0.0)
    last vote rcvd 1 secs ago (at Mon Oct 28 16:08:41 2002),
    last beacon sent 1 secs ago (at Mon Oct 28 16:08:41 2002), last vote was no
    dbcurrent=0, up=1 beaconSince=1
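
The condition noted above (each server naming itself as lowest host
while the sync host stays at 0.0.0.0) can be checked mechanically.
What follows is only a rough sketch in Python, not part of OpenAFS:
the function name summarize_udebug and the SAMPLE text are made up
for illustration, and the patterns assume udebug output laid out
exactly as in this message, which other versions may not match.

```python
import re

# Trimmed copy of the 'udebug rock 7002 -long' output quoted above.
SAMPLE = """\
Host's addresses are: 130.88.203.11
Lowest host 130.88.203.11 was set 5 secs ago
Sync host 0.0.0.0 was set 32323 secs ago

Server (130.88.203.12): (db 1035451571.4)
    dbcurrent=1, up=0 beaconSince=0

Server (130.88.203.13): (db 1035451571.4)
    dbcurrent=1, up=0 beaconSince=0
"""


def summarize_udebug(text):
    """Pull a few quorum-related fields out of udebug -long output.

    Hypothetical helper: it assumes the line formats shown in this
    message and returns None for any field it cannot find.
    """
    def find(pattern):
        m = re.search(pattern, text, re.MULTILINE)
        return m.group(1) if m else None

    addrs = find(r"^Host's addresses are:\s*(.+?)\s*$")
    own = addrs.split() if addrs else []
    lowest = find(r"^Lowest host (\S+) was set")
    sync = find(r"^Sync host (\S+) was set")
    # One "up=" flag appears per remote server entry.
    peers_up = sum(int(v) for v in re.findall(r"\bup=(\d)", text))

    return {
        "own_addresses": own,
        "lowest_host": lowest,
        "sync_host": sync,
        # True when this server believes it is itself the lowest host.
        "votes_for_self": lowest in own,
        # 0.0.0.0 means no sync site is currently known.
        "has_sync_site": sync not in (None, "0.0.0.0"),
        "peers_up": peers_up,
    }


if __name__ == "__main__":
    print(summarize_udebug(SAMPLE))
```

Running this against the saved output of every DB server and comparing
the lowest_host values would show at a glance whether the servers
disagree, as rock and the Linux box do here.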