[OpenAFS] ptserver processes hanging

Hartmut Reuter reuter@rzg.mpg.de
Mon, 28 Oct 2002 18:00:09 +0100


Dr A V Le Blanc wrote:
> I've now got the problem that the ptserver process is hanging at times
> on two of my DB servers; the process won't die even after kill -9,
> but remains in the process table for several hours, and the other
> machines are unable to form a quorum, so no changes can be made
> to the pt database.

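To me this looks like a problem with the file system that holds
/usr/afs/db. The ptserver is pure userland code, so a process that
survives kill -9 is most likely blocked inside the kernel, typically
waiting on I/O.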

I would separate the database servers from the fileservers anyway
(keeping the IP addresses for the database servers, of course).

> 
> The two problem servers are SGI Origins with 180 MHz IP27 processors,
> running IRIX 6.5 and OpenAFS 1.2.7.  The third server, which is not
> showing this problem, is an i386 Linux box with a 1.8 GHz processor,
> running Debian woody and with OpenAFS 1.2.7 as well.  The only way
> I know to solve the problem is to reboot the server with the
> hanging ptserver, which I'm usually reluctant to do, since
> the salvaging at boot time usually takes about 40 minutes on
> these machines, even after a clean shutdown.  (After a power failure,
> salvaging sometimes takes about 4 hours.)

You should configure the build of your fileservers with
"--enable-fast-restart". This skips the salvage at startup and lets
your fileservers come back immediately. If you really have a damaged
volume, it will probably go off-line by itself and you can salvage it
later without shutting down the fileserver. We have been doing this
for years without any bad experience.

Hartmut

> 
> The SGI machines are rock and ice, and the Linux one is snow.
> Currently ice's ptserver is hung.  Below are the results of udebug
> to the two running ptservers.  Note that each server is voting for
> itself as lowest host, which is why no quorum results.  The funny
> times for last vote and last beacon also seem to be parts of the
> problem.
> 
>      -- Owen
>      LeBlanc@mcc.ac.uk
> 
> 'udebug rock 7002 -long' returns:
> 
> Host's addresses are: 130.88.203.11 
> Host's 130.88.203.11 time is Mon Oct 28 16:08:31 2002
> Local time is Mon Oct 28 16:08:31 2002 (time differential 0 secs)
> Last yes vote for 130.88.203.11 was 9 secs ago (not sync site); 
> Last vote started 9 secs ago (at Mon Oct 28 16:08:22 2002)
> Local db version is 1035451571.4
> I am not sync site
> Lowest host 130.88.203.11 was set 5 secs ago
> Sync host 0.0.0.0 was set 32323 secs ago
> Sync site's db version is 1035451571.4
> 0 locked pages, 0 of them for write
> Last time a new db version was labelled was:
> 	 369740 secs ago (at Thu Oct 24 10:26:11 2002)
> 
> Server (130.88.203.12): (db 1035451571.4)
>     last vote rcvd 32369 secs ago (at Mon Oct 28 07:09:02 2002),
>     last beacon sent 32338 secs ago (at Mon Oct 28 07:09:33 2002), last vote was yes
>     dbcurrent=1, up=0 beaconSince=0
> 
> Server (130.88.203.13): (db 1035451571.4)
>     last vote rcvd 32353 secs ago (at Mon Oct 28 07:09:18 2002),
>     last beacon sent 10 secs ago (at Mon Oct 28 16:08:21 2002), last vote was yes
>     dbcurrent=1, up=0 beaconSince=0
> 
> and 'udebug scree 7002 -long' returns:
> 
> Host's addresses are: 130.88.203.13 
> Host's 130.88.203.13 time is Mon Oct 28 16:08:42 2002
> Local time is Mon Oct 28 16:08:42 2002 (time differential 0 secs)
> Last yes vote for 130.88.203.13 was 1 secs ago (not sync site); 
> Last vote started 1 secs ago (at Mon Oct 28 16:08:41 2002)
> Local db version is 1035451571.4
> I am not sync site
> Lowest host 130.88.203.13 was set 1 secs ago
> Sync host 0.0.0.0 was set 32364 secs ago
> Sync site's db version is 1035451571.4
> 0 locked pages, 0 of them for write
> 
> Server (130.88.203.12): (db 0.0)
>     last vote rcvd 441604 secs ago (at Wed Oct 23 14:28:38 2002),
>     last beacon sent 32268 secs ago (at Mon Oct 28 07:10:54 2002), last vote was no
>     dbcurrent=0, up=0 beaconSince=0
> 
> Server (130.88.203.11): (db 0.0)
>     last vote rcvd 1 secs ago (at Mon Oct 28 16:08:41 2002),
>     last beacon sent 1 secs ago (at Mon Oct 28 16:08:41 2002), last vote was no
>     dbcurrent=0, up=1 beaconSince=1
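
The symptoms described above (no sync site, each server naming itself
as lowest host, peers marked up=0) can be picked out of udebug output
mechanically. What follows is only a minimal sketch, not part of
OpenAFS: the names parse_udebug and diagnose are hypothetical, the
sample strings are trimmed copies of the output quoted in this thread,
and the "I am sync site" match for a healthy server is an assumption,
since no such line appears here.

```python
import re

# Trimmed copies of the udebug output quoted in this thread.
ROCK_SAMPLE = """\
> Host's addresses are: 130.88.203.11
> I am not sync site
> Lowest host 130.88.203.11 was set 5 secs ago
> Sync host 0.0.0.0 was set 32323 secs ago
> Server (130.88.203.12): (db 1035451571.4)
>     dbcurrent=1, up=0 beaconSince=0
> Server (130.88.203.13): (db 1035451571.4)
>     dbcurrent=1, up=0 beaconSince=0
"""

PEER_SAMPLE = """\
> Host's addresses are: 130.88.203.13
> I am not sync site
> Lowest host 130.88.203.13 was set 1 secs ago
> Sync host 0.0.0.0 was set 32364 secs ago
> Server (130.88.203.12): (db 0.0)
>     dbcurrent=0, up=0 beaconSince=0
> Server (130.88.203.11): (db 0.0)
>     dbcurrent=0, up=1 beaconSince=1
"""


def parse_udebug(text):
    """Parse the few udebug fields that matter for a quorum check."""
    report = {"address": None, "is_sync_site": None,
              "lowest_host": None, "sync_host": None, "peers": []}
    peer = None
    for line in text.splitlines():
        # Drop mail quoting ("> ") so quoted output can be pasted as-is.
        s = re.sub(r"^>\s?", "", line).strip()
        if m := re.match(r"Host's addresses are:\s*(\S+)", s):
            report["address"] = m.group(1)
        elif s.startswith("I am not sync site"):
            report["is_sync_site"] = False
        elif s.startswith("I am sync site"):  # assumed wording
            report["is_sync_site"] = True
        elif m := re.match(r"Lowest host (\S+) was set", s):
            report["lowest_host"] = m.group(1)
        elif m := re.match(r"Sync host (\S+) was set", s):
            report["sync_host"] = m.group(1)
        elif m := re.match(r"Server \((\S+?)\):", s):
            peer = {"address": m.group(1), "up": None}
            report["peers"].append(peer)
        elif peer is not None and (m := re.search(r"\bup=(\d)", s)):
            peer["up"] = m.group(1) == "1"
    return report


def diagnose(reports):
    """Return human-readable findings for a list of parsed reports."""
    problems = []
    if not any(r["is_sync_site"] for r in reports):
        problems.append("no server claims to be sync site")
    lowest = {r["lowest_host"] for r in reports}
    if len(lowest) > 1:
        problems.append("servers disagree on lowest host: "
                        + ", ".join(sorted(lowest)))
    for r in reports:
        down = [p["address"] for p in r["peers"] if p["up"] is False]
        if down:
            problems.append(f"{r['address']} sees peers down: "
                            + ", ".join(down))
    return problems


if __name__ == "__main__":
    reports = [parse_udebug(ROCK_SAMPLE), parse_udebug(PEER_SAMPLE)]
    for finding in diagnose(reports):
        print(finding)
```

In practice the input would be the captured text of 'udebug <host>
7002 -long' for each database server rather than the embedded samples.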


-- 
-----------------------------------------------------------------
Hartmut Reuter                           e-mail reuter@rzg.mpg.de
					   phone +49-89-3299-1328
RZG (Rechenzentrum Garching)               fax   +49-89-3299-1301
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)
-----------------------------------------------------------------