[OpenAFS-devel] fileserver / volserver no longer speak to each other - 1.4.1-rc1

Rainer Toebbicke rtb@pclella.cern.ch
Mon, 19 Dec 2005 12:17:10 +0100


We've got the following curious problem on our Solaris fileservers 
running 1.4.1-rc1 (not on Linux, there it works ok!):

After a 'bos restart xxx fs', the fileserver and volserver are sometimes 
unwilling to speak to each other, which subsequently causes problems, 
e.g. when creating volumes.

And an 'lsof' shows for the inter-process TCP connection:

fileserve 15565 root    7u  IPv4 0x300034d7e40      0t0    TCP localhost:2040 (LISTEN)

volserver 15566 root    3u  IPv4 0x300034d76c0      0t0    TCP localhost:33429->localhost:2040 (CLOSE_WAIT)


Killing the volserver solves the problem; once bosserver restarts it, 
it connects ok:

volserver 15596 root    3u  IPv4 0x30004c2cdc8      0t0    TCP localhost:33435->localhost:2040 (ESTABLISHED)


This never happened under 1.2.x or under the 1.3.7x releases we tested 
intensively. VolserLog is silent about this, and there is no hint in 
FileLog either. The problem exists for both tvolserver and lwp-volserver.

I did not see any suspicious change in fssync.c; the code in there looks 
clean. Could this be timing-related (if desperate I would hack a sleep 
into FSYNC_clientInit()), or a Solaris (5.8) problem?
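
To make the "hack a sleep" idea concrete, here is a minimal sketch of a 
connect-with-retry loop against a loopback TCP port such as the 2040 seen 
in the lsof output. This is not the fssync.c code: connect_with_retry() 
and its parameters are hypothetical names made up for illustration, and it 
only covers the case where the initial connect races with the fileserver 
starting to listen. It would not by itself repair a connection that is 
already sitting in CLOSE_WAIT.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not OpenAFS code: try to connect to a TCP port on
 * the loopback interface, sleeping between failed attempts. Returns a
 * connected socket descriptor, or -1 if every attempt failed. */
static int
connect_with_retry(unsigned short port, int attempts,
                   unsigned int delay_seconds)
{
    struct sockaddr_in addr;
    int i;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    for (i = 0; i < attempts; i++) {
        /* A socket whose connect() failed should not be reused,
         * so create a fresh one for every attempt. */
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
            return fd;
        close(fd);
        if (i + 1 < attempts)
            sleep(delay_seconds);   /* give the listener time to come up */
    }
    return -1;
}
```

A call such as connect_with_retry(2040, 5, 1) would try for roughly four 
seconds before giving up; the attempt count and delay are arbitrary 
values chosen for the example.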

Anybody else seen this?

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985       Fax: +41 22 767 7155