[OpenAFS-devel] fileserver / volserver no longer speak to each other - 1.4.1-rc1
Rainer Toebbicke
rtb@pclella.cern.ch
Mon, 19 Dec 2005 12:17:10 +0100
We've got the following curious problem on our Solaris fileservers
running 1.4.1-rc1 (not on Linux, there it works ok!):
After a 'bos restart xxx fs', the fileserver and volserver are sometimes
unwilling to speak to each other, which subsequently causes problems,
e.g. when creating volumes.
An 'lsof' then shows the following for the inter-process TCP connection:
fileserve 15565 root 7u IPv4 0x300034d7e40 0t0 TCP localhost:2040 (LISTEN)
volserver 15566 root 3u IPv4 0x300034d76c0 0t0 TCP localhost:33429->localhost:2040 (CLOSE_WAIT)
Killing the volserver solves the problem: bosserver restarts it, and
the new volserver then connects OK:
volserver 15596 root 3u IPv4 0x30004c2cdc8 0t0 TCP localhost:33435->localhost:2040 (ESTABLISHED)
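For reference, CLOSE_WAIT on the volserver side means that the other end of the connection has already closed, while the volserver still holds its socket open. A rough sketch of how a client could notice this on an idle stream socket follows. It is not OpenAFS code: peer_has_closed() is a hypothetical helper, and it assumes the descriptor is a connected stream socket.

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of OpenAFS).
 * Returns 1 if the peer has closed its end of the connection,
 * 0 if the connection still looks open, -1 on error. */
static int peer_has_closed(int fd)
{
    struct pollfd pfd;
    char byte;
    ssize_t n;
    int rc;

    assert(fd >= 0);

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Timeout 0: only report what is pending right now. */
    rc = poll(&pfd, 1, 0);
    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;               /* nothing pending, so no EOF seen yet */

    /* Something is pending: peek so that real data is not consumed. */
    n = recv(fd, &byte, 1, MSG_PEEK);
    if (n == 0)
        return 1;               /* orderly shutdown by the peer */
    if (n < 0)
        return -1;
    return 0;                   /* ordinary data is waiting */
}
```

A socket for which this returns 1 would show up in CLOSE_WAIT until the local process calls close() on it.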
This never happened under 1.2.x, nor under the 1.3.7x releases we tested
intensively. VolserLog is silent about it, and there is no hint in FileLog
either. The problem exists for both tvolserver and the LWP volserver.
I did not see any suspicious change in fssync.c, and the code in there
looks clean. Could this be timing-related (if desperate I would hack a
sleep into FSYNC_clientInit()), or a Solaris (5.8) problem?
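Instead of a fixed sleep, the timing experiment could be done as a bounded retry around the connect. The following is only a sketch under assumptions, not the actual FSYNC_clientInit() code: connect_with_retry() and its parameters are hypothetical, and it assumes a plain TCP connection to the loopback address, as the lsof output suggests (port 2040 there).

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not part of OpenAFS).
 * Tries up to max_tries times to connect to 127.0.0.1:port, sleeping
 * delay_ms milliseconds between attempts.
 * Returns a connected socket descriptor, or -1 if every attempt failed. */
static int connect_with_retry(unsigned short port, int max_tries, long delay_ms)
{
    struct sockaddr_in addr;
    int attempt;

    assert(max_tries > 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    for (attempt = 0; attempt < max_tries; attempt++) {
        /* Use a fresh socket for every attempt. */
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
            return fd;
        close(fd);

        /* Sleep before the next attempt, but not after the last one. */
        if (attempt + 1 < max_tries) {
            struct timespec ts;
            ts.tv_sec = delay_ms / 1000;
            ts.tv_nsec = (delay_ms % 1000) * 1000000L;
            nanosleep(&ts, NULL);
        }
    }
    return -1;
}
```

Something like connect_with_retry(2040, 5, 200) would only help if the initial connect fails outright. It would not by itself cure a connection that is established and then closed by the other side, which is what the CLOSE_WAIT state points to.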
Anybody else seen this?
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985 Fax: +41 22 767 7155