[OpenAFS-devel] 1.3.X: Problem with many connections exhausting resources?

Harald Barth haba@pdc.kth.se
Mon, 30 May 2005 12:59:22 +0200 (MEST)


> > 1.3.77 has rx.c 1.58.2.4 and rx-makecall-race-fix-20050518 was introduced
> > with 1.58.2.18.
> 
> DELTA STABLE12-rx-makecall-race-fix-20050518 fixes a bug that has
> existed since the beginning of time.

I believe that. But rx-makecall-race-fix-20050518 cannot be the _cause_
of the "many connections" behaviour because it was introduced after
1.3.77. Right?

> Last October/November we fixed a large number of bugs related to
> rx connection objects being mismanaged due to an inability of
> applications which use rx to reference count the rx_connection
> objects.   This meant that although internally the rx library was
> thread safe, a single rx_connection object could not safely be used
> by multiple threads in the application.   In 1.3.72 we added the
> ability to reference count the rx_connection objects and in turn
> removed the premature destruction of the objects while in use that
> we had been seeing.
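
To make sure I understand the reference counting idea described above,
here is a minimal sketch of it. This is not the rx code: struct conn,
conn_new, conn_hold and conn_release are hypothetical names invented for
illustration, and a plain pthread mutex is assumed for locking. The point
is only that the object is freed by whoever drops the last reference, so
one thread cannot destroy a connection another thread is still using.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a connection object; NOT struct rx_connection. */
struct conn {
    pthread_mutex_t lock;
    int refcount;               /* number of holders, including the creator */
};

/* Create a connection that starts with one reference (the creator's). */
struct conn *conn_new(void)
{
    struct conn *c = calloc(1, sizeof(*c));
    if (c == NULL)
        return NULL;
    pthread_mutex_init(&c->lock, NULL);
    c->refcount = 1;
    return c;
}

/* Take an extra reference before handing the object to another thread. */
void conn_hold(struct conn *c)
{
    pthread_mutex_lock(&c->lock);
    assert(c->refcount > 0);    /* holding an already-freed object is a bug */
    c->refcount++;
    pthread_mutex_unlock(&c->lock);
}

/* Drop a reference. Returns 1 if this was the last one and the object
 * was freed, 0 if other holders remain. */
int conn_release(struct conn *c)
{
    int last;

    pthread_mutex_lock(&c->lock);
    assert(c->refcount > 0);
    last = (--c->refcount == 0);
    pthread_mutex_unlock(&c->lock);

    if (last) {
        pthread_mutex_destroy(&c->lock);
        free(c);
    }
    return last;
}
```

With this scheme a thread that only borrows the connection calls
conn_hold before use and conn_release afterwards; the object goes away
only when the final conn_release returns 1.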

> We have also seen a huge number of connections being produced by the
> Windows clients.  The Windows clients were creating connections,
> using them once, and then destroying them.   Most of these Windows
> client bugs were fixed in 1.3.72 and one more was fixed in 1.3.80.

Essentially, I need some way to track connections being created and
connections being moved from one state to another (which states
are there, and what do they mean?). On our calculation boxes
the usage pattern is quite deterministic: node empty -> single
user logs in with uid-token -> user runs script -> user logs
out -> cleaner runs and does some rm -> start over. Cycle time
is between minutes and days. Never two users at the same time.
As it is natural that a new connection is created for a new
PAG, my guess is that the problem is in the GC of the hash.
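
To illustrate what I mean by tracking, here is a rough sketch of a hash
table whose entries are first only marked as destroyed and later removed
by a separate reaping pass, with a counter per state. Everything in it is
an assumption made for illustration: the two states, the names ctable_add,
ctable_mark_destroyed, ctable_count and ctable_reap, and the bucket count
are invented, and the real rx connection hash may work quite differently.

```c
#include <stddef.h>
#include <stdlib.h>

#define NBUCKETS 8              /* arbitrary size for the sketch */

/* Hypothetical states; the real rx flags and their meanings may differ. */
enum cstate { C_ACTIVE, C_DESTROYED };

struct centry {
    unsigned id;
    enum cstate state;
    struct centry *next;
};

struct ctable {
    struct centry *bucket[NBUCKETS];
};

/* Insert a new active entry. Returns 0 on success, -1 if out of memory. */
int ctable_add(struct ctable *t, unsigned id)
{
    struct centry *e = malloc(sizeof(*e));
    if (e == NULL)
        return -1;
    e->id = id;
    e->state = C_ACTIVE;
    e->next = t->bucket[id % NBUCKETS];
    t->bucket[id % NBUCKETS] = e;
    return 0;
}

/* Mark an entry as destroyed; it stays in the table until reaped.
 * Returns 1 if the id was found, 0 otherwise. */
int ctable_mark_destroyed(struct ctable *t, unsigned id)
{
    struct centry *e;
    for (e = t->bucket[id % NBUCKETS]; e != NULL; e = e->next) {
        if (e->id == id) {
            e->state = C_DESTROYED;
            return 1;
        }
    }
    return 0;
}

/* Count the entries currently in a given state. */
size_t ctable_count(const struct ctable *t, enum cstate s)
{
    size_t n = 0;
    for (int i = 0; i < NBUCKETS; i++)
        for (const struct centry *e = t->bucket[i]; e != NULL; e = e->next)
            if (e->state == s)
                n++;
    return n;
}

/* Unlink and free every destroyed entry. Returns how many were freed. */
size_t ctable_reap(struct ctable *t)
{
    size_t freed = 0;
    for (int i = 0; i < NBUCKETS; i++) {
        struct centry **pp = &t->bucket[i];
        while (*pp != NULL) {
            struct centry *e = *pp;
            if (e->state == C_DESTROYED) {
                *pp = e->next;  /* unlink, then free */
                free(e);
                freed++;
            } else {
                pp = &e->next;
            }
        }
    }
    return freed;
}
```

If the reaping pass runs rarely or not at all, ctable_count for the
destroyed state just keeps growing, which is the kind of number I would
like to be able to read out of a running client.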

> Having multiple connections from a client is not necessarily a problem.
> If the connection properties are different, there will need to be
> separate connections.   It is also possible that you will have
> connections in the hash table that are marked as DESTROYED that have
> not been cleaned up yet. 

"Not cleaned up yet" as in minutes, hours or days?

>  I'm not sure that your script produces the
> results you are expecting.

Me neither, but 6073 connections on a box that is used by one user at
a time seems a bit much.

> Do you have any other information you can provide on your server down
> problem?

Damn little. And the 'server down' goes away if you wait long enough. I
don't know how long "long enough" is, but it is not hours, rather days.

Harald.