[OpenAFS] Re: Need volume state / fileserver / salvage knowledge

Mon, 31 Jan 2011 12:17:04 -0500 (EST)

On Mon, 31 Jan 2011, Steve Simmons wrote:

> We have seen similar issues. It occurs when there is a given vice 
> partition where lots of clients have registered callbacks but those 
> clients are no longer accessible. Not all the clients have responded when 
> the 1800 second timer goes off, and the fileserver goes down uncleanly.
>
> We have about 235,000 volumes spread across 40 vice partitions. Our 'fix' 
> is a combination of lengthening that timeout to 3600 seconds and
> keeping our vice partitions no larger than 2TB. Active volumes are
> spread roughly equally across those 40 partitions. But that's just a
> stopgap; the longer a server stays up, the more likely it accumulates 
> dead callbacks.

Assuming this is true, isn't this a good argument to keep the weekly server 
process restarts?

Cheers,
Stephen