[OpenAFS] fileserver crashes

Derrick J Brashear shadow@dementia.org
Wed, 13 Oct 2004 11:12:58 -0400 (EDT)


On Wed, 13 Oct 2004, Jeffrey Hutzelman wrote:

>> We have at various times gotten problems with read-only replicas that
>> are oddly truncated.  This might or might not be the consequence of the
>> previous problem.
>
> Hm.  That sounds familiar, but I thought that bug was fixed some time ago.
> In fact, Derrick confirms that the fix is in 1.2.11

The fix was for the top inode. It's conceivable some bug affects other 
inodes.

>> Another probably completely different problem we have concerns volumes
>> with really small volume IDs.  Modern AFS software creates large 10
>> digit volume IDs.  But we have volumes that were created long before
>> AFS 3.1, with small 3 digit volume IDs.  Those volumes are rapidly
>> disappearing as one by one, during various restarts, the fileserver and
>> salvager proceed to discard all the data, then the volume header.
>
> That's... bizarre.  I've never heard of such a thing, but then, we don't
> have any Linux fileservers in our cell.  I understand the Andrew cell was
> seeing this for a while, but it went away without anyone successfully
> debugging it.

It may have recurred once recently, but we can't cause it to happen on 
demand, so debugging it has proven almost impossible.

> The last problem you describe sounds suspiciously like something Derrick
> has been trying to track down for the last 2 or 3 weeks.  I'll leave that
> to him, since he has a better idea than I of the current status of that.

We're still seeing a problem, but ours involves callback rxcon peers being 
garbage collected while there are still references to those peers in 
conns. It looks like you have a problem with a connection being 
garbage collected while something has references to it. As it happens, in 
the process of trying to fix my problem, we found and fixed several of 
those. If you want to try a patch (which may move your crashes elsewhere, 
but should not increase the frequency of crashing, and may decrease it) 
let me know.
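
For anyone following along, here is a minimal sketch of the class of bug described above: an object reaped while something still points at it. Every name here (struct peer, struct conn, peer_reap, and so on) is hypothetical and chosen for illustration. This is not the actual rx code and not the patch mentioned above. It only assumes that each connection holds a counted reference on its peer and that the collector must skip any peer whose count is nonzero.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the real rx structures. */
struct peer {
    int ref_count;       /* number of connections pointing at this peer */
    struct peer *next;   /* linkage in the global peer list */
};

struct conn {
    struct peer *peer;   /* each connection holds one reference on its peer */
};

struct peer *peer_list = NULL;

/* Allocate a peer with no references and link it into the list. */
struct peer *peer_new(void)
{
    struct peer *p = calloc(1, sizeof(*p));
    if (p == NULL)
        return NULL;
    p->next = peer_list;
    peer_list = p;
    return p;
}

/* Create a connection and take a reference on its peer. */
struct conn *conn_new(struct peer *p)
{
    struct conn *c = malloc(sizeof(*c));
    if (c == NULL)
        return NULL;
    p->ref_count++;
    c->peer = p;
    return c;
}

/* Destroy a connection and drop its reference on the peer. */
void conn_destroy(struct conn *c)
{
    assert(c->peer->ref_count > 0);
    c->peer->ref_count--;
    free(c);
}

/*
 * Reap unreferenced peers and return how many were freed.
 * Freeing a peer whose ref_count is nonzero would leave a
 * connection holding a dangling pointer, which is the failure
 * mode discussed above.
 */
int peer_reap(void)
{
    int freed = 0;
    struct peer **pp = &peer_list;

    while (*pp != NULL) {
        struct peer *p = *pp;
        if (p->ref_count == 0) {
            *pp = p->next;   /* unlink before freeing */
            free(p);
            freed++;
        } else {
            pp = &p->next;   /* still referenced; leave it alone */
        }
    }
    return freed;
}
```

In this sketch a peer survives peer_reap() for as long as any connection created with conn_new() has not been passed to conn_destroy(). A real multithreaded server would also need locking around the count and the list, which is omitted here.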