[OpenAFS] Re: proper way to bring down a file server?

Derrick Brashear shadow@dementia.org
Thu, 24 Feb 2011 22:11:12 +0000



On Feb 24, 2011, at 4:44 PM, Andrew Deason <adeason@sinenomine.net> wrote:

> On Thu, 24 Feb 2011 08:40:13 +0100
> Derrick Brashear <shadow@dementia.org> wrote:
>
>>> Regardless of whatever bugs on the fileserver may be in play,
>>> clients should indeed issue a new query on VOFFLINE.
>>
>> bugs on the fileserver? how about 'none'?
>
> In this case, yeah, I assume so; but Jeff is correct that there have
> been bugs where VOFFLINE was reported when it should not have been. I'm
> just saying "even if that were not the case..."
>
>>> A VOFFLINE error can be the result of incorrect/stale volume
>>> location information (if a volume is offline on one server but
>>> online on another), and so the current information should be looked up
>>> when it occurs.
>>
>> huge waste of RPCs for a legitimate operating condition albeit an
>> undesirable one. you'll create a vldb storm if a 'popular' volume goes
>> offline.
>
> "Huge". Unix clients have been doing it ~forever, and the number of
> places I have ever heard of even noticing a vlserver load I can probably
> count on one hand.
>

It came up again in the historical research I just did, so it's fresh on my
mind how poor RPC semantics can exacerbate it. Huge? As much as N clients
versus none, within the smallest (timeout-wise) callback bucket, is huge.

>>>> In a similar vein, if the file server is inaccessible, the client
>>>> does not issue a new VLDB query.
>>>
>>> ...this is intentional? Why doesn't it? We could be contacting the
>>> wrong server because we have stale location information.
>>
>> We could. But that's basically true of any error, and if we run to
>> mommy on every error, eventually mommy can't handle us being so pesky
>> and melts down
>
> Some are still a lot more likely to have "stale location" to be the
> cause than others. The probability of RX_CALL_DEAD being so I suppose is
> rather small, as it only happens in this "move and shutdown" scenario,
> and leaving the server on isn't too hard. Of course, such is not always
> under the control of the administrator, but eh.
>

We advertise 'leave it up 2 hours or suffer', or have previously and should
again. The horse is pointed at water. Drink already.