[OpenAFS] unresponsive clients after lowest ip database server went down

Jonathan Leung-Nilsson jnilsson@uci.edu
Fri, 28 Aug 2015 10:25:08 -0700


On Thu, Aug 27, 2015 at 5:06 PM, Benjamin Kaduk <kaduk@mit.edu> wrote:

>
> On Thu, 27 Aug 2015, Jonathan Leung-Nilsson wrote:
>
> > So I am mainly wondering if this is expected - if OpenAFS depends on having
> > its lowest IP address server online all the time - or if it's likely that
> > we have a configuration issue in our cell.
>
> The short answer is that clients are expected to continue functioning even
> if the lowest-IP db server is offline, the remaining N-1 are supposed to
> elect a new coordinator and read-write access resume within a couple
> election cycles;


Thank you for confirming. This is what I thought; I just wanted to check that
I wasn't crazy. Our 2 remaining DB servers did elect a new coordinator
among themselves, so that part worked.

> clients might experience full hangs or just inability to
> make database changes for a couple minutes as things recover.
>

"a couple minutes" would be bad enough, since we have websites using AFS as
their DocumentRoot, but in our case it took a little over an hour until the
incident was resolved (we replaced the network switch that the AFS db
server was behind) and clients appeared unresponsive the entire time.

> The long answer requires more research and discussion of edge cases such
> as network partitions, timeouts, and such, which I am not prepared to
> perform right now.


Yeah... that suggests this issue is specific to our setup and to this
particular failure. I'll see if I have time to try to reproduce it and figure
out why the clients were unresponsive. Most likely we will find alternative
ways to mitigate the impact of this kind of failure.

Best,
Jonathan
