[OpenAFS] unresponsive clients after lowest ip database server went down

Jonathan Leung-Nilsson jnilsson@uci.edu
Thu, 27 Aug 2015 13:35:12 -0700


Hello,

I have recovered from this situation already, but I was curious to hear if
others have tested or experienced this issue as well:

If the AFS database server with the lowest IP address goes offline while
two other database servers remain available, are clients and the remaining
servers supposed to handle that situation gracefully? We had an incident
where this happened (in our case, the database server was taken offline
because the switch died), and AFS access (simply "ls /afs/<cellname>/")
and vos commands then appeared to be unresponsive.

AFS database servers:
lowest IP address pt/vl server: CentOS 5.11 32-bit OpenAFS 1.6.11
secondary and tertiary pt/vl server: CentOS 6.6 64-bit OpenAFS 1.6.11

Clients (at least these were affected, among others I am sure):
CentOS 6.6 64-bit OpenAFS 1.6.1
CentOS 6.7 64-bit OpenAFS 1.6.9

Clients are all configured with -dynroot -fakestat-all and they have
identical CellServDB files listing our database servers in order from
lowest to highest IP.

I apologize that I do not have much in the way of debugging output... I
didn't think to run rxdebug on the client or a trace of the "ls" process.
We were in "emergency mode" trying to get the switch replaced to bring
services back online, but I was still surprised that AFS exhibited this trouble.
I will try to replicate this issue in a test cell in the near future...

So I am mainly wondering if this is expected - if OpenAFS depends on having
its lowest IP address server online all the time - or if it's likely that
we have a configuration issue in our cell. I set up our cell about 5 years
ago as a complete newbie to OpenAFS, and while I've gained a lot of
insights and experience since, I still don't understand all the nuances.

Thank you!
-- 
Jonathan Leung-Nilsson
Social Sciences Computing Services
University of California, Irvine
