[OpenAFS] Best way to debug "Lost contact with file server"

Ken Elkabany Ken@Elkabany.com
Fri, 24 Feb 2012 02:51:48 -0800


Hello,

We're running a cluster of OpenAFS machines: two servers (more coming soon)
and up to 500 read-heavy clients. Occasionally (roughly once every 50,000 or
more access attempts), a client temporarily logs the following error:

(from client syslog)
Feb 24 08:37:17 ip-10-90-189-162 kernel: [6181788.182444] afs: Lost contact with file server IPADDR in cell CELL (all multi-homed ip addresses down for the server)
Feb 24 08:37:33 ip-10-90-189-162 kernel: [6181805.056860] afs: file server IPADDR in cell CELL is back up (multi-homed address; other same-host interfaces may still be down)

During that 16-second span, the affected client cannot access AFS; other
clients are unaffected.

I don't see any messages in the OpenAFS server logs with matching timestamps.
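
To make it easier to line these events up against the server logs, here is a
rough sketch of a script that pairs each "Lost contact" message with the next
"is back up" message for the same server and cell, and reports how long the
gap was. It assumes each syslog entry is on a single line (the wrapping above
comes from the email) and that the bracketed kernel timestamp is in seconds;
the function name find_outages and the SAMPLE data are just for illustration.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Patterns for the two AFS client messages quoted above. The bracketed
# number is the kernel timestamp, assumed here to be in seconds.
DOWN_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] afs: Lost contact with file server "
    r"(?P<server>\S+) in cell (?P<cell>\S+)"
)
UP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] afs: file server "
    r"(?P<server>\S+) in cell (?P<cell>\S+) is back up"
)


def find_outages(lines: Iterable[str]) -> List[Tuple[str, str, float]]:
    """Return (server, cell, seconds) for each down/up pair found."""
    pending: Dict[Tuple[str, str], float] = {}
    outages: List[Tuple[str, str, float]] = []
    for line in lines:
        down = DOWN_RE.search(line)
        if down:
            key = (down.group("server"), down.group("cell"))
            # Keep the earliest "down" time if the message repeats.
            pending.setdefault(key, float(down.group("ts")))
            continue
        up = UP_RE.search(line)
        if up:
            key = (up.group("server"), up.group("cell"))
            start = pending.pop(key, None)
            if start is not None:
                outages.append((key[0], key[1], float(up.group("ts")) - start))
    return outages


# The two log entries from this message, unwrapped to one line each.
SAMPLE = [
    "Feb 24 08:37:17 ip-10-90-189-162 kernel: [6181788.182444] afs: Lost "
    "contact with file server IPADDR in cell CELL (all multi-homed ip "
    "addresses down for the server)",
    "Feb 24 08:37:33 ip-10-90-189-162 kernel: [6181805.056860] afs: file "
    "server IPADDR in cell CELL is back up (multi-homed address; other "
    "same-host interfaces may still be down)",
]

if __name__ == "__main__":
    for server, cell, seconds in find_outages(SAMPLE):
        print(f"{server} in {cell}: unreachable for {seconds:.1f} s")
```

Running this over the syslog from every client would at least show whether
the outages cluster on one server or at particular times. The kernel
timestamps are relative to boot, so the syslog wall-clock time is still
needed when comparing against the server side.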

Currently, the servers are running 1.4.14 (will be upgraded to 1.6 soon) on
Ubuntu 10.04. The clients are running 1.6.0 on Ubuntu 11.10. The clients
are not human users, but processes that are constantly pulling data from
AFS.

What tools do I have at my disposal to debug this issue? What is the
recommended approach to take?

Off-topic question: If a volume has N read replicas, how do clients choose
which one to use?

Best,
Ken
