[OpenAFS] OpenAFS and Xen weirdnesses: regular loss of AFS server connectivity

Chris Kurtz chk@mars.asu.edu
Fri, 30 Jan 2009 10:35:02 -0700


Specs:

CentOS 5.2 dom0 and domU, both running kernel 2.6.18-53.1.6.el5xen
OpenAFS 1.4.4, rebuilt (for our defaults) from OpenAFS's RPM
domU has a 45 GB AFS cache volume mounted from the dom0

The domU is a web server running lighttpd, Drupal, and a fair amount of
custom Python. The dom0 does nothing but run Xen guests. The machines are
new (dual quad-core 2 GHz Xeons, 32 GB RAM on the dom0, 8 GB in the domU).
This happens on multiple identical machines (cloned from a common source).

Frequently (sometimes as often as every few minutes), we lose contact
with any AFS server that we're putting significant load on, for a
couple of minutes at a time:

Jan 30 10:28:48 www4 kernel: afs: Lost contact with volume location
server 149.169.146.57 in cell mars.asu.edu
Jan 30 10:30:03 www4 kernel: afs: volume location server 149.169.146.57
in cell mars.asu.edu is back up
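
To put numbers on "a couple of minutes at a time", the paired messages
can be turned into outage durations. This is only a rough sketch: the
function and pattern names are made up for illustration, the patterns are
modeled on the two messages quoted above (other AFS client messages may be
worded differently), each message is assumed to sit on a single line in the
real log file, and the year has to be supplied because syslog timestamps
do not carry one.

```python
import re
from datetime import datetime

# Patterns modeled only on the two kernel messages quoted in this post.
IP = r"(?P<ip>\d+\.\d+\.\d+\.\d+)"
LOST = re.compile(r"afs: Lost contact with .*?server " + IP)
BACK = re.compile(r"afs: .*?server " + IP + r" .*is back up")


def parse_ts(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' syslog stamp; year is assumed."""
    return datetime.strptime("%d %s" % (year, line[:15]), "%Y %b %d %H:%M:%S")


def outages(lines, year=2009):
    """Return (server_ip, start_time, seconds_down) for each lost/back pair."""
    down = {}      # server ip -> time contact was first lost
    result = []
    for line in lines:
        m = LOST.search(line)
        if m:
            # Keep the earliest "lost" time if the message repeats.
            down.setdefault(m.group("ip"), parse_ts(line, year))
            continue
        m = BACK.search(line)
        if m and m.group("ip") in down:
            start = down.pop(m.group("ip"))
            seconds = (parse_ts(line, year) - start).total_seconds()
            result.append((m.group("ip"), start, seconds))
    return result
```

Feeding it the lines of the domU's syslog file and comparing the start
times against the lighttpd error timestamps should show whether every
FastCGI timeout falls inside one of these windows.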

I see corresponding errors in lighttpd's log:

2009-01-30 10:28:56: (mod_fastcgi.c.2618) FastCGI-stderr: Traceback
(most recent call last):
IOError: [Errno 110] Connection timed out:
'/afs/mars.asu.edu/themis-data/pds/browse/i267xx/I26712018.png'

It isn't isolated to a single AFS server; any of the servers in the cell
can trigger the behavior.

Ideas?

...Chris

--
Chris Kurtz, chk@mars.asu.edu
Systems Manager
Mars Space Flight Facility
Arizona State University