[OpenAFS] errors in afs when multiple tasks are running

Jeffrey Altman jaltman@secure-endpoints.com
Fri, 8 Jul 2011 18:56:49 -0400


I would be interested in seeing a tcpdump of traffic between the affected
client and the file server covering the few minutes before the build failure
until the client marks the server up again.

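One way such a capture could be started is sketched below. This is only an illustration: the helper name `build_capture_command`, the output file name, the `any` interface and the restriction to UDP ports 7000-7009 (the range conventionally used by AFS services) are assumptions, not details from the report. The server address is the one that appears in the quoted log.

```python
import subprocess


def build_capture_command(server_ip, outfile="afs-trace.pcap", interface="any"):
    """Return a tcpdump argv list that records traffic to/from one file server."""
    # Assumption: AFS Rx traffic of interest is UDP on ports 7000-7009.
    capture_filter = f"host {server_ip} and udp portrange 7000-7009"
    return [
        "tcpdump",
        "-n",             # do not resolve addresses to names
        "-i", interface,  # capture interface (assumed; adjust as needed)
        "-s", "0",        # capture full packets, not just headers
        "-w", outfile,    # write raw packets to a file for later analysis
        capture_filter,
    ]


if __name__ == "__main__":
    cmd = build_capture_command("192.168.15.33")
    print(" ".join(cmd))
    # Uncomment to start capturing (normally needs root). Stop it with
    # Ctrl-C once the client logs that the file server is back up.
    # subprocess.run(cmd, check=True)
```

The capture would have to be left running on the client ahead of time, since the failures occur at unpredictable moments.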
-----Original Message-----
From: Mark Henry
Sent: Friday, July 08, 2011 5:46 PM
To: openafs-info@openafs.org
Subject: [OpenAFS] errors in afs when multiple tasks are running


We have been getting periodic build failures when building in afs.  Here is
/var/log/messages at the time of the failure:

Jul  6 20:58:01 hostname /usr/sbin/cron[10555]:
pam_krb5_compiled(crond:account): pam_sm_acct_mgmt: exit (ignore)
Jul  6 20:58:01 hostname /usr/sbin/cron[10556]: (userid1) CMD (${K5S_USERID1}
-- /bin/sh -c "/a/p/cpui/build/dir/check_for_bld_request.sh >
/a/p/cpui/build/dir/hostname_check.out 2>&1" >> /home/userid1/k5start.out 2>&1)
Jul  6 20:58:02 hostname kernel: afs: Lost contact with file server
192.168.15.33 in cell cellname.com (all multi-homed ip addresses down for the
server)
Jul  6 20:58:02 hostname kernel: afs: Lost contact with file server
192.168.15.33 in cell cellname.com (all multi-homed ip addresses down for the
server)
Jul  6 20:58:04 hostname kernel: afs: Tokens for user of AFS id -1 for cell
cellname.com have expired
Jul  6 20:58:07 hostname kernel: afs: failed to store file (110)
Jul  6 20:58:08 hostname kernel: afs: failed to store file (110)
Jul  6 20:58:08 hostname kernel: afs: failed to store file (110)
Jul  6 20:58:23 hostname kernel: afs: file server 192.168.15.33 in cell
cellname.com is back up (multi-homed address; other same-host interfaces may
still be down)
Jul  6 20:58:23 hostname kernel: afs: file server 192.168.15.33 in cell
cellname.com is back up (multi-homed address; other same-host interfaces may
still be down)
Jul  6 21:00:01 hostname /usr/sbin/cron[10588]:
pam_krb5_compiled(crond:account): pam_sm_acct_mgmt: entry (0x8000)
Jul  6 21:00:01 hostname /usr/sbin/cron[10587]:
pam_krb5_compiled(crond:account): pam_sm_acct_mgmt: entry (0x8000)

I don't find any errors on the file server that it loses connection to.  Our
afs servers are on AIX and the client system running the build is opensuse 11.1
with the afs client at 1.4.11.

Every minute a simple script (check_for_bld_request.sh, which writes
hostname_check.out) goes out to afs and looks for a file.  Most times there
are no problems.  Occasionally this harmless script seems to disrupt the
connection to the file server for the running build (the Lost contact error
always occurs a few seconds after the minute).  Also, we have moved the script
to run at 20 seconds after the minute and the errors follow the same pattern,
only 20 seconds later.  This has happened with multiple scripts that access
afs, so the scripts themselves don't seem to be the problem.
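
The timing pattern described above can be checked mechanically against /var/log/messages. The following is a minimal sketch, not a tested tool: the function names `parse` and `outages` are made up for illustration, the sample lines are abbreviated from the log excerpt earlier in this message, and it assumes each syslog entry sits on a single line (as in the real log file, not as wrapped in this mail), English month abbreviations, and the year 2011.

```python
import re
from datetime import datetime

# Timestamp, host name, then the rest of the message.
STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+(.*)$")

# Abbreviated, unwrapped copies of the entries quoted above.
SAMPLE = """\
Jul  6 20:58:01 hostname /usr/sbin/cron[10556]: (userid1) CMD (...)
Jul  6 20:58:02 hostname kernel: afs: Lost contact with file server 192.168.15.33
Jul  6 20:58:02 hostname kernel: afs: Lost contact with file server 192.168.15.33
Jul  6 20:58:07 hostname kernel: afs: failed to store file (110)
Jul  6 20:58:23 hostname kernel: afs: file server 192.168.15.33 is back up
Jul  6 20:58:23 hostname kernel: afs: file server 192.168.15.33 is back up
"""


def parse(line, year=2011):
    """Return (datetime, message) for a syslog line, or None if it has no stamp."""
    m = STAMP.match(line)
    if m is None:
        return None
    stamp = " ".join(m.group(1).split())  # collapse the padding in "Jul  6"
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, m.group(2)


def outages(lines, year=2011):
    """Return [(seconds_after_last_cron_job, outage_seconds), ...]."""
    last_cron = lost_at = None
    results = []
    for line in lines:
        parsed = parse(line, year)
        if parsed is None:
            continue
        when, message = parsed
        if "cron[" in message and " CMD " in message:
            last_cron = when
        elif "Lost contact with file server" in message and lost_at is None:
            lost_at = when  # duplicate "Lost contact" lines are ignored
        elif "is back up" in message and lost_at is not None:
            after = (lost_at - last_cron).total_seconds() if last_cron else None
            results.append((after, (when - lost_at).total_seconds()))
            lost_at = None
    return results


if __name__ == "__main__":
    print(outages(SAMPLE.splitlines()))
```

For the excerpt above this reports contact lost 1 second after the cron job started and restored 21 seconds later.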

The build uses k5start for creds, which seems fine.  The errors occur on
different systems (all opensuse) at random times, so they are hard to trace.
We also increased the size of the afs cache to 5 GB hoping that would help,
but it didn't seem to.

Any ideas?

Mark Henry
Advisory Software Engineer
Ricoh Production Print Solutions, LLC

