[OpenAFS] errors in afs when multiple tasks are running

Mark Henry mark.henry@infoprint.com
Fri, 8 Jul 2011 15:46:48 -0600


We have been getting intermittent build failures when building in AFS.  Here is
/var/log/messages at the time of one failure:

Jul  6 20:58:01 hostname /usr/sbin/cron[10555]:
pam_krb5_compiled(crond:account): pam_sm_acct_mgmt: exit (ignore)
Jul  6 20:58:01 hostname /usr/sbin/cron[10556]: (userid1) CMD (${K5S_USERID1}
-- /bin/sh -c "/a/p/cpui/build/dir/check_for_bld_request.sh >
/a/p/cpui/build/dir/hostname_check.out 2>&1" >> /home/userid1/k5start.out 2>&1)
Jul  6 20:58:02 hostname kernel: afs: Lost contact with file server
192.168.15.33 in cell cellname.com (all multi-homed ip addresses down for the
server)
Jul  6 20:58:02 hostname kernel: afs: Lost contact with file server
192.168.15.33 in cell cellname.com (all multi-homed ip addresses down for the
server)
Jul  6 20:58:04 hostname kernel: afs: Tokens for user of AFS id -1 for cell
cellname.com have expired
Jul  6 20:58:07 hostname kernel: afs: failed to store file (110)
Jul  6 20:58:08 hostname kernel: afs: failed to store file (110)
Jul  6 20:58:08 hostname kernel: afs: failed to store file (110)
Jul  6 20:58:23 hostname kernel: afs: file server 192.168.15.33 in cell
cellname.com is back up (multi-homed address; other same-host interfaces may
still be down)
Jul  6 20:58:23 hostname kernel: afs: file server 192.168.15.33 in cell
cellname.com is back up (multi-homed address; other same-host interfaces may
still be down)
Jul  6 21:00:01 hostname /usr/sbin/cron[10588]:
pam_krb5_compiled(crond:account): pam_sm_acct_mgmt: entry (0x8000)
Jul  6 21:00:01 hostname /usr/sbin/cron[10587]:
pam_krb5_compiled(crond:account): pam_sm_acct_mgmt: entry (0x8000)

I can't find any errors on the file server that the client loses contact with.
Our AFS servers run on AIX, and the client system running the build is openSUSE
11.1 with the OpenAFS client at 1.4.11.

Every minute a simple script (check_for_bld_request.sh, which writes its output
to hostname_check.out) goes out to AFS and looks for a file.  Most of the time
there are no problems.  Occasionally, though, running this harmless script seems
to disrupt the connection to the file server for the running build (the "Lost
contact" error always occurs a few seconds after the minute).  When we moved the
script to run at 20 seconds after the minute, the errors followed the same
pattern, only 20 seconds later.  This has happened with multiple scripts that
access AFS, so the scripts themselves don't seem to be the problem.
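
To check the timing pattern described above across a whole log rather than by
eye, something like the following could tally how many seconds past the minute
each "Lost contact" message arrives.  This is only a rough sketch: the function
name lost_contact_offsets is made up for illustration, and it assumes the
standard syslog timestamp layout shown in the excerpt above, with each message
on one unwrapped line as it would be in the real /var/log/messages.

```python
import re
from collections import Counter

# Syslog timestamp at the start of a line, e.g. "Jul  6 20:58:02 ".
# Assumption: the log uses this traditional format throughout.
TS = re.compile(r"^[A-Z][a-z]{2}\s+\d+\s+(\d{2}):(\d{2}):(\d{2})\s")

def lost_contact_offsets(lines):
    """Return the seconds-past-the-minute of each AFS 'Lost contact' message."""
    offsets = []
    for line in lines:
        m = TS.match(line)
        if m and "afs: Lost contact with file server" in line:
            offsets.append(int(m.group(3)))
    return offsets

# Two lines taken from the excerpt above (first physical line of each message).
sample = [
    "Jul  6 20:58:02 hostname kernel: afs: Lost contact with file server",
    "Jul  6 20:58:07 hostname kernel: afs: failed to store file (110)",
]

counts = Counter(lost_contact_offsets(sample))
for sec in sorted(counts):
    print(f"{sec:02d}s past the minute: {counts[sec]}")
```

To run it against a real log, pass an open file object to
lost_contact_offsets() in place of the sample list.  If the counts cluster just
after whatever second the cron job starts, that would support the correlation
with the cron job.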

The build uses k5start for credentials, which seems to work fine.  The errors
occur on different systems (all openSUSE) at random times, so they are hard to
trace.  We also increased the AFS cache size to 5 GB, but that didn't seem to
help.

Any ideas?

Mark Henry
Advisory Software Engineer
Ricoh Production Print Solutions, LLC

