[OpenAFS] Re: errors in afs when multiple tasks are running

Mark Henry mark.henry@infoprint.com
Mon, 11 Jul 2011 10:57:04 -0600


This is a response to two replies.

From Jeffrey Altman:
> I would be interested in seeing a tcpdump of traffic between the affected
> client and the file server covering the few minutes before the build failure
> until the client marks the server up again.

I would too.  We have had a really hard time duplicating this issue and it
happens about monthly on random servers.  This is why I have not grabbed a
tcpdump yet.  Maybe I can work on that since that seems like the needed info.

From Andrew Deason:
>> Jul  6 20:58:04 hostname kernel: afs: Tokens for user of AFS id -1 for cell
>> cellname.com have expired

> Does this always show up at around the same time? Any idea why someone's
> tokens are expiring / how long are these compile jobs running for?

The "tokens expired" message shows up together with the "failed to store
file" error.  However, it also shows up very often in the log file when all
is working well.  The job runs for around 7 hours and k5start is used.  Most
of the time the build completes just fine.

> One thing that's easy to try is to disable the fileserver abort
> throttling by passing '-abortthreshold 0' in the fileserver arguments,
> as that can affect behavior like this. We don't really offer any good
> way to detect if that's the actual problem, but just trying to disable
> them is easy to do. It does require restarting the fileserver, though.

The fileserver daemon is already running with '-abortthreshold 0'.

> Or, well... it may be detectable if you are willing to share with me a
> core of the fileserver captured while this is happening, but even then
> iirc it may annoying to go through.

I am not aware of any core files created on the fileserver as a result of
this issue.

Why would the contact to the fileserver be lost just because a second
script gets kicked off in afs?

Mark Henry
Advisory Software Engineer
Ricoh Production Print Solutions, LLC
----- Forwarded by Mark Henry/US/InfoPrint on 07/11/2011 10:37 AM -----

From:    Mark Henry/US/InfoPrint
Date:    07/08/2011 03:46 PM
To:      openafs-info@openafs.org
Subject: errors in afs when multiple tasks are running

We have been getting periodic build failures when building in afs.  Here is
/var/log/messages at the time of the failure:

Jul  6 20:58:02 hostname kernel: afs: Lost contact with file server
192.168.15.33 in cell cellname.com (all multi-homed ip addresses down for the
server)
Jul  6 20:58:02 hostname kernel: afs: Lost contact with file server
192.168.15.33 in cell cellname.com (all multi-homed ip addresses down for the
server)
Jul  6 20:58:04 hostname kernel: afs: Tokens for user of AFS id -1 for cell
cellname.com have expired
Jul  6 20:58:07 hostname kernel: afs: failed to store file (110)
Jul  6 20:58:08 hostname kernel: afs: failed to store file (110)
Jul  6 20:58:08 hostname kernel: afs: failed to store file (110)
Jul  6 20:58:23 hostname kernel: afs: file server 192.168.15.33 in cell
cellname.com is back up (multi-homed address; other same-host interfaces may
still be down)
Jul  6 20:58:23 hostname kernel: afs: file server 192.168.15.33 in cell
cellname.com is back up (multi-homed address; other same-host interfaces may
still be down)
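
A minimal sketch for reading the number in the "failed to store file" lines,
assuming it is a Linux errno value (an assumption; nothing in this thread
confirms how the client chooses it).  On Linux, errno 110 is ETIMEDOUT, which
would fit a client that has just marked the file server down.  The helper
name store_failure_code is hypothetical.

```python
import errno
import re

# Matches the client message seen in /var/log/messages, e.g.
#   ... kernel: afs: failed to store file (110)
STORE_FAILURE = re.compile(r"afs: failed to store file \((\d+)\)")


def store_failure_code(line):
    """Return the numeric code from a 'failed to store file' line, or None."""
    m = STORE_FAILURE.search(line)
    return int(m.group(1)) if m else None


line = "Jul  6 20:58:07 hostname kernel: afs: failed to store file (110)"
code = store_failure_code(line)

# errno.errorcode maps numbers to symbolic names for the platform running
# this script, so the name is only meaningful when run on a Linux host.
print(code, errno.errorcode.get(code, "unknown"))
```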

I don't find any errors on the file server that it loses connection to.  Our
afs servers are on AIX and the client system running the build is opensuse 11.1
with the afs client at 1.4.11.

Every minute a simple script (hostname_check.out) accesses afs and looks for
a file.  Most of the time there are no problems.  Occasionally, running this
otherwise harmless script seems to disrupt the connection to the file server
for the running build (the "Lost contact" error always occurs a few seconds
after the minute).  When we moved the script to run at 20 seconds after the
minute, the errors followed the same pattern, only 20 seconds later.  This has
happened with multiple scripts that access afs, so the scripts themselves
don't seem to be the problem.
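
The timing correlation described above can be checked across many log files
with a short script.  This is only a sketch: it assumes the syslog line format
shown in the excerpt above, and the function name lost_contact_seconds is
hypothetical.  It tallies the second-of-minute at which each "Lost contact"
message was logged, so a cluster shortly after the script's scheduled second
would support the pattern.

```python
import re
from collections import Counter

# Matches syslog-style lines such as:
#   Jul  6 20:58:02 hostname kernel: afs: Lost contact with file server
# Groups capture hour, minute and second of the timestamp.
LOST_CONTACT = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}):(\d{2}):(\d{2})\s+\S+\s+kernel: afs: Lost contact"
)


def lost_contact_seconds(lines):
    """Count 'Lost contact' events by the second-of-minute they were logged."""
    counts = Counter()
    for line in lines:
        m = LOST_CONTACT.match(line)
        if m:
            counts[int(m.group(3))] += 1
    return counts


sample = [
    "Jul  6 20:58:02 hostname kernel: afs: Lost contact with file server",
    "Jul  6 20:58:02 hostname kernel: afs: Lost contact with file server",
    "Jul  6 20:58:04 hostname kernel: afs: Tokens for user of AFS id -1 for cell",
    "Jul  6 20:58:07 hostname kernel: afs: failed to store file (110)",
]
print(lost_contact_seconds(sample))
```

To run it against a real log, pass an open file instead of the sample list,
for example lost_contact_seconds(open("/var/log/messages")).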

The build uses k5start for creds, which seems fine.  The errors are on
different systems (all opensuse) at random times, so it is hard to trace.  We
also increased the size of the afs cache to 5g hoping that would help, but it
didn't seem to.

Any ideas?

Mark Henry
Advisory Software Engineer
Ricoh Production Print Solutions, LLC

