[OpenAFS] connection timeout errors

Todd DeSantis atd@us.ibm.com
Fri, 6 Jun 2003 10:37:28 -0400




Hi -

Here are some things that you might try/check if you
suspect that the clients are somehow losing their
NAT mappings.

 - set the NAT mapping timeouts to 30 minutes.
   Some customers have seen success when going to
   timeouts longer than 15 minutes.  Try
   30 minutes and then work your way down until
   you see problems.

   There are RPCs going between the fileserver and
   client every so often even if the client does
   not need to get information on any data.  The client
   might send a check every 10 minutes, and the fileserver
   does the same.  This should be enough to
   keep the NAT mapping around as long as the timeout
   is greater than this interval.  If a fileserver has
   lots of hosts connecting to it and some drop off the
   network, this cycle can actually take longer
   than 10 minutes, which is why we increased the timeout
   to 30 minutes to see if this helps, and then work our
   way back down.

 - you can always take a callback dump from the
   fileserver via

      kill -XCPU <fileserver pid>

   This will create 3 files in /usr/afs/local
      callback.dump     data file - don't need
      clients.dump      ascii file for client/user connections
      hosts.dump        ascii file with host connection data
                        this is the file we would be interested
                        in

   In the hosts.dump file, you can search for the hex equivalent of
   your NATed client IPs and look to see what the fileserver thinks
   about this machine.  Are there multiple entries for this IP ?
   Did the real IP of the client show up in this list - it shouldn't
   be there.  What is the port associated with this entry, 7001 or
   something else ?

   ***   The -XCPU signal will block calls to the fileserver while these
         3 files are created.  If the number of connections/hosts hitting
         this fileserver is large, this can take many minutes to
         complete.  You might see clients getting "waiting for busy volume"
         messages when sending this signal.  Just a warning here.

 - Also, you might want to look at the messages in the FileLog regarding
   the hex IP of the clients in question.  They might mention RCallback
   or "possible network or routing" problems.  When did these messages
   start showing up in the FileLog ?  Does that coincide with anything on
   the client machine ==> a reboot, a NAT firewall reboot, etc. ?
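
The hex lookups described above (in hosts.dump and in the FileLog) could be
scripted roughly as follows.  This is only a sketch, not an OpenAFS tool:
the function names are made up for illustration, and it assumes the files
print the address as a bare hex number.  Whether the bytes show up in
network order or reversed may depend on the fileserver version and
platform, so the sketch searches for both spellings, with and without
leading zeros.

```python
import ipaddress
import re


def hex_forms(dotted_ip):
    """Return candidate hex spellings of an IPv4 address.

    Assumption: the byte order used in the dump/log is not known in
    advance, so both network order and reversed order are returned.
    """
    packed = ipaddress.IPv4Address(dotted_ip).packed  # 4 bytes, network order
    network_order = packed.hex()
    reversed_order = packed[::-1].hex()
    # A %x-style print drops leading zeros, so include those spellings too.
    forms = {network_order, reversed_order,
             network_order.lstrip("0"), reversed_order.lstrip("0")}
    forms.discard("")
    return sorted(forms)


def find_client_lines(lines, dotted_ip,
                      keywords=("RCallback", "possible network or routing")):
    """Return (line_number, text, keywords_found) for lines naming the IP."""
    alternatives = "|".join(re.escape(form) for form in hex_forms(dotted_ip))
    # Require non-hex characters on both sides so a short spelling does not
    # match in the middle of a longer hex number.
    pattern = re.compile(r"(?<![0-9a-f])(?:" + alternatives + r")(?![0-9a-f])",
                         re.IGNORECASE)
    hits = []
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            lowered = line.lower()
            found = [k for k in keywords if k.lower() in lowered]
            hits.append((number, line.rstrip("\n"), found))
    return hits


# Usage (10.0.0.1 is a made-up NATed address):
# with open("/usr/afs/local/hosts.dump") as dump:
#     for number, text, _ in find_client_lines(dump, "10.0.0.1"):
#         print(number, text)
```

Running the same function over the FileLog gives the lines for that client,
along with which of the two message types each one mentions.  More than one
hit per address in hosts.dump, or a port other than 7001, is what to look
at first.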

Thanks

Todd







                                                                                                                                
From:     Derek Atkins <warlord@MIT.EDU>
Sent by:  openafs-info-admin@openafs.org
Date:     06/05/2003 11:09 AM
To:       Elliot Peele <ebpeele2@pams.ncsu.edu>
cc:       openafs-info@openafs.org
Subject:  Re: [OpenAFS] connection timeout errors
                                                                                                                                
                                                                                                                                




Well, the bug that I was thinking about would occur when the IP/port
changed (from the vantage point of the fileserver).  So,
rebooting the NAT box would effectively cause this bug (as would any
other NAT mapping lossage).  Is it possible that the affected machines
are somehow losing their NAT mappings?

Without seeing a packet trace it's hard to know what's going on. :(

-derek

Elliot Peele <ebpeele2@pams.ncsu.edu> writes:

> I haven't tried sniffing the traffic to see what exactly is happening
> yet. If I can get the connection timeouts to reproduce themselves, I'll
> try it tomorrow.
>
> I've noticed that if I reboot the firewall and delete the afs cache on
> the client machine the problem goes away, but this is not a viable option.
>
> Elliot
>
> On Wed, 2003-06-04 at 18:08, Derek Atkins wrote:
> > Hmm, then I don't know what to suggest to you...  AFS behind a NAT is
> > just... weird.  It usually works, but it can get into strange states
> > sometimes.  There were a few bugs in the fileserver where it would
> > try to callback to the wrong address and fail to get a WhoAreYou
> > response.
> >
> > Have you tried running a network sniffer on both sides of the NAT
> > box to see what's going on with the failed connections?
> >
> > -derek
> >
> > Elliot Peele <ebpeele2@pams.ncsu.edu> writes:
> >
> > > These are desktops that are 100% of the time behind the NAT.
> > >
> > > Elliot
> > >
> > > On Wed, 2003-06-04 at 17:30, Derek Atkins wrote:
> > > > Are these users on laptops or are they _ALWAYS_, 100% behind the NAT?
> > > >
> > > > -derek
> > > >
> > > > Elliot Peele <ebpeele2@pams.ncsu.edu> writes:
> > > >
> > > > > Hi,
> > > > >
> > > > > I thought I'd try this again worded a bit differently and with a
> > > > > different subject. I have several users that keep getting
> > > > > connection timeout errors when trying to access their volumes
> > > > > from behind a firewall. I believe this may be a problem with the
> > > > > udp timeouts. They are OpenAFS clients connecting to a Transarc
> > > > > AFS server through an iptables NATing firewall running on Red Hat
> > > > > Linux 7.3 currently with kernel 2.4.18-24.7.x.
> > > > >
> > > > > Thanks
> > > > >
> > > > > Elliot

--
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       warlord@MIT.EDU                        PGP key available
_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info