[OpenAFS] connection timeout errors
Todd DeSantis
atd@us.ibm.com
Fri, 6 Jun 2003 10:37:28 -0400
Hi -
Here are some things that you might try/check if you
suspect that the clients are somehow losing their
NAT mappings.
- Set the NAT mapping timeouts to 30 minutes.
Some customers have seen success when going to
timeouts longer than 15 minutes. Try
30 minutes and then work your way down until
you see problems.
There are RPCs going between the fileserver and
client every so often even if the client does
not need to get information on any data. The client
might send a check every 10 minutes and the same
with the fileserver. This should be enough to
keep the NAT mapping around as long as the timeout
is greater than this number. If a fileserver has
lots of hosts connecting to it and some drop off the
network, this 10-minute cycle can actually be longer
than 10 minutes, so that is why we increased the time
to 30 minutes to see if this helps and then work our
way back down.
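
If you do get a packet trace, the reasoning above can be checked directly. This is a minimal sketch, assuming you have already extracted the packet timestamps (in seconds) for one client/fileserver pair; the function name and the sample numbers are made up for illustration. It lists every idle gap longer than the NAT timeout, which is where a mapping could have been dropped.

```python
def expired_gaps(packet_times, nat_timeout):
    """Return (start, end) pairs where consecutive packets are spaced
    further apart than nat_timeout seconds."""
    times = sorted(packet_times)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > nat_timeout]

# Hypothetical timestamps: a check every 10 minutes, with one cycle
# stretched to 20 minutes.
times = [0, 600, 1200, 2400, 3000]

print(expired_gaps(times, 15 * 60))  # 15 minute NAT timeout
print(expired_gaps(times, 30 * 60))  # 30 minute NAT timeout
```

With the 15 minute timeout the stretched cycle shows up as a gap; with 30 minutes it does not.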
- You can always take a callback dump from the
fileserver via
    kill -XCPU <fileserver pid>
This will create 3 files in /usr/afs/local:
    callback.dump   data file - don't need
    clients.dump    ascii file for client/user connections
    hosts.dump      ascii file with host connection data;
                    this is the file we would be interested in
In the hosts.dump file, you can search for the hex equivalent of
your NATed client IPs and look to see what the fileserver thinks
about this machine. Are there multiple entries for this IP ?
Did the real IP of the client show up in this list - it shouldn't
be there. What is the port associated with this entry, 7001 or
something else ?
*** The -XCPU signal will block calls to the fileserver while these
3 files are created. If the number of connections/hosts hitting
this fileserver is large, this can take many minutes to
complete. You might see clients getting "waiting for busy volume"
messages when sending this signal. Just a warning here.
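
To get the hex equivalent of a client IP for searching hosts.dump, something like the following may help. It is only a sketch: the function name and the sample address are made up, and which byte order actually appears in the dump is an assumption that may depend on the fileserver's platform and version, so it prints both. Leading zeros may also be dropped in the dump, so search case-insensitively and try the unpadded form too.

```python
import ipaddress

def hex_forms(dotted):
    """Return the IPv4 address as 8 hex digits in both byte orders."""
    packed = ipaddress.IPv4Address(dotted).packed  # 4 bytes, network order
    return {
        "network_order": packed.hex(),
        "byte_swapped": packed[::-1].hex(),
    }

# Hypothetical NATed client address.
print(hex_forms("192.168.1.10"))
```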
- Also, you might want to look at the messages in the FileLog regarding
the hex IP of the clients in question. They might mention RCallback
or "possible network or routing" problems. When did these messages
start showing up in the FileLog ? Does that coincide with anything on
the client machine ==> a reboot, a NAT firewall reboot, etc.
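
The FileLog and hosts.dump searches described above can be sketched as a simple line scan. This assumes nothing about the real file formats: it only looks for the hex IP as a case-insensitive substring and flags lines that also contain one of the strings mentioned above. The function name is made up, and the sample lines are placeholders, not real FileLog output.

```python
KEYWORDS = ("RCallback", "network or routing")

def find_mentions(lines, hex_ips, keywords=KEYWORDS):
    """Return (line_number, has_keyword, text) for each line that
    contains one of hex_ips (case-insensitive substring match)."""
    wanted = [h.lower() for h in hex_ips]
    hits = []
    for number, text in enumerate(lines, start=1):
        lowered = text.lower()
        if any(h in lowered for h in wanted):
            has_keyword = any(k.lower() in lowered for k in keywords)
            hits.append((number, has_keyword, text.rstrip("\n")))
    return hits

# Placeholder lines, NOT real FileLog output.
sample = [
    "placeholder entry for host C0A8010A: RCallback failed\n",
    "placeholder entry for some other host\n",
    "placeholder entry for host c0a8010a\n",
]
for hit in find_mentions(sample, ["c0a8010a"]):
    print(hit)
```

An open file object can be passed in place of the sample list, since the function only iterates over lines.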
Thanks
Todd
Derek Atkins <warlord@MIT.EDU>
Sent by: openafs-info-admin@openafs.org
06/05/2003 11:09 AM
To: Elliot Peele <ebpeele2@pams.ncsu.edu>
cc: openafs-info@openafs.org
Subject: Re: [OpenAFS] connection timeout errors
Well, the bug that I was thinking about would occur when the IP/port
changes (from the vantage point of the fileserver). So,
rebooting the NAT box would effectively cause this bug (as would any
other NAT mapping lossage). Is it possible that the affected machines
are somehow losing their NAT mappings?
Without seeing a packet trace it's hard to know what's going on. :(
-derek
Elliot Peele <ebpeele2@pams.ncsu.edu> writes:
> I haven't tried sniffing the traffic to see what exactly is happening
> yet. If I can get the connection timeouts to reproduce themselves, I'll
> try it tomorrow.
>
> I've noticed that if I reboot the firewall and delete the afs cache on
> the client machine the problem goes away, but this is not a viable option.
>
> Elliot
>
> On Wed, 2003-06-04 at 18:08, Derek Atkins wrote:
> > Hmm, then I don't know what to suggest to you... AFS behind a NAT is
> > just... weird. It usually works, but it can get into strange states
> > sometimes. There were a few bugs in the fileserver where it would
> > try to callback to the wrong address and fail to get a WhoAreYou
> > response.
> >
> > Have you tried running a network sniffer on both sides of the NAT
> > box to see what's going on with the failed connections?
> >
> > -derek
> >
> > Elliot Peele <ebpeele2@pams.ncsu.edu> writes:
> >
> > > These are desktops that are behind the NAT 100% of the time.
> > >
> > > Elliot
> > >
> > > On Wed, 2003-06-04 at 17:30, Derek Atkins wrote:
> > > > Are these users on laptops or are they _ALWAYS_, 100% behind the NAT?
> > > >
> > > > -derek
> > > >
> > > > Elliot Peele <ebpeele2@pams.ncsu.edu> writes:
> > > >
> > > > > Hi,
> > > > >
> > > > > I thought I'd try this again worded a bit differently and with a
> > > > > different subject. I have several users that keep getting
> > > > > connection timeout errors when trying to access their volumes
> > > > > from behind a firewall. I believe this may be a problem with the
> > > > > UDP timeouts. They are OpenAFS clients connecting to a Transarc
> > > > > AFS server through an iptables NATing firewall running on
> > > > > Red Hat Linux 7.3, currently with kernel 2.4.18-24.7.x.
> > > > >
> > > > > Thanks
> > > > >
> > > > > Elliot
--
Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
Member, MIT Student Information Processing Board (SIPB)
URL: http://web.mit.edu/warlord/ PP-ASEL-IA N1NWH
warlord@MIT.EDU PGP key available
_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info