[OpenAFS] connection timeout errors

Elliot Peele ebpeele2@pams.ncsu.edu
26 Jun 2003 15:52:08 -0400


On Fri, 2003-06-06 at 10:37, Todd DeSantis wrote:
> Hi -
>
> Here are some things that you might try/check if you
> suspect that the clients are somehow losing their
> NAT mappings.
>
>  - Set the NAT mapping timeouts to 30 minutes.
>    Some customers have seen success when going to
>    timeouts longer than 15 minutes.  Try
>    30 minutes and then work your way down until
>    you see problems.

Currently the timeouts are set at 30 min. It really isn't viable for me
to keep adjusting the timeouts because I can't keep rebooting my
firewall all the time. As it is now I have to reboot every one to three
weeks to fix the current problem.

>    There are RPCs going between the fileserver and
>    client every so often even if the client does
>    not need to get information on any data.  The client
>    might send a check every 10 minutes and the same
>    with the fileserver.  This should be enough to
>    keep the NAT mapping around as long as the timeout
>    is greater than this number.  If a fileserver has
>    lots of hosts connecting to it and some drop off the
>    network, this 10 minutes cycle can actually be longer
>    than 10 minutes, so that is why we increased the time
>    to 30 minutes to see if this helps and then work our
>    way back down.
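
The timing argument above can be sketched as a small check. This is only a
simplified model, not how any particular NAT or fileserver behaves: it assumes
the NAT refreshes a mapping whenever a packet passes in either direction, and
it takes the 10-minute probe interval from Todd's description. The function
names are hypothetical.

```python
def mapping_survives(nat_timeout_s, probe_interval_s, worst_case_delay_s=0):
    """Return True if the next probe should arrive before the NAT entry expires.

    worst_case_delay_s models the extra time a busy fileserver may need
    before it gets around to probing this client again.
    """
    return probe_interval_s + worst_case_delay_s < nat_timeout_s


def largest_safe_delay(nat_timeout_s, probe_interval_s):
    """Return how many seconds a probe may slip before the mapping is at risk."""
    return max(0, nat_timeout_s - probe_interval_s)


if __name__ == "__main__":
    probe = 10 * 60  # "every 10 minutes", as described above
    for timeout_min in (15, 30):
        slack = largest_safe_delay(timeout_min * 60, probe)
        print(f"{timeout_min} min NAT timeout leaves {slack // 60} min of slack")
```

Under this model a 30-minute timeout tolerates a probe cycle that slips by up
to 20 minutes, while a 15-minute timeout tolerates only 5.
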

>  - you can always take a callback dump from the
>    fileserver via

This doesn't work for me, because I don't have access to the fileserver
other than through AFS.

>       kill -XCPU <fileserver pid>
>
>    This will create 3 files in /usr/afs/local
>       callback.dump     data file - don't need
>       clients.dump      ascii file for client/user connections
>       hosts.dump        ascii file with host connection data
>                         this is the file we would be interested
>                         in
>
>    In the hosts.dump file, you can search for the hex equivalent of
>    your NATed client IPs and look to see what the fileserver thinks
>    about this machine.  Are there multiple entries for this IP?
>    Did the real IP of the client show up in this list - it shouldn't
>    be there.  What is the port associated with this entry, 7001 or
>    something else?
>
>    ***   The -XCPU signal will block calls to the fileserver while these
>          3 files are created.  If the number of connections/hosts hitting
>          this fileserver is large, this can take many minutes to
>          complete.  You might see clients getting "waiting for busy
>          volume" messages when sending this signal.  Just a warning here.
>
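
For the "hex equivalent" lookup described above, a minimal sketch of the
conversion follows. Which byte order hosts.dump uses is not stated in this
thread, so the sketch returns both orders and leaves it to the reader to
search for each. The address 192.0.2.17 is a documentation placeholder, not
one of the real clients, and hex_forms is a hypothetical helper name.

```python
import socket


def hex_forms(dotted_ip):
    """Return an IPv4 address as 8 hex digits, in both byte orders.

    The first value is network (big-endian) order, the second is the same
    four bytes reversed.
    """
    packed = socket.inet_aton(dotted_ip)  # 4 bytes in network order
    return packed.hex(), packed[::-1].hex()


if __name__ == "__main__":
    network_order, reversed_order = hex_forms("192.0.2.17")
    print("search hosts.dump for:", network_order, "or", reversed_order)
```
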
>  - Also, you might want to look at the messages in the FileLog regarding
>    the hex IP of the clients in question.  They might mention RCallback
>    or "possible network or routing" problems.  When did these messages
>    start showing up in the FileLog?  Does that coincide with anything on
>    the client machine ==> a reboot, a NAT firewall reboot, etc.?
>
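
The FileLog check above could be scripted roughly as follows. This is a sketch
under assumptions: the exact FileLog line format is not shown in this thread,
so it only does a case-insensitive substring match on whatever hex strings it
is given, and the /usr/afs/logs/FileLog path in the comment is assumed to be
the usual Transarc location rather than confirmed here. Both function names
are hypothetical.

```python
def matching_lines(lines, needles):
    """Return the lines that contain any of the needles, ignoring case."""
    lowered = [n.lower() for n in needles]
    return [
        line.rstrip("\n")
        for line in lines
        if any(n in line.lower() for n in lowered)
    ]


def scan_file(path, needles):
    """Apply matching_lines to a log file, tolerating odd bytes in the log."""
    with open(path, errors="replace") as handle:
        return matching_lines(handle, needles)


# Example call (path and hex strings are assumptions, adjust to your site):
#   for hit in scan_file("/usr/afs/logs/FileLog", ["c0000211", "110200c0"]):
#       print(hit)
```

Comparing the timestamps of the first matching lines against the firewall's
reboot times would answer the "does that coincide with anything" question.
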
> Thanks
>
> Todd

Thanks

Elliot

>
> Derek Atkins <warlord@MIT.EDU>
> Sent by: openafs-info-admin@openafs.org
> 06/05/2003 11:09 AM
> To:      Elliot Peele <ebpeele2@pams.ncsu.edu>
> cc:      openafs-info@openafs.org
> Subject: Re: [OpenAFS] connection timeout errors
>
> Well, the bug that I was thinking about would occur when the IP/port
> changes (from the vantage point of the fileserver).  So,
> rebooting the NAT box would effectively cause this bug (as would any
> other NAT mapping lossage).  Is it possible that the affected machines
> are somehow losing their NAT mappings?
>
> Without seeing a packet trace it's hard to know what's going on. :(
>
> -derek
>
> Elliot Peele <ebpeele2@pams.ncsu.edu> writes:
>
> > I haven't tried sniffing the traffic to see what exactly is happening
> > yet. If I can get the connection timeouts to reproduce themselves,
> > I'll try it tomorrow.
> >
> > I've noticed that if I reboot the firewall and delete the AFS cache on
> > the client machine the problem goes away, but this is not a viable
> > option.
> >
> > Elliot
> >
> > On Wed, 2003-06-04 at 18:08, Derek Atkins wrote:
> > > Hmm, then I don't know what to suggest to you...  AFS behind a NAT is
> > > just... weird.  It usually works, but it can get into strange states
> > > sometimes.  There were a few bugs in the fileserver where it would
> > > try to callback to the wrong address and fail to get a WhoAreYou
> > > response.
> > >
> > > Have you tried running a network sniffer on both sides of the NAT
> > > box to see what's going on with the failed connections?
> > >
> > > -derek
> > >
> > > Elliot Peele <ebpeele2@pams.ncsu.edu> writes:
> > >
> > > > These are desktops that are behind the NAT 100% of the time.
> > > >
> > > > Elliot
> > > >
> > > > On Wed, 2003-06-04 at 17:30, Derek Atkins wrote:
> > > > > Are these users on laptops or are they _ALWAYS_, 100% behind the
> > > > > NAT?
> > > > >
> > > > > -derek
> > > > >
> > > > > Elliot Peele <ebpeele2@pams.ncsu.edu> writes:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I thought I'd try this again worded a bit differently and with a
> > > > > > different subject. I have several users that keep getting
> > > > > > connection timeout errors when trying to access their volumes
> > > > > > from behind a firewall. I believe this may be a problem with the
> > > > > > UDP timeouts. They are OpenAFS clients connecting to a Transarc
> > > > > > AFS server through an iptables NATing firewall running on Red Hat
> > > > > > Linux 7.3, currently with kernel 2.4.18-24.7.x.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > Elliot
>
> --
>        Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
>        Member, MIT Student Information Processing Board  (SIPB)
>        URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
>        warlord@MIT.EDU                        PGP key available
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
