[OpenAFS] openafs client on Linux-s390x sometimes not replying to whoareyou() calls

Jason Edgecombe jason@rampaginggeek.com
Fri, 29 Feb 2008 18:07:16 -0500


Carsten Jacobi wrote:
> Hello,
>
> since last month we have been facing "Lost contact with fileserver"
> situations on one of our zLinux systems (Novell SLES-9 distribution).
> After further investigation we found that the cause of the
> "Lost contact" hang seems to be our AFS client (version 1.4.5) not
> replying to whoareyou() calls from the fileserver.
> We have used tcpdump to record all the packets we hope are essential to
> tracking down the problem. For example, in normal operation we see our
> AFS client answer the whoareyou() call in about 40 to 100 µsec:
>
> 10:30:00.945453 IP fs13.xxx.xx.xx.xx.afs3-fileserver >
> mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)
> 10:30:00.945499 IP mclinx.xxx.xx.xx.xx.afs3-callback >
> fs13.xxx.xx.xx.xx.afs3-fileserver:  rx data cb reply whoareyou (460)
> 10:30:08.941373 IP fs20.xxx.xx.xx.xx.afs3-fileserver >
> mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)
> 10:30:08.941455 IP mclinx.xxx.xx.xx.xx.afs3-callback >
> fs20.xxx.xx.xx.xx.afs3-fileserver:  rx data cb reply whoareyou (460)
> 10:30:08.952207 IP fs25.xxx.xx.xx.xx.afs3-fileserver >
> mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)
> 10:30:08.952266 IP mclinx.xxx.xx.xx.xx.afs3-callback >
> fs25.xxx.xx.xx.xx.afs3-fileserver:  rx data cb reply whoareyou (460)
> 10:30:24.173003 IP fs13.xxx.xx.xx.xx.afs3-fileserver >
> mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)
> 10:30:24.173042 IP mclinx.xxx.xx.xx.xx.afs3-callback >
> fs13.xxx.xx.xx.xx.afs3-fileserver:  rx data cb reply whoareyou (460)
> 10:30:24.176168 IP fs11.xxx.xx.xx.xx.afs3-fileserver >
> mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)
> 10:30:24.176213 IP mclinx.xxx.xx.xx.xx.afs3-callback >
> fs11.xxx.xx.xx.xx.afs3-fileserver:  rx data cb reply whoareyou (460)
>
> mclinx is our AFS client and fsxx are AFS fileservers. We see these
> whoareyou() calls and replies all the time, but sometimes our client
> does not respond:
>
> At 10:31 AM, the first whoareyou() from fs15 gets no reply:
> 10:31:22.808760 IP mclinx.xxx.xx.xx.xx.afs3-callback >
> fs20.xxx.xx.xx.xx.afs3-fileserver:  rx data fs call give-cbs (244)
> 10:31:22.809183 IP mclinx.xxx.xx.xx.xx.afs3-callback >
> fs20.xxx.xx.xx.xx.afs3-fileserver:  rx data fs call give-cbs (88)
> 10:31:22.809368 IP fs20.xxx.xx.xx.xx.afs3-fileserver >
> mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call callback fid
> 1802411095/5/1330193 afsuuid [|cb] (52)
> 10:31:22.809602 IP mclinx.xxx.xx.xx.xx.afs3-callback >
> fs15.xxx.xx.xx.xx.afs3-fileserver:  rx data fs call give-cbs (244) (see
> below at 10:35 AM)
> 10:31:22.810046 IP fs15.xxx.xx.xx.xx.afs3-fileserver >
> mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)
> 10:31:23.134195 IP fs15.xxx.xx.xx.xx.afs3-fileserver >
> mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call callback fid
> 1802410300/17405/631375 afsuuid [|cb] (52)
> 10:31:23.163772 IP fs20.xxx.xx.xx.xx.afs3-fileserver >
> mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call callback fid
> 1802411095/5/1330193 afsuuid [|cb] (52)
> 10:31:23.166077 IP fs15.xxx.xx.xx.xx.afs3-fileserver >
> mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)
> 10:31:23.489312 IP fs15.xxx.xx.xx.xx.afs3-fileserver >
> mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call callback fid
> 1802410300/17405/631375 afsuuid [|cb] (52)
>
> Here, fs15 sends a whoareyou() which doesn't get a reply, and about
> a third of a second later another whoareyou() is sent to the AFS client
> on mclinx. Neither of them gets an answer.
> To make a long story short, the fileserver fs15 sends initcb() to
> the AFS client two minutes later, and another two minutes after that
> we see the first rx abort packet sent to the AFS client, which
> makes the AFS client report "Lost contact" with fs15 in the
> system log (at least this is my interpretation).
> Unfortunately, the AFS client doesn't respond to any whoareyou()
> from other fileservers until 10:45 AM in our log, which ends
> in "Lost contact" with all the fileservers around and all
> AFS activity freezing for about a quarter of an hour until the
> connections are reported "back up" again.
>
> My question is: What can block an AFS-Client from answering
> whoareyou() for several minutes? Are there any limits or restrictions
> that can lead an AFS client to a situation where it is internally blocked?
> Are there parameters one can adjust for tuning in order to avoid this
> situation?
> We have had these "Lost contact" episodes about once every two days
> lately, and they are painful for users who are logged on to our system
> at the time. I would be happy to get rid of them somehow ...
>   

This may be a silly question, but are there any firewalls or NAT devices
on the fileserver, on the client, or in between?

If a firewall blocks the whoareyou() message, that would explain
why there is no reply.
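
To narrow this down, it may help to pair every whoareyou() call in the
capture with its reply and list the calls that never got one. Running the
same check on captures taken on the client and on the fileserver would show
whether the calls reach the client at all. Below is a minimal sketch,
assuming the text format of the tcpdump lines quoted above; the function
names (join_wrapped, pair_whoareyou) are made up for illustration, and it
assumes the capture does not cross midnight.

```python
import re
from datetime import datetime

# One tcpdump record: timestamp, "src > dst:", then the decoded rx payload.
RECORD = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d{6}) IP (?P<src>\S+) > (?P<dst>\S+):\s+"
    r"rx data cb (?P<kind>call|reply) whoareyou"
)
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{6} ")


def join_wrapped(lines):
    """Re-join records that a mail client wrapped, dropping '> ' quote marks."""
    record = ""
    for line in lines:
        line = re.sub(r"^>\s?", "", line.rstrip("\n")).strip()
        if TIMESTAMP.match(line):
            # A new timestamp starts a new record.
            if record:
                yield record
            record = line
        elif record and line:
            # Continuation of a wrapped record.
            record += " " + line
        elif record:
            # A blank line ends the current record.
            yield record
            record = ""
    if record:
        yield record


def pair_whoareyou(lines):
    """Return (latencies, unanswered) for whoareyou calls in a text capture.

    latencies:  list of (server, microseconds) for answered calls
    unanswered: list of (server, timestamp) for calls with no reply
    """
    pending = {}  # (server, client) -> timestamps of calls awaiting a reply
    latencies = []
    for record in join_wrapped(lines):
        m = RECORD.match(record)
        if not m:
            continue
        if m["kind"] == "call":
            pending.setdefault((m["src"], m["dst"]), []).append(m["ts"])
            continue
        # A reply travels client -> server, so swap src and dst for the lookup.
        calls = pending.get((m["dst"], m["src"]))
        if calls:
            sent = datetime.strptime(calls.pop(0), "%H:%M:%S.%f")
            seen = datetime.strptime(m["ts"], "%H:%M:%S.%f")
            delta = seen - sent
            micros = delta.seconds * 1_000_000 + delta.microseconds
            latencies.append((m["dst"], micros))
    unanswered = [
        (server, ts) for (server, _client), calls in pending.items() for ts in calls
    ]
    return latencies, unanswered
```

For example, with the tcpdump output saved to a text file (say capture.txt,
a hypothetical name), pair_whoareyou(open("capture.txt")) returns the reply
latencies and the list of calls that were never answered.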

Sincerely,
Jason