[OpenAFS-devel] linux 2.4 client potential bug

Derrick J Brashear shadow@dementia.org
Sun, 13 Jan 2002 03:03:52 -0500 (EST)


While doing "test suite testing" as well as patch integration and
verification for the 1.2.3 release, I found what I believe to be a bug,
though it may be in the Linux kernel itself. Under 2.4.9-13 (the Red Hat 7.2
update kernel) I've noticed that in certain situations an Rx call will
become "stuck". In tcpdump you'll see something like:
-client makes call n
-server sends reply n
-client makes call n+1 (implicitly acking reply n)
-server sends reply n (apparently having missed call n+1)
-server sends reply n (several more times)

The client never retransmits, but it is not hung. The problem is much more
likely to be observable on a congested and/or slow network, where packets
are more likely to be lost.
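
To make the stall concrete, here is a toy model of the exchange above. It
is only a sketch and not Rx code: simulate_exchange, toy_server,
server_timeout and MAX_ROUNDS are made-up names, and the assumption that
the first copy of call n+1 is the packet that gets lost is mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; none of these names come from the Rx sources. */
enum { MAX_ROUNDS = 5 };   /* hypothetical number of server resend timeouts */

struct toy_server {
    int last_call_seen;    /* highest call number that reached the server */
    int resends;           /* how many times reply n has been sent again */
};

/* One server timeout: with no sign of the next call, resend the last reply. */
static void server_timeout(struct toy_server *s, int expected_call)
{
    if (s->last_call_seen < expected_call)
        s->resends++;
}

/*
 * Call n is numbered 0 and call n+1 is numbered 1. The first copy of call
 * n+1 is always lost (an assumption of this sketch). Returns the round in
 * which call n+1 reached the server, or -1 if it never did within
 * MAX_ROUNDS. *resends_out receives the number of duplicate "reply n"
 * packets the server sent.
 */
static int simulate_exchange(bool client_retransmits, int *resends_out)
{
    struct toy_server s = { 0, 0 };  /* call n seen, reply n sent once */
    int round;

    assert(resends_out != NULL);
    for (round = 0; round < MAX_ROUNDS; round++) {
        bool sent = (round == 0) || client_retransmits;
        bool lost = (round == 0);
        if (sent && !lost) {
            s.last_call_seen = 1;    /* call n+1 finally arrives */
            break;
        }
        server_timeout(&s, 1);       /* server resends reply n */
    }
    *resends_out = s.resends;
    return (s.last_call_seen == 1) ? round : -1;
}
```

With client_retransmits set to false the function returns -1 and reports
five duplicate replies, which is the pattern in the tcpdump trace. With it
set to true, call n+1 gets through in round 1 after a single duplicate
reply.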

Testing was being done with find in AFS space. In this instance, fstrace
(with events added for testing purposes) shows that RXAFS_FetchStatus never
returns. We get to:
        z_xdrs.x_op = XDR_DECODE;
        if ((!xdr_AFSFetchStatus(&z_xdrs, OutStatus))
             || (!xdr_AFSCallBack(&z_xdrs, CallBack))
             || (!xdr_AFSVolSync(&z_xdrs, Sync))) {
and one of these (it may vary, I'm not sure yet) never returns. According
to rxdebug, the Rx connection is left active with output packets pending.

So far I have been unable to reproduce the problem with 2.4.7-10 (the
Red Hat 7.2 base kernel) on the same hardware and with the same OpenAFS
source, so I can only conclude that either:
-something broke in the kernel
-something changed in the kernel that we didn't expect and haven't dealt with

I will share further details if and when I find them.

-D