[OpenAFS-devel] Re: progress... sortof...

Wed, 28 Apr 2004 20:05:12 -0400

(Switching back to openafs-devel for wider audience.)

Found where it's spinning... Added some instrumentation to the relevant
routines... DAMN PutVCache gets called a lot... I see about a dozen
routines in the kernel code that I _REALLY_ would like to know why they
aren't inlined... Also seems like there is some serious room for
optimization of the xcache lock when doing lots of putvcache ops in a
row... Another time...

In LINUX/osi_misc.c, osi_file_uio_rdwr():

UMRTRACE; /* A */

    savelim = current->rlim[RLIMIT_FSIZE].rlim_cur;
    current->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;

    if (uiop->uio_seg == AFS_UIOSYS)
        TO_USER_SPACE();

    filp->f_pos = uiop->uio_offset;
    while (code == 0 && uiop->uio_resid > 0 && uiop->uio_iovcnt > 0) {
        iov = uiop->uio_iov;
        count = iov->iov_len;
        if (count == 0) {
            uiop->uio_iov++;
            uiop->uio_iovcnt--;
            continue;
        }

UMRTRACE;   /* B around line 195 */

        if (rw == UIO_READ)
            code = FOP_READ(filp, iov->iov_base, count);
        else
            code = FOP_WRITE(filp, iov->iov_base, count);

        if (code < 0) {
            code = -code;
            break;
        }

        iov->iov_base += code;
        iov->iov_len -= code;
        uiop->uio_resid -= code;
        uiop->uio_offset += code;
        code = 0;
    }

    if (uiop->uio_seg == AFS_UIOSYS)
        TO_KERNEL_SPACE();

    current->rlim[RLIMIT_FSIZE].rlim_cur = savelim;

UMRTRACE; /* C */

Basically seems to get stuck hitting B repeatedly as fast as it can
without hitting A or C. Works fine for a while, but then gets stuck here.

iov_len and uio_offset are unsigned, but uio_resid is not, and iov_base
is a void *. If iov_len somehow underflowed, it would likely cause the
loop to run forever...
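
To make the underflow worry concrete, here is a minimal userspace sketch of the loop's bookkeeping for one transfer. The names sim_counts and sim_account are made up for illustration and are not in the OpenAFS source; the field types are assumed from the description above (unsigned length, signed residual).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters the loop updates. */
struct sim_counts {
    size_t iov_len;   /* unsigned, like iov_len in the post */
    long uio_resid;   /* signed, like uio_resid in the post */
};

/* Applies the loop's accounting for a transfer that reported `code` bytes. */
static struct sim_counts sim_account(struct sim_counts c, long code)
{
    assert(code >= 0);
    c.iov_len -= (size_t)code;  /* wraps around if code > iov_len */
    c.uio_resid -= code;        /* just goes negative */
    return c;
}
```

If a transfer ever reported 12 bytes against a 10-byte iovec, iov_len would wrap to SIZE_MAX - 1 while uio_resid would drop to -2. In the loop as quoted, a non-positive uio_resid ends the loop, so an underflow would probably only keep it spinning if uio_resid stayed positive (for example with more than one iovec outstanding).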

I'm going to strip out my existing instrumentation and add some more to
this loop, but if you think of anything useful here, fire me a note,
because I'm able to reproduce this very reliably now on some machines.
(For some strange reason, I cannot reproduce it on a VMware ESX virtual
machine, even though another VMware ESX box is hitting this problem
repeatedly.)

If nothing else, I'm going to brute force the damn thing with a counter
and exit after too many loops. 
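
A rough userspace sketch of that brute-force guard, for discussion only. Everything prefixed sim_ and the xfer_* callbacks are hypothetical stand-ins (the callback plays the role of FOP_READ/FOP_WRITE), the cap value is arbitrary, and returning EIO is just one possible choice. It also shows one case worth checking for: if the transfer returns 0, none of the counters change, so the loop as quoted would come back to B indefinitely. Whether that is what is actually happening here is not established.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SIM_MAX_PASSES 1000   /* hypothetical cap; tune as needed */

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct sim_iovec { size_t iov_len; };
struct sim_uio {
    struct sim_iovec *uio_iov;
    int uio_iovcnt;
    long uio_resid;
};

/* Hypothetical transfer callback: bytes moved, or a negative error. */
typedef long (*sim_xfer_fn)(size_t count);

/* Same loop shape as the excerpt, plus a pass counter that bails out. */
static int sim_rdwr(struct sim_uio *uiop, sim_xfer_fn xfer)
{
    long code = 0;
    int passes = 0;

    assert(uiop != NULL && xfer != NULL);
    while (code == 0 && uiop->uio_resid > 0 && uiop->uio_iovcnt > 0) {
        struct sim_iovec *iov = uiop->uio_iov;
        size_t count = iov->iov_len;

        if (count == 0) {
            uiop->uio_iov++;
            uiop->uio_iovcnt--;
            continue;
        }

        if (++passes > SIM_MAX_PASSES) {
            code = EIO;             /* too many passes: give up */
            break;
        }

        code = xfer(count);
        if (code < 0) {
            code = -code;
            break;
        }

        iov->iov_len -= (size_t)code;
        uiop->uio_resid -= code;
        code = 0;
    }
    return (int)code;
}

/* Moves everything requested: the loop finishes in one pass. */
static long xfer_all(size_t count) { return (long)count; }

/* Moves nothing: no counter advances, so only the cap stops the loop. */
static long xfer_none(size_t count) { (void)count; return 0; }
```

With xfer_all the sketch returns 0 after a single pass; with xfer_none it only stops because of the cap and returns EIO. Logging count, code and uio_resid at the point where the cap trips would show which of the cases is in play.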

-- Nathan

On Wed, Apr 28, 2004 at 05:56:06PM -0400, chas williams (contractor) wrote:
> i would strongly suggest not using -files 50000.  the default really
> should be big enough.  -files 50000 allocates a massive 6m array
> in the kernel.  you are making life difficult for linux.
> 
> as for something broken and spinning forever, i am going to blame
> tryflushdcachechildren but i dont have any proof as of yet.

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-4841
UMR Information Technology             Fax: (573) 341-4216