[OpenAFS-devel] Same problem in a different place.

Dan Lapine dlapine@ncsa.uiuc.edu
Wed, 30 Jan 2002 12:33:21 -0600 (CST)



On Tue, 29 Jan 2002, Matt Peterson wrote:

> Nickoali,
> 
> I was testing what I think is the latest code (osi_sleep.c rev 1.11) that 
> appears to have your change we've discussed on the list.
> 
> What I found is that there is still one more place where we see a CPU hog.  I 
> spent some time and tracked it all the way through a ton of code to rx_Read() 
> and eventually to CV_WAIT() in rx/LINUX/rx_kmutex.h.  
> 
> In rx/rx_rdwr.c:225 there is a while() loop that "waits" for an in-sequence 
> datagram.  It looks like the following:
> 
>           while (call->flags & RX_CALL_READER_WAIT) {
> #ifdef  RX_ENABLE_LOCKS
>               CV_WAIT(&call->cv_rq, &call->lock);
> #else
>               osi_rxSleep(&call->rq);
> #endif
>           }
> 
> This loop produces the same behavior as the one you just fixed in 
> LINUX/osi_sleep.c, in that any task that receives a signal while waiting for 
> receipt of a packet will loop very tightly.
> 
> As you can imagine, this behavior is extremely easy to reproduce and often 
> manifests itself when users expect the CTRL-C keystroke (or SIGINT) to 
> terminate a cp or ls command that is taking too long.

This is known as the "ls of death". If a user mistakenly does an 
"ls /afs" when the AFS servers span a large number of sites, Linux appears to 
hang. Response times from McMurdo Station (Antarctica) really bite. I 
think they use "tcp/ip over penguin". 

If you fine fellows can fix this for Linux, NCSA would appreciate it. Our 
current installation instructions for default users recommend removing all 
external sites from the CellServDB files to avoid the appearance of a system hang.
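
To make the failure mode concrete, here is a minimal user-space sketch of
the loop shape Matt quotes. It is not OpenAFS code: fake_call,
fake_cv_wait(), READER_WAIT and the max_spins cap are hypothetical
stand-ins, and it assumes the wait returns immediately, without sleeping,
whenever a signal is pending, which is the behavior he reports.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define READER_WAIT 0x1u      /* hypothetical stand-in for RX_CALL_READER_WAIT */

struct fake_call {
    unsigned flags;           /* READER_WAIT stays set until a packet arrives */
    bool signal_pending;      /* models a SIGINT queued for the waiting task */
    long wakeups;             /* how many times the wait has returned */
};

/* Models an interruptible wait (assumption, not the real CV_WAIT):
 * with a signal pending it returns EINTR at once and the condition is
 * unchanged; otherwise it pretends the packet arrived during the sleep. */
static int fake_cv_wait(struct fake_call *c)
{
    assert(c != NULL);
    c->wakeups++;
    if (c->signal_pending)
        return EINTR;
    c->flags &= ~READER_WAIT;
    return 0;
}

/* Same shape as the quoted loop: the result of the wait is ignored, so a
 * pending signal makes it spin. max_spins only exists so the demo ends. */
static long wait_ignoring_signal(struct fake_call *c, long max_spins)
{
    c->wakeups = 0;
    while ((c->flags & READER_WAIT) && c->wakeups < max_spins)
        (void)fake_cv_wait(c);
    return c->wakeups;
}

/* Variant that checks the result and hands the error back to the caller. */
static int wait_honoring_signal(struct fake_call *c)
{
    c->wakeups = 0;
    while (c->flags & READER_WAIT) {
        int rc = fake_cv_wait(c);
        if (rc != 0)
            return rc;        /* caller can now return an error to userland */
    }
    return 0;
}
```

In this model the first loop burns through every allowed iteration while
the signal is pending; the second gives up after a single wakeup and
reports EINTR. The sign convention for the error is arbitrary here.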


> 
> As a short term solution I'd recommend that you put the same fix you put in 
> LINUX/osi_sleep.c into the CV_WAIT() function in LINUX/rx_kmutex.h.
> 
> Still, a long term solution for both "signal during wait" problems needs 
> to be put into the code.  I should have a few cycles in the next few days and 
> would be happy to help.  
> 
> If you make the changes to add the additional parameter to the "sleep/wait" 
> functions as you proposed earlier
> 
>    int afs_osi_Sleep(char *event, int aintrok);
> 
> and now...?
> 
>    int CV_WAIT(afs_kcondvar_t *cv, afs_kmutex_t *l, int aintrok);
> 
> I can help scour the code for places that the parameters should be used so 
> that signals are handled in the appropriate places.  
> 
> Actually, since afs_osi_Sleep() and CV_WAIT() are used so frequently, it 
> might be easier to just make two new calls:
> 
>    int afs_osi_SleepHonorSig(char *event, int aintrok);
>    int CV_WAIT_HONORSIG(afs_kcondvar_t *cv, afs_kmutex_t *l, int aintrok);
> 
> This way you could #define them to the existing (non-signal calls) for 
> non-linux platforms.  Slowly as time and review permits you can replace the 
> old calls with the "HonorSig" calls in places where it is possible to return 
> the appropriate error code to userland.
> 
> 
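
For what it's worth, the "HonorSig" idea could look roughly like the
sketch below. Everything in it is made up for illustration: the demo_*
types and functions, DEMO_CV_WAIT_HONORSIG and
DEMO_HAVE_SIGNAL_AWARE_WAIT are hypothetical names, not the OpenAFS
definitions, and the waits are user-space stubs rather than real
condition variables. It only shows how one call site can compile against
either a signal-aware wait or the existing one.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for afs_kcondvar_t and afs_kmutex_t. */
typedef struct {
    int reader_wait;      /* nonzero while the reader must keep waiting */
    int signal_pending;   /* models a signal queued for the task */
    int waits;            /* number of waits performed */
} demo_cv_t;
typedef struct { int held; } demo_mutex_t;

/* Pretend this platform has a signal-aware wait (the Linux case). */
#define DEMO_HAVE_SIGNAL_AWARE_WAIT 1

/* Existing-style wait: no return value, so it cannot report a signal. */
void demo_cv_wait(demo_cv_t *cv, demo_mutex_t *l)
{
    assert(cv != NULL && l != NULL);
    cv->waits++;
    cv->reader_wait = 0;  /* stub: the awaited event always arrives */
}

/* Signal-aware wait: returns EINTR if interruption is allowed and a
 * signal is pending, 0 once the awaited event has arrived. */
int demo_cv_wait_sig(demo_cv_t *cv, demo_mutex_t *l, int aintrok)
{
    assert(cv != NULL && l != NULL);
    cv->waits++;
    if (aintrok && cv->signal_pending)
        return EINTR;
    cv->reader_wait = 0;
    return 0;
}

#ifdef DEMO_HAVE_SIGNAL_AWARE_WAIT
#define DEMO_CV_WAIT_HONORSIG(cv, l, aintrok) \
    demo_cv_wait_sig((cv), (l), (aintrok))
#else
/* Other platforms: fall back to the old call and always report success. */
#define DEMO_CV_WAIT_HONORSIG(cv, l, aintrok) \
    (demo_cv_wait((cv), (l)), 0)
#endif

/* A converted call site: it can pass the error up toward userland. */
int demo_read(demo_cv_t *cv, demo_mutex_t *l)
{
    while (cv->reader_wait) {
        int rc = DEMO_CV_WAIT_HONORSIG(cv, l, 1);
        if (rc != 0)
            return rc;
    }
    return 0;
}
```

Call sites that cannot return an error yet would simply keep using the
old wait until someone has reviewed them.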

-- 
---
Daniel LaPine, System Engineer
National Center for Supercomputing Applications (NCSA)
email: dlapine@ncsa.uiuc.edu
phone: 217-244-9294