[OpenAFS-devel] fileserver deadlocks after internal error in callback.c

Rainer Toebbicke rtb@pclella.cern.ch
Mon, 14 Apr 2003 14:34:11 +0200


This is a multi-part message in MIME format.
--------------040401050305060302090505
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit

If the (pthreaded) fileserver encounters an internal error in the callback
structures, it calls ShutDown() (see callback.c). The same happens in certain
situations in host.c.

In most cases ShutDown() is called with H_LOCK held, and it eventually calls
PrintCounters(). That in turn calls routines that acquire all sorts of locks,
in particular H_LOCK in h_GetWorkStats(), at which point the thread waits on
a lock it already holds and the fileserver deadlocks.
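
To illustrate the pattern, here is a minimal sketch, not fileserver code. All
demo_* names are hypothetical stand-ins (demo_host_lock for H_LOCK,
demo_get_work_stats() for h_GetWorkStats()). It assumes a POSIX
error-checking mutex so that the relock is reported as EDEADLK rather than
hanging; with a default mutex the second lock would typically block forever.

```c
#define _XOPEN_SOURCE 700   /* for PTHREAD_MUTEX_ERRORCHECK */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the fileserver's host lock (H_LOCK). */
static pthread_mutex_t demo_host_lock;

/* Create the lock as an error-checking mutex: a relock by the owning
 * thread then returns EDEADLK instead of blocking. */
static int demo_init(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(&demo_host_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Stand-in for h_GetWorkStats(): takes the host lock on its own. */
static int demo_get_work_stats(void)
{
    int rc = pthread_mutex_lock(&demo_host_lock);

    if (rc != 0)
        return rc;          /* EDEADLK if the caller already holds it */
    /* ... gather statistics here ... */
    pthread_mutex_unlock(&demo_host_lock);
    return 0;
}

/* Stand-in for the error path: the statistics code is reached while
 * the host lock is still held by the same thread. */
static int demo_shutdown_with_lock_held(void)
{
    int rc;

    if (demo_init() != 0)
        return -1;
    rc = pthread_mutex_lock(&demo_host_lock);
    assert(rc == 0);        /* first acquisition of a fresh mutex */
    rc = demo_get_work_stats();
    pthread_mutex_unlock(&demo_host_lock);
    pthread_mutex_destroy(&demo_host_lock);
    return rc;
}
```

Calling demo_shutdown_with_lock_held() returns EDEADLK, which marks the spot
where a thread using an ordinary mutex would stop making progress.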

Generally speaking, relying on the correctness of too many internal structures
is unhealthy once it has become obvious that something is wrong, to the point
that you're ready to give up completely.

The attached patch modifies callback.c to call ShutDownAndCore(PANIC) instead
of plain ShutDown(), *and* modifies viced.c to skip the PrintCounters() call
when 'dopanic' is set.
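
The shape of the change can be sketched as follows. This is an illustration
only: every demo_* name and DEMO_PANIC are hypothetical, the counters exist
just to make the behaviour observable, and the assumption that the plain
shutdown entry point forwards to the panic-aware one with the flag cleared is
mine, not something shown in the patch.

```c
#define DEMO_PANIC 1            /* hypothetical stand-in for PANIC */

static int demo_flush_calls;    /* how often the flush step ran */
static int demo_stats_calls;    /* how often statistics were printed */

/* Stand-in for DFlush(): assumed safe to run on the panic path. */
static void demo_flush(void)
{
    demo_flush_calls++;
}

/* Stand-in for PrintCounters(): in the real server this acquires
 * locks, which is why it must not run on the panic path. */
static void demo_print_counters(void)
{
    demo_stats_calls++;
}

/* Panic-aware shutdown: skip the lock-taking statistics when the
 * caller signals that internal state can no longer be trusted. */
static void demo_shutdown_and_core(int dopanic)
{
    demo_flush();
    if (!dopanic)
        demo_print_counters();
    /* ... the remaining shutdown steps would follow here ... */
}

/* Orderly shutdown: assumed to forward with the panic flag cleared. */
static void demo_shutdown(void)
{
    demo_shutdown_and_core(0);
}
```

An orderly demo_shutdown() still prints the statistics, while
demo_shutdown_and_core(DEMO_PANIC) leaves them out and so never touches the
locks that the failing code path may already hold.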

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke        http://cern.ch/~rtb         rtb@mail.cern.ch  O__
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland   > |
Phone: +41 22 767 8985       Fax: +41 22 767 7155                     ( )\( )

--------------040401050305060302090505
Content-Type: text/plain;
 name="p_PanicNoDeadlock"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="p_PanicNoDeadlock"

*** openafs/src/viced/viced.c.org	Mon Feb 10 10:31:51 2003
--- openafs/src/viced/viced.c	Fri Apr 11 10:21:36 2003
***************
*** 874,880 ****
      }
  #endif
      DFlush();
!     PrintCounters();
  
      /* do not allows new reqests to be served from now on, all new requests
         are returned with an error code of RX_RESTARTING ( transient failure ) */
--- 874,880 ----
      }
  #endif
      DFlush();
!     if (!dopanic) PrintCounters();
  
      /* do not allows new reqests to be served from now on, all new requests
         are returned with an error code of RX_RESTARTING ( transient failure ) */
*** openafs/src/viced/callback.c.org	Thu Mar 27 14:30:05 2003
--- openafs/src/viced/callback.c	Fri Apr 11 10:26:49 2003
***************
*** 443,449 ****
  	assert(0);
  	ViceLog(0,("CDel: Internal Error -- shutting down: wanted %d from %d, now at %d\n",cbi,fe->firstcb,*cbp));
  	DumpCallBackState();
! 	ShutDown();
        }
      }
      CDelPtr(fe, cbp);
--- 443,449 ----
  	assert(0);
  	ViceLog(0,("CDel: Internal Error -- shutting down: wanted %d from %d, now at %d\n",cbi,fe->firstcb,*cbp));
  	DumpCallBackState();
! 	ShutDownAndCore(PANIC);
        }
      }
      CDelPtr(fe, cbp);
***************
*** 493,499 ****
  	if (safety > cbstuff.nblks) {
  	  ViceLog(0,("FindCBPtr: Internal Error -- shutting down.\n"));
  	  DumpCallBackState();
! 	  ShutDown();
  	}
  	cb = itocb(*cbp);
  	if (cb->hhead == hostindex)
--- 493,499 ----
  	if (safety > cbstuff.nblks) {
  	  ViceLog(0,("FindCBPtr: Internal Error -- shutting down.\n"));
  	  DumpCallBackState();
! 	  ShutDownAndCore(PANIC);
  	}
  	cb = itocb(*cbp);
  	if (cb->hhead == hostindex)
***************
*** 696,702 ****
  	if (safety > cbstuff.nblks) {
  	  ViceLog(0,("AddCallBack1: Internal Error -- shutting down.\n"));
  	  DumpCallBackState();
! 	  ShutDown();
  	}
  	if (cb->hhead == h_htoi(host))
  	    break;
--- 696,702 ----
  	if (safety > cbstuff.nblks) {
  	  ViceLog(0,("AddCallBack1: Internal Error -- shutting down.\n"));
  	  DumpCallBackState();
! 	  ShutDownAndCore(PANIC);
  	}
  	if (cb->hhead == h_htoi(host))
  	    break;
***************
*** 1443,1449 ****
  		if (ntimedout > cbstuff.nblks) {
  		  ViceLog(0,("CCB: Internal Error -- shutting down...\n"));
  		  DumpCallBackState();
! 		  ShutDown();
  		}
  	    } while (cbi != *thead);
  	    *thead = 0;
--- 1443,1449 ----
  		if (ntimedout > cbstuff.nblks) {
  		  ViceLog(0,("CCB: Internal Error -- shutting down...\n"));
  		  DumpCallBackState();
! 		  ShutDownAndCore(PANIC);
  		}
  	    } while (cbi != *thead);
  	    *thead = 0;

--------------040401050305060302090505--