[OpenAFS] "stillborn client" in src/viced/host.c

Bill Stivers stiversb@ucsc.edu
Tue, 17 Oct 2006 14:11:32 -0700


Hey all:

Thanks much for the tremendous help you've provided to me and my  
cohorts.

I have another lame question.  I've been speaking to Joe about error  
information he's seeing on our AFS servers, and he noted one  
particularly odd one:

Tue Oct 17 13:15:59 2006 FindClient: stillborn client b3f528(b17753cc); conn b5b348 (host 128.114.104.230:7001) had client b4c030(b17753cc)
Mon Oct 16 19:27:28 2006 FindClient: stillborn client b3f7c8(2b1cc290); conn b450d0 (host 128.114.30.230:7001) had client 6d52f0(2b1cc290)

I looked at the code, and found the lines that are generating the  
message in src/viced/host.c, which are as follows:

     /* Now, tcon may already be set to a rock, since we blocked with no host
      * or client locks set above in pr_GetCPS (XXXX some locking is probably
      * required).  So, before setting the RPC's rock, we should disconnect
      * the RPC from the other client structure's rock.
      */
     oldClient = (struct client *)rx_GetSpecific(tcon, rxcon_client_key);
     if (oldClient && oldClient->tcon == tcon) {
         char hoststr[16];
         if (!oldClient->deleted) {
             /* if we didn't create it, it's not ours to put back */
             if (created) {
                 ViceLog(0, ("FindClient: stillborn client %x(%x); conn %x (host %s:%d) had client %x(%x)\n",
                             client, client->sid, tcon,
                             afs_inet_ntoa_r(rxr_HostOf(tcon), hoststr),
                             ntohs(rxr_PortOf(tcon)),
                             oldClient, oldClient->sid));
                 if ((client->ViceId != ANONYMOUSID) && client->CPS.prlist_val)
                     free(client->CPS.prlist_val);
                 client->CPS.prlist_val = NULL;
                 client->CPS.prlist_len = 0;
                 if (client->tcon) {
                     rx_SetSpecific(client->tcon, rxcon_client_key, (void *)0);
                 }
             }
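
For what it's worth, here is my own reading of that comment, written as
a small standalone sketch so that someone can tell me where I've got it
wrong.  This is NOT the viced code: every name in it (demo_conn,
demo_client, demo_find_client, demo_racing_thread) is made up, the
"rock" is just a pointer field, and the "blocking call" is simulated
with a callback instead of a real second thread.  The idea, as I
understand it, is that a new client structure gets built, something
else attaches a client to the same connection while we are blocked, and
the one we just built is then thrown away as "stillborn":

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these are the real viced/rx types. */
struct demo_conn;

struct demo_client {
    unsigned int sid;
    struct demo_conn *conn;      /* plays the role of client->tcon */
};

struct demo_conn {
    struct demo_client *rock;    /* plays the role of the rx "specific" slot */
};

/* Called while the lookup is "blocked"; stands in for another thread. */
typedef void (*demo_hook)(struct demo_conn *conn, unsigned int sid);

int demo_stillborn_count = 0;

struct demo_client *demo_new_client(struct demo_conn *conn, unsigned int sid)
{
    struct demo_client *c = malloc(sizeof *c);
    assert(c != NULL);
    c->sid = sid;
    c->conn = conn;
    return c;
}

/* Simulates a second lookup on the same connection winning the race. */
void demo_racing_thread(struct demo_conn *conn, unsigned int sid)
{
    if (conn->rock == NULL)
        conn->rock = demo_new_client(conn, sid);
}

struct demo_client *demo_find_client(struct demo_conn *conn, unsigned int sid,
                                     demo_hook while_blocked)
{
    struct demo_client *mine, *old;

    if (conn->rock != NULL)      /* connection already has a client */
        return conn->rock;

    mine = demo_new_client(conn, sid);   /* like created = 1 */

    /* Stand-in for a call that blocks with no locks held. */
    if (while_blocked != NULL)
        while_blocked(conn, sid);

    /* Re-check the rock before setting it. */
    old = conn->rock;
    if (old != NULL && old->conn == conn) {
        fprintf(stderr, "stillborn client %p(%x); conn %p had client %p(%x)\n",
                (void *)mine, mine->sid, (void *)conn, (void *)old, old->sid);
        demo_stillborn_count++;
        free(mine);              /* we created it, so we put it back */
        return old;
    }

    conn->rock = mine;
    return mine;
}
```

Calling demo_find_client() with demo_racing_thread as the callback
takes the "stillborn" path once; calling it with NULL does not.  If
that is roughly what the real code is doing, the message would be a
sign of two requests racing on one connection rather than of anything
being corrupted, but that is only my guess.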


Can someone who knows the codebase well shed some light on what's
going on?  Is this another one of those "you have OpenAFS in part of
your infrastructure and Transarc in part of it" issues?  Or is this,
perhaps, part of the locking code?

I'm trying to do due diligence to make sure my clients aren't  
partially to blame for some of the things that our server  
administrators are fixing now, and this is part of that effort.

Any information/advice would be appreciated.

---
Bill Stivers
IC Unix Lab and Systems Administrator
University of California at Santa Cruz
stiversb@ucsc.edu
v) 831-459-2472
f) 831-459-2914