[OpenAFS] interpreting cmdebug output of locked entries

Jonathan Nilsson jnilsson@uci.edu
Wed, 2 Nov 2011 11:04:59 -0700


--0016e659f09aa7d2d504b0c450f0
Content-Type: text/plain; charset=ISO-8859-1

Hello list,

I've got a web server that is serving php (drupal) sites from various AFS
volumes, and was running fine with an uptime of 160 days until a few days
ago. It has locked up twice in the past two days.  After a hard reboot, it
does come back up okay and ran for slightly longer than 24 hours without
incident before hanging again. Although I don't know for sure that AFS is
the problem, I am seeing some indications (references to max execution time
exceeded on php files in /afs/...).

CentOS 5.6
Kernel 2.6.18-238.12.1.el5 i386
OpenAFS RPMs 1.4.14-el5.1.1 installed from
dl.openafs.org/dl/1.4.14/rhel5/i386/

I realize this system is fairly out of date, so I will try to schedule
downtime for an update soon (to CentOS 5.7 with OpenAFS 1.6.0). But until
then, I was hoping to try to interpret what may be happening now.

The symptoms are that the server becomes nearly completely unresponsive (no
reply over port 80 or 22, and the console is completely unresponsive).
However, "cmdebug web1 7001" did return:

Lock afs_xdcache status: (writer_waiting, write_locked(pid:3700 at:617), 1
waiters)
** Cache entry @ 0xea61a4c0 for 2.536870959.1.1 [ss2k.uci.edu]
    locks: (none_waiting, 1 read_locks(pid:22600))
            2048 bytes  DV           57  refcnt     2
    callback 00000000 expires 1320206924
    0 opens 0 writers
    volume root
    states (0x4), read-only

Unfortunately, I do not have a process list, so I don't know what pid 3700
is, but most likely it is httpd I assume. Now, trying to determine the file
that this cache entry refers to, I found this page from IBM about using
"calcinode" to translate the volume id and vnode number into an inode
number.
http://www-01.ibm.com/support/docview.wss?uid=swg21112954

$ calcinode 536870959.1
volume ID       = 536870959
vnode number    = 1
inode number    = 3080193

But unfortunately "ls -lRi | grep 3080193" didn't return anything.

Anyone else know how to interpret the output of cmdebug? Or anything else I
should be looking for?

Thanks so much!

-- 
Jonathan.Nilsson at uci dot edu
Social Sciences Computing Services
SSPB 1265 | 949.824.1536

--0016e659f09aa7d2d504b0c450f0
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hello list,<div><br></div><div>I&#39;ve got a web server that is serving ph=
p (drupal) sites from various AFS volumes, and was running fine with an upt=
ime of 160 days until a few days ago. It has locked up twice in the past tw=
o days. =A0After a hard reboot, it does come back up okay and ran for sligh=
tly longer than 24 hours without incident before hanging again. Although I =
don&#39;t know for sure that AFS is the problem, I am seeing some indicatio=
ns (references to max execution time exceeded on php files in /afs/...).</d=
iv>

<div><br></div><div>CentOS 5.6</div><div>Kernel 2.6.18-238.12.1.el5 i386</d=
iv><div>OpenAFS RPMs 1.4.14-el5.1.1 installed from <a href=3D"http://dl.ope=
nafs.org/dl/1.4.14/rhel5/i386/">dl.openafs.org/dl/1.4.14/rhel5/i386/</a></d=
iv>

<div><br></div><div>I realize this system is fairly out of date, so I will =
try to schedule downtime for an update soon (to CentOS 5.7 with OpenAFS 1.6=
.0). But until then, I was hoping to try to interpret what may be happening=
 now.</div>

<div><br></div><div>The symptoms are that the server becomes nearly complet=
ely unresponsive (no reply over port 80 or 22, and the console is completel=
y unresponsive). However, &quot;cmdebug web1 7001&quot; did return:</div>

<div><br></div><div><div>Lock afs_xdcache status: (writer_waiting, write_lo=
cked(pid:3700 at:617), 1 waiters)</div><div>** Cache entry @ 0xea61a4c0 for=
 2.536870959.1.1 [<a href=3D"http://ss2k.uci.edu">ss2k.uci.edu</a>]</div>

<div>=A0 =A0 locks: (none_waiting, 1 read_locks(pid:22600))</div><div>=A0 =
=A0 =A0 =A0 =A0 =A0 2048 bytes =A0DV =A0 =A0 =A0 =A0 =A0 57 =A0refcnt =A0 =
=A0 2</div><div>=A0 =A0 callback 00000000<span class=3D"Apple-tab-span" sty=
le=3D"white-space:pre">	</span>expires 1320206924</div>

<div>=A0 =A0 0 opens<span class=3D"Apple-tab-span" style=3D"white-space:pre=
">	</span>0 writers</div><div>=A0 =A0 volume root</div><div>=A0 =A0 states =
(0x4), read-only</div></div><div><br></div><div>Unfortunately, I do not hav=
e a process list, so I don&#39;t know what pid 3700 is, but most likely it =
is httpd I assume. Now, trying to determine the file that this cache entry =
refers to,=A0I found this page from IBM about using &quot;calcinode&quot; t=
o translate the volume id and vnode number into an inode number.</div>

<div><a href=3D"http://www-01.ibm.com/support/docview.wss?uid=3Dswg21112954=
">http://www-01.ibm.com/support/docview.wss?uid=3Dswg21112954</a></div><div=
><br></div><div>$ calcinode 536870959.1</div><div><div>volume ID =A0 =A0 =
=A0 =3D 536870959</div>

<div>vnode number =A0 =A0=3D 1</div><div>inode number =A0 =A0=3D 3080193</d=
iv></div><div><br></div><div>But unfortunately &quot;ls -lRi | grep 3080193=
&quot; didn&#39;t return anything.</div><div><br></div><div>Anyone else kno=
w how to interpret the output of cmdebug? Or anything else I should be look=
ing for?</div>

<div><br></div><div>Thanks so much!</div><div><div><br></div>-- <br><div><f=
ont face=3D"&#39;courier new&#39;, monospace">Jonathan.Nilsson at uci dot e=
du</font></div><div><font face=3D"&#39;courier new&#39;, monospace">Social =
Sciences Computing Services</font></div>

<div><span style=3D"font-family:&#39;courier new&#39;, monospace">SSPB 1265=
 | 949.824.1536</span></div><br>
</div>

--0016e659f09aa7d2d504b0c450f0--