[OpenAFS] /afs area is hanging

Mark Henry mark.henry@infoprint.com
Tue, 12 May 2009 16:05:18 -0600



> yes, that's very interesting. you've probably told us what we need to.
> you got an oops in the thread holding the lock, the machine will never
> recover. what filesystem is behind your afs cache?

We use a virtual ext3 filesystem mounted on a loopback device.  We have 
done this with many afs clients and it seems to be working.  Here is our 
fstab entry:

/AFSvirtualFS     /usr/vice/cache   ext3 defaults,loop=/dev/loop0 0 0

Mark Henry
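
For readers following along, the fstab entry above can be read field by field. The following is only a minimal sketch in Python, not anything from the OpenAFS tools; the helper names parse_fstab_line and loop_device are hypothetical and exist purely for illustration. It pulls out the filesystem type and the loop device that back /usr/vice/cache.

```python
from typing import List, NamedTuple, Optional


class FstabEntry(NamedTuple):
    device: str
    mount_point: str
    fs_type: str
    options: List[str]


def parse_fstab_line(line: str) -> FstabEntry:
    """Split one fstab line into its first four whitespace-separated fields."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 fields in an fstab line")
    device, mount_point, fs_type, opts = fields[:4]
    return FstabEntry(device, mount_point, fs_type, opts.split(","))


def loop_device(entry: FstabEntry) -> Optional[str]:
    """Return the device named by a loop=... option, or None if absent."""
    for opt in entry.options:
        if opt.startswith("loop="):
            return opt.split("=", 1)[1]
    return None


# The entry quoted in the message above.
entry = parse_fstab_line(
    "/AFSvirtualFS     /usr/vice/cache   ext3 defaults,loop=/dev/loop0 0 0"
)
print(entry.mount_point, entry.fs_type, loop_device(entry))
```

Here the cache directory is an ext3 filesystem held in the file /AFSvirtualFS and attached through /dev/loop0, which is the detail the question about the cache filesystem was after.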




Derrick Brashear <shadow@gmail.com>
05/12/2009 02:58 PM

To: Mark Henry <mark.henry@infoprint.com>
cc: Felix Frank <Felix.Frank@desy.de>, openafs-info@openafs.org, Simon Wilkinson <sxw@inf.ed.ac.uk>
Subject: Re: [OpenAFS] /afs area is hanging

On Tue, May 12, 2009 at 4:37 PM, Mark Henry <mark.henry@infoprint.com> wrote:
>
>> I took the liberty to paste the interesting parts to
>> http://pastebin.com/m53578cd5. Notice the bottom, which was the original
>> bottom as well. Mark, you've been asked to look at dmesg before this, so I
>> suppose this didn't happen before you tried this call-tracing?
>
>> Besides, it would be interesting if an upgrade to 1.4.10 makes the problem
>> go away. Can you try that?
>
> We upgraded to 1.4.10.  We got the same errors.  Here is the cmdebug output:
>
> => cmdebug HOSTNAME
> ** Cache entry @ 0x172d1880 for 2.536870937.28.1820 [afs.dev.infoprint.com]
>     locks: (none_waiting, write_locked(pid:-246839824 at:681))
>                3 bytes  DV            0  refcnt     2
>     callback 22174d40   expires 1242158668
>     0 opens     0 writers
>     mount point
>     states (0x5), stat'd, read-only

That lock is held in afs_HandleLink, called from afs_lookup. Are you using
memcache or disk cache? (The two have different implementations of that
function.)
>
> The volume # 536870937 above just happens to be root.cell but it has been
> different based on which dir I do the ls in.
>
>> Finally, it looks like ls and sshd got locked up trying to determine your
>> client machine's home cell. I can see that happening (only) if no cell has
>> been set at that point. The output of fs wscell would be interesting in
>> this situation, but I'm not sure whether that would lock up as well (and if
>> it is at all helpful).
>
> We tried the fs wscell command.  It worked fine if the fs command was local
> and hung if the fs was being retrieved from afs.

well, if afs is unhappy, one presumes afs is unhappy.

> Also, here is a bit of interesting output from dmesg when the system is
> hung:

ah, that'd be disk cache.

> AssertProcessEntry: pohm_main, pid=6518
> openafs: Can't open inode 95550

yes, that's very interesting. you've probably told us what we need to.
you got an oops in the thread holding the lock, the machine will never
recover. what filesystem is behind your afs cache?

> ------------[ cut here ]------------
> kernel BUG at
> /compile/openafs-1.4.10/src/libafs/MODLOAD-2.6.22.5-31-default-MP/osi_file.c:87!
> invalid opcode: 0000 [1] SMP
> last sysfs file: /class/scsi_host/host0/model
> CPU 6
> Modules linked in: tun iptable_filter ip_tables x_tables amk ipv6 libafs(P)
> microcode firmware_class usbhid hid ff_memless tp_dd af_packet apparmor ext2
> loop dm_mod parport_pc parport bnx2 rtc_cmos rtc_core i2c_i801 rtc_lib
> ide_cd i2c_core cdrom shpchp tg3 pci_hotplug container button sg ehci_hcd
> uhci_hcd usbcore sd_mod ata_piix libata edd ext3 mbcache jbd fan aacraid
> scsi_mod piix ide_core thermal processor
> Pid: 6519, comm: ASCIIMast Tainted: P      N 2.6.22.5-31-default #1
> RIP: 0010:[<ffffffff8834dde9>]  [<ffffffff8834dde9>]
> :libafs:osi_UFSOpen+0x155/0x1f2
> RSP: 0000:ffff81031de299c8  EFLAGS: 00010296
> RAX: 0000000000000023 RBX: ffff81031a92b418 RCX: 0000000000000001
> RDX: ffffffff804bdfe8 RSI: 0000000000000096 RDI: ffffffff804bdfe0
> RBP: ffff81042145f000 R08: 0000000000000001 R09: ffff810001086bc0
> R10: 0000000000000046 R11: ffff81042fa74ec0 R12: ffff81042af37000
> R13: 000000000001753e R14: 000000000001753e R15: 0000000000000003
> FS:  0000000000000000(0000) GS:ffff81042ee9f1c0(0000) knlGS:0000000000000000
> CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
> CR2: 00000000f5d6050c CR3: 000000039e0b9000 CR4: 00000000000006e0
> Process ASCIIMast (pid: 6519, threadinfo ffff81031de28000, task
> ffff81031ad770c0)
> Stack:  ffffc2000770eb90 ffffc2000770caa8 ffff81042a631000 ffff8104172d1880
> ffff81039800839c ffffffff8832f678 0000000300000000 0000000000000003
> 0000000000000000 ffffc2000770eb90 ffff8104172d1880 0000000000000000
> Call Trace:
> [<ffffffff8832f678>] :libafs:afs_UFSHandleLink+0xf7/0x1bd
> [<ffffffff8832aadf>] :libafs:afs_lookup+0xbb2/0x115f
> [<ffffffff88352713>] :libafs:afs_linux_dentry_revalidate+0x422/0x434
> [<ffffffff883521ac>] :libafs:afs_linux_lookup+0x85/0x1ca
> [<ffffffff883188c7>] :libafs:PagInCred+0x30/0xa9
> [<ffffffff8028f972>] do_lookup+0xc4/0x1ae
> [<ffffffff8029179a>] __link_path_walk+0x36c/0xd8b
> [<ffffffff80299115>] dput+0x26/0x115
> [<ffffffff80292066>] __link_path_walk+0xc38/0xd8b
> [<ffffffff80292211>] link_path_walk+0x58/0xe0
> [<ffffffff802877e9>] do_filp_open+0x1c/0x3d
> [<ffffffff80292589>] do_path_lookup+0x1ab/0x227
> [<ffffffff80292fb9>] __path_lookup_intent_open+0x56/0x97
> [<ffffffff80293148>] open_namei+0x7a/0x674
> [<ffffffff802877e9>] do_filp_open+0x1c/0x3d
> [<ffffffff8028784e>] do_sys_open+0x44/0xc1
> [<ffffffff80220bb2>] ia32_sysret+0x0/0xa
>
>
> We still can't seem to get this system to stop hanging in the /afs area.
>
> Mark Henry
>
>
> _____________________________________________________________________________
> "This message and any attachments are solely for the intended recipient and
> may contain confidential or privileged information. If you are not the
> intended recipient, any disclosure, copying, use, or distribution of the
> information included in this message and any attachments is prohibited. If
> you have received this communication in error, please notify us by reply
> e-mail and immediately and permanently delete this message and any
> attachments. Thank you."
> _____________________________________________________________________________
>



-- 
Derrick
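
One detail in the quoted oops can be cross-checked against the "Can't open inode" message: R13 and R14 in the register dump both hold 0x1753e, which is 95550 in decimal. The short Python check below only does that arithmetic on values copied from the dmesg output; reading it as "those registers held the inode number being opened" is an assumption, not something the thread confirms.

```python
# Values copied from the quoted dmesg output.
inode_from_message = 95550            # "openafs: Can't open inode 95550"
r13 = int("000000000001753e", 16)     # R13 from the register dump
r14 = int("000000000001753e", 16)     # R14 from the register dump

# Compare the register contents with the inode number in the log line.
print(r13, r13 == inode_from_message, r14 == inode_from_message)
```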


