[OpenAFS] Re: rough timeline for 1.4.x

Joseph Kiniry kiniry@acm.org
Tue, 21 Dec 2004 15:16:57 +0000


As I am in a situation similar to Jason's, and the OpenAFS developers
want feedback, here are my recent results.

I am attempting to use both the prebuilt Fedora Core 3 RPMs of OpenAFS  
1.3.76 as well as my own build to set up new servers here at UCD.  I'm  
running on the latest/greatest stock FC3 kernel (Linux correct  
2.6.9-1.681_FC3 #1 Thu Nov 18 15:10:10 EST 2004 i686 i686 i386  
GNU/Linux).  The systems are all P3s with big disks and moderate  
amounts of memory (512MB).

My previous mails to this list focused on bug fixes for various
segfaults in AFS binaries.  Those issues seem to be fixed in recent
releases, but I am still hitting new segfaults.

In particular, I just went through the full install process again and  
got as far as checking my mounts for my cell's root before experiencing  
my first segfault:

correct:~# fs examine /afs/
File /afs/ (536870912.1.1) contained in volume 536870912
Volume status for vid = 536870912 named root.afs
Current disk quota is 5000
Current blocks used are 4
The partition has 27299690 blocks available out of 27377624

correct:~# fs examine /afs/cs.ucd.ie
Segmentation fault

An strace reveals that the segfault happens during an ioctl call:
...
open("/root/.AFSSERVER", O_RDONLY)      = -1 ENOENT (No such file or  
directory)
open("/.AFSSERVER", O_RDONLY)           = -1 ENOENT (No such file or  
directory)
open("/proc/fs/openafs/afs_ioctl", O_RDWR) = 3
ioctl(3, CAPI_REGISTER or SNDCTL_COPR_LOAD <unfinished ...>
+++ killed by SIGSEGV +++
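
A note on the odd names in the ioctl line: strace appears to label the request from its type and number fields alone, so a request with type 'C' and number 1 would be printed as CAPI_REGISTER or SNDCTL_COPR_LOAD whichever driver owns the descriptor. The sketch below is illustrative only. It assumes the generic Linux ioctl layout used on i386, and the helper names and the 4-byte argument size are assumptions rather than values read from the trace.

```python
# Generic Linux ioctl request layout (as used on i386):
#   bits 0-7 number, bits 8-15 type, bits 16-29 argument size, bits 30-31 direction
NR_BITS, TYPE_BITS, SIZE_BITS = 8, 8, 14
TYPE_SHIFT = NR_BITS
SIZE_SHIFT = TYPE_SHIFT + TYPE_BITS
DIR_SHIFT = SIZE_SHIFT + SIZE_BITS
IOC_WRITE, IOC_READ = 1, 2


def ioc(direction, type_char, nr, size):
    """Pack the four fields into a request number, like the kernel's _IOC macro."""
    return (direction << DIR_SHIFT) | (size << SIZE_SHIFT) | (ord(type_char) << TYPE_SHIFT) | nr


def decode(request):
    """Split a request number back into its fields."""
    return {
        "dir": (request >> DIR_SHIFT) & 0x3,
        "type": chr((request >> TYPE_SHIFT) & 0xFF),
        "nr": request & 0xFF,
        "size": (request >> SIZE_SHIFT) & 0x3FFF,
    }


# Hypothetical request: write direction, type 'C', number 1, 4-byte argument.
req = ioc(IOC_WRITE, "C", 1, 4)
print(hex(req), decode(req))
```

Two requests that differ only in argument size decode to the same type and number, which would explain why a decoder keyed on those two fields cannot tell them apart.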

And /var/log/messages contains the following kernel fault:
Dec 21 15:02:32 correct kernel:  <1>Unable to handle kernel NULL  
pointer dereference at virtual address 00000018
Dec 21 15:02:32 correct kernel:  printing eip:
Dec 21 15:02:32 correct kernel: 021c55bc
Dec 21 15:02:32 correct kernel: *pde = 00000000
Dec 21 15:02:32 correct kernel: Oops: 0000 [#6]
Dec 21 15:02:32 correct kernel: Modules linked in: libafs(U) md5 ipv6  
iptable_filter ip_tables dm_mod button battery ac uhci_hcd  
snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer  
snd_page_alloc gameport snd_mpu401_uart snd_rawmidi snd_seq_device snd  
soundcore 3c59x floppy ext3 jbd
Dec 21 15:02:32 correct kernel: CPU:    0
Dec 21 15:02:32 correct kernel: EIP:    0060:[<021c55bc>]    Tainted: P  
   VLI
Dec 21 15:02:32 correct kernel: EFLAGS: 00010246   (2.6.9-1.681_FC3)
Dec 21 15:02:32 correct kernel: EIP is at inode_has_perm+0x38/0x54
Dec 21 15:02:32 correct kernel: eax: 00000000   ebx: 1f840db8   ecx:  
00000000   edx: 00000000
Dec 21 15:02:32 correct kernel: esi: 22be9000   edi: 1f840df0   ebp:  
04482d60   esp: 1f840db4
Dec 21 15:02:32 correct kernel: ds: 007b   es: 007b   ss: 0068
Dec 21 15:02:32 correct kernel: Process fs (pid: 2847,  
threadinfo=1f840000 task=05a3f330)
Dec 21 15:02:32 correct kernel: Stack: 00100000 00000001 00000000  
00000000 00000000 22be9000 00000000 00000000
Dec 21 15:02:32 correct kernel:        00000000 00000000 00000000  
00000000 00000000 00000000 00000000 05a36090
Dec 21 15:02:32 correct kernel:        1f840dfc 0235f460 00000000  
22be9000 00000001 021c71a5 00000000 0235e720
Dec 21 15:02:32 correct kernel: Call Trace:
Dec 21 15:02:32 correct kernel:  [<021c71a5>]  
selinux_inode_permission+0x9d/0xa2
Dec 21 15:02:32 correct kernel:  [<021750ce>] permission+0x41/0x46
Dec 21 15:02:32 correct kernel:  [<0217595b>]  
link_path_walk+0x120/0x1009
Dec 21 15:02:32 correct kernel:  [<0215ef01>]  
copy_str_fromuser_size+0x3d/0x56
Dec 21 15:02:32 correct kernel:  [<02176abf>] path_lookup+0xff/0x12f
Dec 21 15:02:32 correct kernel:  [<02176c03>] __user_walk+0x21/0x51
Dec 21 15:02:32 correct kernel:  [<22bb9abd>] osi_lookupname+0x21/0x77  
[libafs]
Dec 21 15:02:32 correct kernel:  [<22bc28d6>]  
afs_syscall_pioctl+0x86/0xf8 [libafs]
Dec 21 15:02:32 correct kernel:  [<22bbfe07>] afs_syscall+0x16d/0x2b5  
[libafs]
Dec 21 15:02:32 correct kernel:  [<0215222e>] follow_page_pte+0xec/0xfd
Dec 21 15:02:32 correct kernel:  [<22bba47e>] afs_ioctl+0x41/0x4f  
[libafs]
Dec 21 15:02:32 correct kernel:  [<0217a4f6>] file_ioctl+0xf2/0x105
Dec 21 15:02:32 correct kernel:  [<0217a77f>] sys_ioctl+0x276/0x337
Dec 21 15:02:32 correct kernel: Code: <3>Debug: sleeping function  
called from invalid context at include/linux/rwsem.h:43
Dec 21 15:02:32 correct kernel: in_atomic():0[expected: 0],  
irqs_disabled():1
Dec 21 15:02:32 correct kernel:  [<0211cbcb>] __might_sleep+0x7d/0x8a
Dec 21 15:02:32 correct kernel:  [<0215e726>] rw_vm+0x20e/0x47a
Dec 21 15:02:32 correct kernel:  [<021c5591>] inode_has_perm+0xd/0x54
Dec 21 15:02:32 correct kernel:  [<021c5591>] inode_has_perm+0xd/0x54
Dec 21 15:02:32 correct kernel:  [<0215ee70>] get_user_size+0x30/0x57
Dec 21 15:02:32 correct kernel:  [<021c5591>] inode_has_perm+0xd/0x54
Dec 21 15:02:32 correct kernel:  [<0210682b>] show_registers+0x109/0x15e
Dec 21 15:02:32 correct kernel:  [<02106a2f>] die+0x14a/0x241
Dec 21 15:02:32 correct kernel:  [<0211937e>] do_page_fault+0x0/0x511
Dec 21 15:02:32 correct kernel:  [<0211937e>] do_page_fault+0x0/0x511
Dec 21 15:02:32 correct kernel:  [<02119733>] do_page_fault+0x3b5/0x511
Dec 21 15:02:32 correct kernel:  [<021c55bc>] inode_has_perm+0x38/0x54
Dec 21 15:02:32 correct kernel:  [<021c3f3a>]  
avc_has_perm_noaudit+0x8d/0xda
Dec 21 15:02:32 correct kernel:  [<02185f20>] update_atime+0x4d/0x90
Dec 21 15:02:32 correct kernel:  [<021c3f3a>]  
avc_has_perm_noaudit+0x8d/0xda
Dec 21 15:02:32 correct kernel:  [<22ba27f5>] afs_GetVolume+0x19/0x51  
[libafs]
Dec 21 15:02:32 correct kernel:  [<22b9512b>]  
afs_CopyOutAttrs+0x1df/0x1e5 [libafs]
Dec 21 15:02:32 correct kernel:  [<0211937e>] do_page_fault+0x0/0x511
Dec 21 15:02:32 correct kernel:  [<021c55bc>] inode_has_perm+0x38/0x54
Dec 21 15:02:32 correct kernel:  [<021c71a5>]  
selinux_inode_permission+0x9d/0xa2
Dec 21 15:02:32 correct kernel:  [<021750ce>] permission+0x41/0x46
Dec 21 15:02:32 correct kernel:  [<0217595b>]  
link_path_walk+0x120/0x1009
Dec 21 15:02:32 correct kernel:  [<0215ef01>]  
copy_str_fromuser_size+0x3d/0x56
Dec 21 15:02:32 correct kernel:  [<02176abf>] path_lookup+0xff/0x12f
Dec 21 15:02:32 correct kernel:  [<02176c03>] __user_walk+0x21/0x51
Dec 21 15:02:32 correct kernel:  [<22bb9abd>] osi_lookupname+0x21/0x77  
[libafs]
Dec 21 15:02:32 correct kernel:  [<22bc28d6>]  
afs_syscall_pioctl+0x86/0xf8 [libafs]
Dec 21 15:02:32 correct kernel:  [<22bbfe07>] afs_syscall+0x16d/0x2b5  
[libafs]
Dec 21 15:02:32 correct kernel:  [<0215222e>] follow_page_pte+0xec/0xfd
Dec 21 15:02:32 correct kernel:  [<22bba47e>] afs_ioctl+0x41/0x4f  
[libafs]
Dec 21 15:02:32 correct kernel:  [<0217a4f6>] file_ioctl+0xf2/0x105
Dec 21 15:02:32 correct kernel:  [<0217a77f>] sys_ioctl+0x276/0x337
Dec 21 15:02:32 correct kernel:  Bad EIP value.
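
The log above is long and partly repeated, since a second trace seems to be printed while the first fault is still being reported. A minimal sketch, assuming the log text is available as a string, that pulls out just the symbol and module of each frame. The regex and the helper names `extract_frames` and `first_frame_in` are illustrative, not part of any OpenAFS or kernel tool, and the pattern tolerates the line wrapping introduced by the mail client.

```python
import re

# One frame looks like: [<22bb9abd>] osi_lookupname+0x21/0x77 [libafs]
# The module suffix is optional, and wrapping may put a newline between parts.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{8})>\]\s+"            # return address
    r"(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"    # symbol+offset/size
    r"(?:\s+\[(\w+)\])?"                 # optional [module]
)


def extract_frames(log_text):
    """Return (symbol, module) pairs in the order they appear in the log."""
    return [(m.group(2), m.group(3) or "kernel") for m in FRAME_RE.finditer(log_text)]


def first_frame_in(frames, module):
    """Return the first symbol that belongs to the given module, or None."""
    for symbol, mod in frames:
        if mod == module:
            return symbol
    return None


# A few lines copied from the trace, wrapping included.
sample = """\
Dec 21 15:02:32 correct kernel:  [<021c71a5>]
selinux_inode_permission+0x9d/0xa2
Dec 21 15:02:32 correct kernel:  [<02176c03>] __user_walk+0x21/0x51
Dec 21 15:02:32 correct kernel:  [<22bb9abd>] osi_lookupname+0x21/0x77
[libafs]
"""

frames = extract_frames(sample)
for symbol, module in frames:
    print(f"{module:8s} {symbol}")
```

Read this way, the trace shows sys_ioctl entering libafs, libafs starting a path lookup via osi_lookupname, and the fault landing in the SELinux permission check (inode_has_perm) during that lookup.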

On 10 Dec, 2004, at 21:37, Jason McCormick wrote:

> --On Monday, December 06, 2004 07:48:52 PM +0100 Jeffrey Altman
> <jaltman@columbia.edu> wrote:
>
>> The thing which is preventing the release of 1.3.7x as a stable 1.4
>> tree is lack of deployment and testing by users.  There has been very
>> little feedback, either positive or negative, on the existing releases.
>> Without this feedback it is very difficult for us to know whether or
>> not it is ready.
>
>   I'd been holding back our feedback because 1.3.75 was imminent and
> we thought some of the fixes listed might fix our problems.  We've done
> testing with 1.3.74 and 1.3.75.  The clients are all Fedora Core 3 w/
> patched kernels to provide sys_call_table[].  We are experiencing the
> following problems:
>
>
>   * Inability to unmount /usr/vice/cache (or / if it's not a separate
> partition).  This is 100% repeatable on all FC3 machines.  The following
> steps will always create this problem:
>
>       - Stop all processes and logout all users of AFS
>       - Stop all AFS processes and unload libafs kernel module
>       - lsof | grep -i afs reports nothing open
>       - umount /usr/vice/cache
>
> This will always result in an error that /usr/vice/cache is busy:
>
>       # umount /usr/vice/cache
>       umount: /usr/vice/cache: device is busy
>       umount: /usr/vice/cache: device is busy
>
>   * Accessing an AFS volume over our VPN results in an immediate kernel
> panic.  The panic message reports many "Unable to handle kernel NULL
> pointer dereference at virtual address" errors followed by "Recursive
> die() failure, output suppressed" and "<0>Kernel panic - not syncing:
> Fatal exception in interrupt".  This is present only on 1 of 2 laptops
> running FC3, but is 100% repeatable on the failing laptop.
>
>   * Copying large files (~450MB) into AFS from non-AFS partitions results
> in a kernel oops.  The error reported is:
>
>    rxi_Start: xmit list overflowed<1>Unable to handle kernel paging request
> at virtual address ffffffff
>
> This problem is also 100% repeatable.  'fs getcache' does not report that
> the cache is full.  I've attached a file gti-largefile-copy-oops.txt that
> is the "soft" kernel oops.
>
>   * Random cache consistency problems.  A file will be present in the
> filesystem and viewable on other machines but not on the FC3 host.  fs
> flush does not always solve this problem; however, another client
> operating on the same directory (e.g. touch hi) seems to unstick the
> client.  We do have one test case that seems to always generate this
> problem, but it's not very portable for others to test as it requires our
> internal package management software.  Rudy Maceyko is going to test this
> with 1.3.75 shortly.
>
>   These are our current problems with the 1.3.7x series.  We have not
> tested 1.3.7x on any other Linux release because we're focusing on moving
> forward with Fedora 3 and RHEL 4 preparations.  So I can't speak to these
> problems existing on, for example, FC1.
>
>   We are building the RPMs with a modified specfile.  We're working to
> merge our changes back into the mainline spec file and provide that to the
> community.  I've attached all of the patches we're applying to the source
> tree since they're all small.  Their descriptions are:
>
>   openafs-1.2.11-no_old_gid_t.patch - Support for AMD 64
>
>   openafs-1.2.11-res_search.patch - resolver patch
>
>   openafs-1.3.75-afskvers-autoconf-fix.patch - Fix --with-afs-system
>
>   26syscall.patch - Hard-sets the build process to use sys_call_table
>
>   afs.initd.patch - Removes modload logic in favor of symlinks
>                     to /lib/modules
>
>   openafs-krb5-2.0-afsconf.patch - Fixes call to afsconf_AddKey()
>                                    for afs-krb5
>
> I've held off reporting this for a little bit because I've not had time to
> properly test or debug any of these.  Let me know what we can do to further
> debug these problems.
>
> --  
> Jason McCormick
> CERT Infrastructure Team
> jasonmc@cert.org ** 412-268-7961
> <gti-largefile-copy-oops.txt>
> <26syscall.patch>
> <afs.initd.patch>
> <openafs-1.2.11-no_old_gid_t.patch>
> <openafs-1.2.11-res_search.patch>
> <openafs-1.3.74-admin_tools.klog.patch>
> <openafs-krb5-2.0-afsconf.patch>
> <openafs-1.3.75-afskvers-autoconf-fix.patch>