[OpenAFS-devel] question: binary interface to kernel module (RHEL6.2/6.3, openafs 1.6.1)?

Stephan Wiesand stephan.wiesand@desy.de
Thu, 30 Aug 2012 15:19:16 +0200


On Aug 29, 2012, at 19:12 , Marc Dionne <marc.c.dionne@gmail.com> wrote:

> On Wed, Aug 29, 2012 at 11:21 AM, Stephan Wiesand
> <stephan.wiesand@desy.de> wrote:
>> Hi All,
>>
>> this is just a question. I'm not asserting an openafs bug.
>>
>> Since SL6, we have been using "kABI tracking kmods" for installing
>> the OpenAFS kernel module on clients. For full information on this
>> mechanism, see http://people.redhat.com/jcm/el6/dup/docs/dup_book.pdf .
>> In short, you only have to compile and install the module once, and it
>> will be used with future kernels as long as it doesn't use parts of the
>> ABI that changed.
>>
>> Trying this may have been stupid in the first place. If so, happy
>> bashing :-)
>>
>> But in practice, it has worked perfectly for a long time. The modules
>> built against the EL6 GA kernel (2.6.32-71.el6) work fine with every
>> released kernel up to the latest EL6.2 kernels (2.6.32-220.23.1.el6),
>> on both 32-bit and 64-bit systems.
>>
>> But with the EL6.3 update (2.6.32-279.el6), something changed that
>> broke at least the interface to the 32-bit module. The symptoms are
>> reads getting stuck at the very beginning, except for very small files.
>> The reads can be interrupted, but the client can no longer be stopped
>> cleanly.
>
> Anything interesting in the syslog when this occurs?

No. When shutting down afterwards, it gets as far as "WARM shutting down
of: vcaches..." before umount and afsd get stuck in D state, but that's
all I see.

I fstraced a read getting stuck. This is what it looks like:

time 196.758223, pid 1608: Lookup adp 0xeb949300 name openafs.SLx-1.6.0-89.pre1.src.rpm fid (178:536891922.210.509), code=0
time 196.758632, pid 1608: Analyze RPC op 2 conn 0xebea2320 code 0x0 user 0x413f6096
time 196.758635, pid 1608: ProcessFS vp 0xe938ed40 old len (0x0, 0x0) new len (0x0, 0x11ed320)
time 196.758637, pid 1608: Getattr vp 0xe938ed40 len (0x0, 0x11ed320)
time 196.758991, pid 1608: Getattr vp 0xe938ed40 len (0x0, 0x11ed320)
time 196.759225, pid 1608: Access vp 0xe938acc0 mode 0x40 len (0x0, 0x800)
time 196.759229, pid 1608: Access vp 0xe938a040 mode 0x40 len (0x0, 0x11)
time 196.759231, pid 1608: GetdCache vp 0xe938a2c0 dcache 0xedc49000 dcache low-version 0x463c, vcache low-version 0x463c
time 196.759231, pid 1608: GetdCache tlen 0x800 flags 0x1 abyte (0x0, 0x0) Position (0x0, 0x0)
time 196.759232, pid 1608: Lookup adp 0xe938a2c0 name packages fid (178:536870916.2.17768), code=0
time 196.759233, pid 1608: Mount point is to vp 0xe938a540 fid (178:536870916.2.17768)
time 196.759235, pid 1608: Access vp 0xe938a7c0 mode 0x40 len (0x0, 0x800)
time 196.759236, pid 1608: Access vp 0xeb949d00 mode 0x40 len (0x0, 0x2000)
time 196.759237, pid 1608: Access vp 0xeb949800 mode 0x40 len (0x0, 0x12800)
time 196.759237, pid 1608: Access vp 0xeb949300 mode 0x40 len (0x0, 0x2800)
time 196.759238, pid 1608: Access vp 0xe938ed40 mode 0x100 len (0x0, 0x11ed320)
time 196.759241, pid 1608: Open 0xe938ed40 flags 0x8000
time 196.759242, pid 1608: Open 0xe938ed40 flags 0xf423f
time 196.759248, pid 1608: Getattr vp 0xe938ed40 len (0x0, 0x11ed320)
time 196.759394, pid 1608: Iread ip xe938ed40 pos (0x0, 0x0) count 0x8000 code 1869f

NB rxdebug, cmdebug, fs getcache etc. all still work.
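
To make the hex fields in the trace above easier to read, here is a small
Python sketch that decodes three of the entries. It rests on two
assumptions that the trace itself does not confirm: that each "(0x.., 0x..)"
pair is the high and low 32-bit word of one 64-bit value, and that the bare
"code 1869f" field is hexadecimal. The helper names are made up for this
illustration.

```python
import re

# Three entries copied verbatim from the trace above.
TRACE = """\
time 196.758635, pid 1608: ProcessFS vp 0xe938ed40 old len (0x0, 0x0) new len (0x0, 0x11ed320)
time 196.759242, pid 1608: Open 0xe938ed40 flags 0xf423f
time 196.759394, pid 1608: Iread ip xe938ed40 pos (0x0, 0x0) count 0x8000 code 1869f
"""

# Matches a "(0xHIGH, 0xLOW)" pair as printed in the trace.
PAIR = re.compile(r"\(0x([0-9a-f]+), 0x([0-9a-f]+)\)")


def pair_to_int(high_hex, low_hex):
    """Combine (high, low) hex words into one integer (assumed 32 bits each)."""
    return (int(high_hex, 16) << 32) | int(low_hex, 16)


def decode_pairs(line):
    """Return every (high, low) pair on a trace line as an integer."""
    return [pair_to_int(high, low) for high, low in PAIR.findall(line)]


lines = TRACE.splitlines()
file_len = decode_pairs(lines[0])[-1]              # "new len" of the file
open_flags = int(lines[1].rsplit(" ", 1)[1], 16)   # second Open entry
iread_count = int(re.search(r"count 0x([0-9a-f]+)", lines[2]).group(1), 16)
iread_code = int(lines[2].rsplit(" ", 1)[1], 16)   # assumed to be hex

print(file_len, open_flags, iread_count, iread_code)
# prints: 18797344 999999 32768 99999
```

Under those assumptions the file is 18797344 bytes long and the stuck read
asks for the first 32768 bytes. The values 999999 and 99999 are suspiciously
round in decimal, so they may be placeholder markers rather than real flags
or return codes, but that is only a guess from the numbers.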

>> Using a module built against the 6.3 kernel with pre-6.3 ones has
>> worse effects. BUGs, panics, spontaneous reboots.
>>
>> All this was only observed on 32-bit systems, and only if the cache
>> is on ext4. I have a suspicion that it might be related to a change
>> described here:
>> http://joejulian.name/blog/glusterfs-bit-by-ext4-structure-change/ .
>> Quote: << a patch against ext4 to "return 32/64-bit dir name hash
>> according to usage type". Prior to that, ext2/3/4 would return a 32-bit
>> hash value from telldir()/seekdir() [. . .] That patch was for kernel
>> v3.3-rc2. To make things more fun, [. . .] merged in that patch in
>> 2.6.32-268.el6 >>
>>
>> The direct link to the patch is
>> http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commit;h=d1f5273e9adb40724a85272f248f210dc4ce919a .
>>
>> Does anyone familiar with the openafs module's inner workings see
>> whether that patch would have the effects described above, on 32-bit
>> systems only?
>>
>> Thanks a lot in advance for any insights.
>>
>>        Stephan
>
> Offhand I don't see anything in that change that should affect
> openafs.  Within the kernel module each cache file is looked up
> individually with a full path that it receives from afsd - the
> directory scanning is done by afsd in user space using readdir.  The
> lookup returns a dentry which is then converted to a file handle by
> the underlying file system's own conversion function.  The file
> handles for all cache files are stored in memory by the module.  When
> a file is used, the file handle is converted to a dentry with the fs's
> conversion function, and the file is opened with dentry_open.

Thanks for the explanation.

> Any other changes to ext4 in that update?

Yes, quite a few. Alas, the patches are no longer available separately,
and practically all the BZs are private. But I'll run a diff later.

> Does the module work correctly on this system, with ext4, if it is
> recompiled?

Yes, if it is built against a 6.3 (-279) kernel. Rebuilding against an
old kernel with the current toolchain makes no difference.
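
For readers unfamiliar with kABI tracking, the compatibility idea behind it
can be sketched in Python as a comparison of symbol version checksums: a
prebuilt module is considered usable with a kernel if every symbol it
imports is exported with the same CRC. The input format assumed here
(a hex CRC followed by the symbol name, with optional extra columns) is
modelled on Module.symvers and on the output of
"modprobe --dump-modversions". The function names are hypothetical, and the
CRC values are made up for illustration.

```python
def parse_versions(text):
    """Parse 'CRC symbol [extra columns]' lines into a {symbol: crc} dict."""
    versions = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            versions[fields[1]] = int(fields[0], 16)
    return versions


def incompatible_symbols(module_needs, kernel_exports):
    """Return imported symbols whose CRC is missing or differs in the kernel."""
    return sorted(
        symbol
        for symbol, crc in module_needs.items()
        if kernel_exports.get(symbol) != crc
    )


# Hypothetical sample data: the CRC values below are invented.
MODULE_NEEDS = """\
0x11111111 dentry_open
0x22222222 iget_locked
"""
OLD_KERNEL = """\
0x11111111 dentry_open vmlinux EXPORT_SYMBOL
0x22222222 iget_locked vmlinux EXPORT_SYMBOL
"""
NEW_KERNEL = """\
0x11111111 dentry_open vmlinux EXPORT_SYMBOL
0x33333333 iget_locked vmlinux EXPORT_SYMBOL
"""

needs = parse_versions(MODULE_NEEDS)
print(incompatible_symbols(needs, parse_versions(OLD_KERNEL)))  # []
print(incompatible_symbols(needs, parse_versions(NEW_KERNEL)))  # ['iget_locked']
```

A check of this kind only compares recorded checksums. It cannot detect a
behavior change behind a symbol whose checksum stayed the same.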

It gets weirder: I can't reproduce the problem with an ext4 cache
filesystem created with mkfs.ext4 on the running system. Only with
filesystems created by the installer (only SL6.2 is confirmed so far).
fsck doesn't find anything wrong with the fs.

I guess it's an ext4 issue in EL6. But I'd still feel better if I
understood what's going on.

Thanks a lot for your help
	Stephan

-- 
Stephan Wiesand
DESY - DV -
Platanenallee 6
15732 Zeuthen, Germany