[OpenAFS-devel] Salvager (from openafs-server-1.6.5-1.el6.x86_64) segmentation fault

Harald Barth haba@kth.se
Tue, 13 Aug 2013 11:23:07 +0200 (CEST)


After our fileserver fell over, the salvager had to run, and it aborted with:

Program terminated with signal 6, Aborted.
#0  0x00007f2b3fe328a5 in raise () from /lib64/libc.so.6
64

gdb on the core gives:

(gdb) where
#0  0x00007f2b3fe328a5 in raise () from /lib64/libc.so.6
#1  0x00007f2b3fe34085 in abort () from /lib64/libc.so.6
#2  0x0000000000424851 in osi_Panic (
    msg=0x43ef88 "assertion failed: %s, file: %s, line: %d\n") at rx_user.c:251
#3  0x000000000042486e in osi_AssertFailU (
    expr=0xed1a <Address 0xed1a out of bounds>, 
    file=0x6 <Address 0x6 out of bounds>, line=-1) at rx_user.c:261
#4  0x000000000040a29b in SalvageVolume (salvinfo=0x7fffd0c150b0, 
    rwIsp=<value optimized out>, alinkH=0x17125b0) at vol-salvage.c:3986
#5  0x000000000040cb2d in DoSalvageVolumeGroup (
    salvinfo=<value optimized out>, isp=0x1710450, nVols=1)
    at vol-salvage.c:2092
#6  0x000000000040db85 in SalvageFileSys1 (partP=<value optimized out>, 
    singleVolumeNumber=0) at vol-salvage.c:937
#7  0x000000000040e1c5 in SalvageFileSysParallel (partP=0x16ebbe0)
    at vol-salvage.c:667
#8  0x000000000040ee2f in handleit (as=<value optimized out>, 
    arock=<value optimized out>) at ./salvager.c:375
#9  0x0000000000410687 in cmd_Dispatch (argc=7, argv=0x16e74b0) at cmd.c:905
#10 0x000000000040e9ce in main (argc=6, argv=0x7fffd0c15cc8)
    at ./salvager.c:534


(gdb) up
#1  0x00007f2b3fe34085 in abort () from /lib64/libc.so.6
(gdb) up
#2  0x0000000000424851 in osi_Panic (
    msg=0x43ef88 "assertion failed: %s, file: %s, line: %d\n") at rx_user.c:251
251         afs_abort();
(gdb) up
#3  0x000000000042486e in osi_AssertFailU (
    expr=0xed1a <Address 0xed1a out of bounds>, 
    file=0x6 <Address 0x6 out of bounds>, line=-1) at rx_user.c:261
261         osi_Panic("assertion failed: %s, file: %s, line: %d\n", expr,
(gdb) up
#4  0x000000000040a29b in SalvageVolume (salvinfo=0x7fffd0c150b0, 
    rwIsp=<value optimized out>, alinkH=0x17125b0) at vol-salvage.c:3986
3986                        osi_Assert(Delete(&dh, "..") == 0);
(gdb) list
3981                        SetSalvageDirHandle(&dh, vid, salvinfo->fileSysDevice,
3982                                            salvinfo->vnodeInfo[class].inodes[v],
3983                                            &salvinfo->VolumeChanged);
3984                        pa.Vnode = LFVnode;
3985                        pa.Unique = LFUnique;
3986                        osi_Assert(Delete(&dh, "..") == 0);
3987                        osi_Assert(Create(&dh, "..", &pa) == 0);
3988
3989                        /* The original parent's link count was decremented above.
3990                         * Here we increment the new parent's link count.
(gdb) 


I assume the salvager tries to delete the directory entry ".." and then create it anew.

It looks to me like FindItem(), called from Delete() in dir.c, came up empty-handed, so
Delete() returned ENOENT, the osi_Assert() failed, and the salvager aborted.

Do you think it's safe to change line 3986 to something less dramatic than an abort, or
do you have a better suggestion?
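
To make "less dramatic" concrete, here is a minimal, self-contained sketch of the
idea: treat ENOENT from the delete as "there was no .. entry to remove", log it, and
still create the new entry. This is not OpenAFS code. struct fake_dir, fake_delete(),
fake_create() and relink_dotdot() are hypothetical stand-ins for the directory handle
and for Delete()/Create(), and the stubs' return codes are my assumption. Whether this
is actually safe inside SalvageVolume() is exactly what I am asking.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a directory: holds at most one ".." entry. */
struct fake_dir {
    int has_dotdot;   /* 1 if a ".." entry currently exists */
    int parent_vnode; /* vnode number the ".." entry points at */
};

/* Stub for Delete(): assumed to return ENOENT when the entry is missing. */
static int fake_delete(struct fake_dir *d, const char *name)
{
    if (strcmp(name, "..") != 0 || !d->has_dotdot)
        return ENOENT;
    d->has_dotdot = 0;
    return 0;
}

/* Stub for Create(): assumed to return EEXIST when the entry already exists. */
static int fake_create(struct fake_dir *d, const char *name, int vnode)
{
    if (strcmp(name, "..") != 0)
        return EINVAL;
    if (d->has_dotdot)
        return EEXIST;
    d->has_dotdot = 1;
    d->parent_vnode = vnode;
    return 0;
}

/* Re-point ".." at new_parent. A missing ".." is logged and tolerated;
 * any other delete failure, or a create failure, is returned to the caller
 * instead of aborting the whole process. */
static int relink_dotdot(struct fake_dir *d, int new_parent)
{
    int code = fake_delete(d, "..");

    if (code == ENOENT)
        fprintf(stderr, "warning: no \"..\" entry to delete, creating one\n");
    else if (code != 0)
        return code;

    return fake_create(d, "..", new_parent);
}
```

With this shape the caller decides what to do with a non-zero return, for example
skipping the volume, rather than taking down the salvage of the whole partition.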

Harald.