[OpenAFS] Crash in volserver when restoring volume from backup.

Hartmut Reuter reuter@rzg.mpg.de
Wed, 26 Aug 2009 17:04:07 +0200


Anders Magnusson wrote:
> Hartmut Reuter wrote:
>> Anders Magnusson wrote:
>>  
>>> Hi,
>>>
>>> I have a problem that I need some advice on how to go on with.
>>>
>>> I have a volume dump file, but when trying to read it back volserver
>>> crashes.
>>>
>>> The dump was generated under 1.4.8, and the volserver segv appears with
>>> both 1.4.8 and 1.4.11.
>>>
>>> VolserLog.old says:
>>> Wed Aug 26 14:56:34 2009 Starting AFS Volserver 2.0
>>> (/usr/afs/bin/volserver -p 16)
>>> Wed Aug 26 15:00:12 2009 1 Volser: CreateVolume: volume 537998421
>>> (students.waqazi-4) created
>>>
>>> BosLog says:
>>> Wed Aug 26 15:00:14 2009: fs:vol exited on signal 11
>>>
>>> Any hints where to go from here?  I can provide the dump file on
>>> request, but since it's a
>>> student home directory I don't want it to be public.
>>>
>>> -- Ragge
>>>     
>>
>> Attach the volserver with gdb before running the command. Then you may
>> be able to see where it crashes and why.
>>
>>   
> Done:
> # gdb /usr/afs/bin/volserver
> GNU gdb Fedora (6.8-27.el5)
> Copyright (C) 2008 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...
> (gdb) symb
> Discard symbol table from `/usr/afs/bin/volserver'? (y or n) y
> No symbol file now.
> (gdb) help file
> Use FILE as program to be debugged.
> It is read for its symbols, for getting the contents of pure memory,
> and it is the program executed when you use the `run' command.
> If FILE cannot be found as specified, your execution directory path
> ($PATH) is searched for a command of that name.
> No arg means to have no executable file and no symbols.
> (gdb) help symb
> Load symbol table from executable file FILE.
> The `file' command can also load symbol tables, as well as setting the file
> to execute.
> (gdb) symb volserver.debug
> Reading symbols from /usr/afs/bin/volserver.debug...done.
> (gdb) attach 30720
> Attaching to program: /usr/afs/bin/volserver, process 30720
> Reading symbols from /lib64/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread 0x2ab6a9d3be90 (LWP 30720)]
> [New Thread 0x4dd02940 (LWP 30740)]
> [New Thread 0x4d301940 (LWP 30739)]
> [New Thread 0x4c900940 (LWP 30738)]
> [New Thread 0x4beff940 (LWP 30737)]
> [New Thread 0x4b4fe940 (LWP 30736)]
> [New Thread 0x4aafd940 (LWP 30735)]
> [New Thread 0x4a0fc940 (LWP 30734)]
> [New Thread 0x496fb940 (LWP 30733)]
> [New Thread 0x48cfa940 (LWP 30732)]
> [New Thread 0x482f9940 (LWP 30731)]
> [New Thread 0x478f8940 (LWP 30730)]
> [New Thread 0x46ef7940 (LWP 30729)]
> [New Thread 0x464f6940 (LWP 30728)]
> [New Thread 0x45af5940 (LWP 30727)]
> [New Thread 0x450f4940 (LWP 30726)]
> [New Thread 0x446f3940 (LWP 30725)]
> [New Thread 0x43cf2940 (LWP 30724)]
> [New Thread 0x432f1940 (LWP 30723)]
> [New Thread 0x428f0940 (LWP 30722)]
> [New Thread 0x41eef940 (LWP 30721)]
> Loaded symbols for /lib64/libpthread.so.0
> Reading symbols from /lib64/libresolv.so.2...done.
> Loaded symbols for /lib64/libresolv.so.2
> Reading symbols from /lib64/libc.so.6...done.
> Loaded symbols for /lib64/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x0000003c13c0a899 in pthread_cond_wait@@GLIBC_2.3.2 ()
>   from /lib64/libpthread.so.0
> (gdb) c
> Continuing.
> [New Thread 0x4e703940 (LWP 30748)]
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x4aafd940 (LWP 30735)]
> 0x0000003c13078d60 in strlen () from /lib64/libc.so.6
> (gdb) bt
> #0  0x0000003c13078d60 in strlen () from /lib64/libc.so.6
> #1  0x0000000000430092 in afs_vsnprintf (p=0x4aafc3ba "4BF+0", avail=999,
>    fmt=<value optimized out>, ap=0x4aafc7c0) at ../util/snprintf.c:395
> #2  0x0000000000416a60 in vFSLog (
>    format=0x467838 "1 Volser: ReadVnodes: IH_CREATE: %s - restore
> aborted\n",

This is the message the volserver wanted to write into the VolserLog.
Unfortunately afs_error_message(errno) didn't return a usable string so
it came to the crash.

However, if you repeat that experiment you may say "up 4" to get to
dumpstuff.c:1214 and then you can do a

"print *vnode"
and
"print vnodeNumber"

to see which vnode it is.

A possible reason why IH_CREATE could fail is that you already tried
this so many times that in the linktable of the volume all tags for this
vnode number are already in use.

Because of the crashes the volserver didn't remove the remains of the
unsuccessful restores.

>    args=0x4) at ../util/serverLog.c:135
> #3  0x000000000042550e in Log (
>    format=0x1311e7ec <Address 0x1311e7ec out of bounds>) at
> ../vol/common.c:41
> #4  0x000000000040e60c in RestoreVolume (call=<value optimized out>,
>    avp=0x2aaaac03e780, incremental=<value optimized out>,
>    cookie=<value optimized out>) at ../volser/dumpstuff.c:1214
> #5  0x00000000004067f4 in VolRestore (acid=0x41ff510,
>    atrans=<value optimized out>, aflags=1, cookie=0x4aafd000)
>    at ../volser/volprocs.c:1406
> #6  0x0000000000406850 in SAFSVolRestore (acid=0x1311e7ec, atrans=999,
>    aflags=319940588, cookie=0x4) at ../volser/volprocs.c:1378
> #7  0x0000000000413822 in AFSVolExecuteRequest (z_call=0x41ff510)
>    at ../volser/volint.ss.c:104
> #8  0x000000000044d0de in rxi_ServerProc (threadID=5, newcall=0x3e7,
>    socketp=0x4aafd10c) at ../rx/rx.c:1445
> #9  0x000000000042afe4 in rx_ServerProc (dummy=<value optimized out>)
>    at ../rx/rx_pthread.c:303
> #10 0x000000000042a6a8 in server_entry (argp=0x1311e7ec)
>    at ../rx/rx_pthread.c:101
> #11 0x0000003c13c06367 in start_thread () from /lib64/libpthread.so.0
> #12 0x0000003c130d30ad in clone () from /lib64/libc.so.6
> (gdb)
> 
> 
>> You also could try to analyze the dump file with dumptool which is built
>> under sudirectory src/tests.
>>   
> # ./dumptool -v /tmp/students.waqazi-4
> Dump is for volume 537980786 (students.waqazi-4.backup)
> Volume information:
>        id = 537980786
>        parent id = 537980784
>        name = students.waqazi-4.backup
>        flags = inService blessed
>        uniquifier = 1721
>        Creation date = Fri Aug  7 12:38:37 2009
>        Last access date = (null)       Last update date = Thu Aug  6
> 16:32:52 2009
>        Volume owner = 21361
> Scanning vnodes (this may take a while)
> Unknown dump tag "2"
> First vnode scan failed, aborting

Tag '2' (character "2") is indeed an unknown tag. So perhaps the dump
file is not really usable.

Hartmut

> 
> More hints?
> 
> -- Ragge
> 
> 
> 
>> Hartmut
>>
>>  
>>> _______________________________________________
>>> OpenAFS-info mailing list
>>> OpenAFS-info@openafs.org
>>> https://lists.openafs.org/mailman/listinfo/openafs-info
>>>     
>>
>>
>>   
> 


-- 
-----------------------------------------------------------------
Hartmut Reuter                  e-mail 		reuter@rzg.mpg.de
			   	phone 		 +49-89-3299-1328
			   	fax   		 +49-89-3299-1301
RZG (Rechenzentrum Garching)   	web    http://www.rzg.mpg.de/~hwr
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)
-----------------------------------------------------------------