[OpenAFS] backup->buserver 1.4.2 hangs/loops on gettimeofday

John W. Sopko Jr. sopko@cs.unc.edu
Thu, 16 Nov 2006 10:33:54 -0500


I found why the "backup" command hung on startup. During the
rpm upgrade from 1.4.1 to 1.4.2 the following link got made. I could not
find what created it, but it appeared after updating to
the openafs 1.4.2 rpms:

/usr/vice/etc/CellServDB -> /usr/afs/etc/CellServDB
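
A quick way to check other machines for the same link is sketched below.
This is only an illustration: the helper name symlink_target is made up
here, and the path is the one from the listing above.

```python
import os

def symlink_target(path):
    """Return the resolved target if path is a symlink, else None."""
    if os.path.islink(path):
        return os.path.realpath(path)
    return None

if __name__ == "__main__":
    # Client-side file that turned out to be a link after the upgrade.
    client_db = "/usr/vice/etc/CellServDB"
    target = symlink_target(client_db)
    if target is None:
        print(client_db, "is not a symlink")
    else:
        print(client_db, "->", target)
```

If the target printed is /usr/afs/etc/CellServDB, the client init script
and the server processes are sharing one file.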

When /etc/init.d/openafs-client runs it builds a new CellServDB, which,
through the link, also replaced the server's file. As we know, the
server's /usr/afs/etc/CellServDB should contain only the db servers for
the cell. The buserver got confused by all the other server entries, and
our own cell was not in the file at all because we use AFSDB DNS records.

Updating the server's CellServDB file with our db servers and restarting
the buserver solved the backup hang issue.
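
To spot this condition before the buserver trips over it, something like
the following could be run against the server's CellServDB. It is a
minimal sketch that assumes the usual file layout (a ">cellname" header
line followed by "address  #hostname" lines). The function names and the
cell name example.edu are placeholders, not anything from our setup.

```python
def parse_cellservdb(text):
    """Map each cell name to its list of (address, hostname) entries."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Cell header, e.g. ">cellname   #description"
            fields = line[1:].split("#", 1)[0].split()
            current = fields[0] if fields else None
            if current is not None:
                cells[current] = []
        elif current is not None:
            # Server line, e.g. "address   #hostname"
            addr, _, host = line.partition("#")
            cells[current].append((addr.strip(), host.strip()))
    return cells

def check_server_db(text, my_cell):
    """Return a list of problems found in a server-side CellServDB."""
    cells = parse_cellservdb(text)
    problems = []
    if my_cell not in cells:
        problems.append("cell %s is missing" % my_cell)
    extra = sorted(name for name in cells if name != my_cell)
    if extra:
        problems.append("%d foreign cell(s) listed" % len(extra))
    return problems

if __name__ == "__main__":
    # Made-up sample content; read /usr/afs/etc/CellServDB in real use.
    sample = (
        ">example.edu   #Hypothetical local cell\n"
        "192.0.2.10     #db1.example.edu\n"
        ">other.example.org   #Some other cell\n"
        "198.51.100.5   #afs.other.example.org\n"
    )
    for problem in check_server_db(sample, "example.edu"):
        print(problem)
```

An empty result means the file lists only the local cell, which is what
the server side expects.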

I looked at the rpm scriptlets with "rpm --scripts -qp openafs*.rpm"
but could not find what created the link. I am 99% sure it was nothing
else I ran. Strange.
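
For anyone who wants to repeat the search, the scriptlet output can be
saved to a file and filtered mechanically instead of read by eye. The
sketch below assumes the output of the rpm command above has been
captured as text. The function name and the search strings are arbitrary
choices, and a link created indirectly (for example by a helper script)
would not show up this way.

```python
def lines_mentioning(script_text, needles=("CellServDB", "ln -s")):
    """Return (line_number, line) pairs that contain any of the needles."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        if any(needle in line for needle in needles):
            hits.append((number, line.strip()))
    return hits

if __name__ == "__main__":
    import sys
    # Usage: rpm --scripts -qp openafs*.rpm | python this_script.py
    for number, line in lines_mentioning(sys.stdin.read()):
        print("%d: %s" % (number, line))
```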

Anyway, there is still a problem with the 1.4.2 buserver and butc.
If I run the 1.4.2 buserver/butc, butc still segfaults. If I run the 1.4.1
buserver/butc, the dump runs fine. I confirmed this on both our backup
servers. I also tried the 1.4.1 butc with the 1.4.2 buserver and it still
segfaulted.

I am using the non-threaded butc as the threaded butc is known to segfault
on redhat-3.

Here are straces of butc from the bad and the good dump that may help.
I will be happy to try anything to get this solved. If I do not hear
back in a few days I will file a bug.

The strace command I used shows child processes and timestamps:

strace -f -t -o /tmp/butc.1.4.12.no-thread.ok \
    /usr/sbin/butc.1.4.1.no-threads -port 4 -localauth

The /usr/afs/logs/BackupLog did not show any clues.

The traces are at:

http://www.cs.unc.edu/~sopko/backup/butc.1.4.1.no-thread.ok
http://www.cs.unc.edu/~sopko/backup/butc.1.4.2.no-thread.segfault

butc segfaults just after the 4th occurrence of "exit_group(0)".
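
To jump straight to that spot in a saved trace, a small helper like the
one below may be handy. It is a sketch only: the function name and the
default of 10 following lines are arbitrary choices, and it assumes the
trace has been downloaded to a local file.

```python
def context_after_nth(lines, marker="exit_group(0)", n=4, after=10):
    """Return up to `after` lines following the n-th line containing marker."""
    seen = 0
    for index, line in enumerate(lines):
        if marker in line:
            seen += 1
            if seen == n:
                return lines[index + 1 : index + 1 + after]
    # Fewer than n occurrences of the marker were found.
    return []

if __name__ == "__main__":
    import sys
    # Usage: python this_script.py butc.1.4.2.no-thread.segfault
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as trace:
            for line in context_after_nth(trace.read().splitlines()):
                print(line)
```

Running it against both traces gives the calls made right after the
fourth exit_group(0) in each, so they can be compared side by side.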

The dump that ends in the butc segfault does work for some time. We append
dumps to each other (dump -append). The backup dump command prints out the
volumes to be dumped to tape, and the tape gets accessed for some time. I
believe this is to find the last dump on the tape. Then butc segfaults.

-- 
John W. Sopko Jr.               University of North Carolina
email: sopko AT cs.unc.edu      Computer Science Dept., CB 3175
Phone: 919-962-1844             Sitterson Hall; Room 044
Fax:   919-962-1799             Chapel Hill, NC 27599-3175