[OpenAFS] 1.6.2 buserver + butc

Prasad Dharmasena pkd@glue.umd.edu
Tue, 26 Mar 2013 20:04:15 -0400 (EDT)


Hello,

We recently upgraded our OpenAFS servers to 1.6.2, all running on 
Solaris 10 (Generic_147440-27 sun4v sparc).

Since the buserver upgrade, backups have been failing on various 
servers and partitions.

Works: fileservers = 1.4.14.1, 1.6.1, 1.6.2
       butc = (client side) 1.6.1
       buserver = 1.4.14.1

Fails: fileservers = 1.4.14.1, 1.6.1, 1.6.2
       butc = (client side) 1.6.1, 1.6.2 
       buserver = 1.6.2

For a partition (volset) that doesn't complete the 'backup dump', 
the butc log /usr/afs/backup/TL_<port-offset> suggests that butc is 
waiting for a DumpID from the buserver.

---------------------
srv3:/usr/afs/backup:# cat TL_3106 
Tue Mar 26 10:30:11 2013: Starting Tape Coordinator: Port offset 3106   Debug level 0
Tue Mar 26 10:30:11 2013: Token expires: Wed Dec 31 19:00:01 1969

Tue Mar 26 10:31:21 2013: Task 3106001: Dump TSM_srv3_f_135.04
---------------------

whereas for the butc/dump processes that do proceed, the log goes on 
to record the DumpID and the start of pass 1.

---------------------
srv3:/usr/afs/backup:# head TL_3115
Tue Mar 26 10:30:17 2013: Starting Tape Coordinator: Port offset 3115   Debug level 0
Tue Mar 26 10:30:17 2013: Token expires: Wed Dec 31 19:00:01 1969

Tue Mar 26 10:31:40 2013: Task 3115001: Dump TSM_srv3_o_157.26
Tue Mar 26 10:31:42 2013: Task 3115001: Dump TSM_srv3_o_157.26 (DumpID 1364308301)
Tue Mar 26 10:31:42 2013: Task 3115001: Starting pass 1
Tue Mar 26 10:31:42 2013: Task 3115001: Volume h.abcd.jchen114.backup (1971521033) not dumped - has not been modified since last dump
...
---------------------
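
The difference between the two logs can be checked mechanically. 
Below is a rough Python sketch, not an OpenAFS tool: it assumes the 
TL_ line format is exactly what the two excerpts above show, and the 
regex and the function name tasks_without_dumpid are my own. It lists 
tasks that logged a "Dump <name>" line but never a "(DumpID ...)" line.

```python
import glob
import re

# Matches lines like:
#   Tue Mar 26 10:31:21 2013: Task 3106001: Dump TSM_srv3_f_135.04
#   Tue Mar 26 10:31:42 2013: Task 3115001: Dump TSM_srv3_o_157.26 (DumpID 1364308301)
# The format is assumed from the TL_ excerpts above, nothing more.
DUMP_LINE = re.compile(r"Task (\d+): Dump (\S+)(?: \(DumpID (\d+)\))?\s*$")


def tasks_without_dumpid(log_text):
    """Return {task_id: dump_name} for tasks that never logged a DumpID."""
    started = {}      # task id -> dump name, from the first "Dump" line seen
    have_id = set()   # task ids that later logged "(DumpID ...)"
    for line in log_text.splitlines():
        m = DUMP_LINE.search(line)
        if not m:
            continue
        task, name, dump_id = m.groups()
        started.setdefault(task, name)
        if dump_id is not None:
            have_id.add(task)
    return {t: n for t, n in started.items() if t not in have_id}


if __name__ == "__main__":
    # Path taken from the post; adjust if the logs live elsewhere.
    for path in sorted(glob.glob("/usr/afs/backup/TL_*")):
        with open(path, errors="replace") as fh:
            waiting = tasks_without_dumpid(fh.read())
        for task, name in waiting.items():
            print(f"{path}: task {task} ({name}) has no DumpID yet")
```

A task that started only a second or two ago will also show up here, 
so this is only meaningful once the dumps have had time to get going.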


The set of vicep* partitions (or volsets) on which the backup 
dump/butc hangs is not consistent from run to run.  If we kill and 
restart the dump process, some of the previously hung volsets finish 
while others hang.

What information should we collect from butc and buserver in order 
to track down the problem?

Thanks.


-pkd

-- 
Prasad Dharmasena
University of Maryland, College Park