[OpenAFS] Re: Pauses in vos dump commands

E. Chris Garrison ecgarris@iupui.edu
Fri, 14 Oct 2005 08:47:37 -0500


> First, I do a "vos backupsys" for the server.
> Next, I run a shell script that loops through all of the volumes on
> the server, running the following:
> 
>   /usr/local/sbin/vos dump -id $i -time $since | <archive command>
> 
> Where $i is the name of the voluem, and $since is a string based on
> the last full backup.
> 
> (don't worry about the archive command, it takes standard input and
> saves data to tape across the network).
> 
> Most of the time, the vos dumps go quickly by, but every 4 or 5
> volumes, it'll hang, seeming to think about the request a long time,
> not even sending the first byte of data til it unhangs.  Usually this is 
> around 10 minutes.  It doesn't matter if there were zero changes or a 
> ton of changes, and it's not always the same volumes that hang.  
> Once it unhangs, it zips through the dump and probably the next few 
> as well, til it hits another one that makes it hang for no reason I 
> can figure.  The load is 0.00 while it hangs, there I/O is doing
> nothing... it's frustrating.

Another person on the list named Steve suggested I try to substitute
"cat > /dev/null" for the archive command and run strace on the vos
dump and archive.

Well, when I use the "cat > /dev/null", it pauses in the exact same
way.  Good test though, at least it helps make the case that it's the
vos dump not the archive command that's the problem.  

The strace hangs at this point:

  rt_sigprocmask(SIG_BLOCK, [INT CHLD], [CHLD], 8) = 0
  clone(child_stack=0,
  flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
  child_tidptr=0xb7ff2708) = 28554
  rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
  close(3)                                = 0
  rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD], 8) = 0
  rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
  rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD], 8) = 0
  rt_sigaction(SIGINT, {0x807603b, [], SA_RESTORER, 0x587a48},
  {SIG_DFL}, 8) = 0
  waitpid(-1,

Not that I know what all the code means, but it's odd that it pauses
in the middle of printing out the command like that.

When it resumes 10 minutes later, it says this:

  Dumped volume home.rsurya.backup in stdout
  [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0) = 28553
  waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0) = 28554
  rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
  rt_sigaction(SIGINT, {SIG_DFL}, {0x807603b, [], SA_RESTORER,
  0x587a48}, 8) = 0
  close(3)                                = -1 EBADF (Bad file
  descriptor)
  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  --- SIGCHLD (Child exited) @ 0 (0) ---
  waitpid(-1, 0xbffd4c9c, WNOHANG)        = -1 ECHILD (No child
  processes)
  sigreturn()                             = ? (mask now [])

Any of this look familiar?

Thanks again,

Chris
--
Chris Garrison			| Principal Mass Storage Specialist
ecgarri@iupui.edu		| Indiana University UITS - RAC - DSSG
W: 317-278-1207 M: 317-250-8649 | Jabber IM: ecgarris@itso.iu.edu