[OpenAFS] Re: Pauses in vos dump commands

E. Chris Garrison ecgarris@iupui.edu
Tue, 18 Oct 2005 13:30:37 -0500


Hi Frank,

On Tue, Oct 18, 2005 at 09:31:02AM +0200, Frank Burkhardt wrote:
> On Fri, Oct 14, 2005 at 08:47:37AM -0500, E. Chris Garrison wrote:
> 
> [snip]
> 
> It's not. It means, that waitpid() is called, which blocks the program until
> a child process terminates. If you want to see, what's going on within the
> child processes, use 'strace -f' instead of 'strace' to (f)ollow child
> processes spawned by the straced process.
> 
> A guess of mine: Maybe one of your DB-Servers or one of your DNS-Servers is down?
> This is a problem, I have in one of my cells although I have to wait ~ 30s instead
> of 10min.

Okay, thanks for the 'strace -f' tip.  It now produces 77,000 lines
per volume backed up, and it seems it has significant pauses during
"select" statements:

  [pid 32122] select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
  [pid 32122] getitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={3342, 235902}}) = 0
  [pid 32122] gettimeofday({1129658707, 640781}, NULL) = 0
  [pid 32122] select(4, [3], NULL, NULL, {15, 543222}) = 0 (Timeout)
  [pid 32122] gettimeofday({1129658723, 182210}, NULL) = 0
  [pid 32122] select(4, [3], NULL, NULL, {0, 1793}) = 0 (Timeout)
  [pid 32122] gettimeofday({1129658723, 184096}, NULL) = 0
  [pid 32122] getitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={3326, 692265}}) = 0
  [pid 32122] select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
  [pid 32122] getitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={3326, 692265}}) = 0
  [pid 32122] gettimeofday({1129658723, 184623}, NULL) = 0
  [pid 32122] getitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={3326, 691266}}) = 0
  [pid 32122] sendmsg(3, {msg_name(16)={sa_family=AF_INET, sin_port=htons(7003), sin_addr=inet_addr("134.68.142.53")}, msg_iov(2)=[{"\212c\36\261[.\266\24\0\0\0\1\0\0\0\1\0\0\0\31\1\7\0\2"..., 28}, {"\0\0\2\7\0\0\0\4home", 12}], msg_controllen=0, msg_flags=0}, 0) = 40
  [pid 32122] gettimeofday({1129658723, 185118}, NULL) = 0
  [pid 32122] select(4, [3], NULL, NULL, {15, 799505}

Well, the pauses are only 10-15 seconds in length a lot of the time, 
but with a large number of select statements like that, it adds up 
to 10 minutes.  Oddly, most entries take seconds to do a 'vos dump',
it's just a few (1 in 3 or 1 in 4) will take 10 minutes.  None of the 
volumes has much data yet.  

We have just three database servers, and they seem to be all up and 
running.  I'm still at a loss as to what could be causing this.

Chris
--
Chris Garrison			| Principal Mass Storage Specialist
ecgarris@iupui.edu		| Indiana University UITS - RAC - DSSG
W: 317-278-1207 M: 317-250-8649 | Jabber IM: ecgarris@itso.iu.edu