[OpenAFS] Re: 1.6.2 buserver + butc
Andrew Deason
adeason@sinenomine.net
Wed, 27 Mar 2013 13:44:56 -0500
On Tue, 26 Mar 2013 20:04:15 -0400 (EDT)
Prasad Dharmasena <pkd@glue.umd.edu> wrote:
> The vicep* partitions (or volsets), for which the backup dump/butc
> hang, are not consistent. If we kill and restart the dump process,
> some of the previously hung volsets finish while others hang.
>
> What info do we need to grab from butc and buserver in order to
> track the problem?
I assume there's nothing helpful in BackupLog?

I haven't worked with butc/buserver for a long time, so I don't remember
if there are ways to get more information out of them specifically.
However, just going by what works in general:

One pretty surefire way of getting to know what's happening is to grab a
core from the butc and buserver processes while they are hanging ('gcore
<pid>'). You'll need a developer to look at that to say what's going on,
and those cores will contain sensitive information. But if there is
someone you trust enough with it, that will let you know what's
happening.

Another way that may or may not be useful is to capture a wire dump of
whatever ports butc is using at the time. I don't remember how much of
the net communication is encrypted for that process, so it may not be
useful at all; but if there's some that's not encrypted, it may help
indicate what it's hanging on.

Besides that, the buserver process _may_ give more useful information if
you turn on debugging output. Unfortunately, it doesn't look like the
buserver has a runtime option to turn on debug log messages. You can
turn it on if you're willing to rebuild by changing the value of:

    int debugging = 0;

in src/budb/server.c (or change the value by starting the process
manually in a debugger). Depending on how much log data that spews out
when debugging is on (I don't remember), it may impact performance. So,
that's not great either, but it's an option.
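For a rough idea of what a flag like that does, here is a minimal sketch
of a debug-gated log call. This is not the actual budb code: the
debug_log helper and its return convention are made up for illustration,
and only the 'debugging' variable name and its default of 0 come from
server.c.

```c
#include <stdarg.h>
#include <stdio.h>

/* Same name and default as the flag in src/budb/server.c. */
int debugging = 0;

/*
 * Hypothetical helper, not part of OpenAFS: write a formatted message
 * to 'out' only when 'debugging' is nonzero. Returns the number of
 * characters written, or 0 when the message was suppressed or the
 * write failed.
 */
int
debug_log(FILE *out, const char *fmt, ...)
{
    va_list ap;
    int n;

    /* Flag is off: skip the formatting and the write entirely. */
    if (!debugging)
        return 0;

    va_start(ap, fmt);
    n = vfprintf(out, fmt, ap);
    va_end(ap);
    return n < 0 ? 0 : n;
}
```

In a sketch like this, a call with the flag at 0 costs one branch and
writes nothing. Once the flag is nonzero, every call formats and writes
a message, which is where any performance impact would come from.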
--
Andrew Deason
adeason@sinenomine.net