[OpenAFS] many packet are as rx_ignoreAckedPacket and meltdown
Robert Banz
banz@umbc.edu
Fri, 6 Oct 2006 11:31:37 -0400
>
>> First, upgrade your fileserver to an actual "production" release,
>> such as 1.4.1. 1.3.81 was pretty good, but not without
>> problems. (1.4.1 is not without problems, but has fewer.)
>
> We are considering that as one (last-resort) possibility, but we are
> running tens of Linux (Debian/stable) servers (not only AFS) as
> part of our distributed computing environment, and we are trying to
> keep our server configuration as close as possible to the stable dist.
> In short: we haven't had any significant AFS problems with
> the same configuration for 1+ years...
Keeping with "random linux distro's" idea of stable for your AFS code
is not a good idea. Stick with OpenAFS's idea of stable -- and while
for short periods I've run "development" (e.g. late 1.3.*) code on my
production AFS servers when I was in a pinch, stick to the production
releases. Ignore what Debian thinks, because they don't know what
they're talking about ;)
>> Second, when your server goes into this state, does it come out
>> of it naturally or do you have to restart it?
>
> Actually, this state can "freeze" many of our users and services
> (even if the affected server serves RO replicas only... and yes, I
> really don't understand this behavior...) and the FS is unable to
> return to a "normal" state in a reasonable time (actually, a
> reasonable time is pretty small for us/our users...). So, we are
> trying to "solve" our current problems with an fs restart. :-(
>
> (
> As you can see from the original post, the FS is still alive, but has
> no idle threads. Waiting connections (clients) oscillate around 200
> and "probably" could be served in tens of minutes...
> )
You could have the "horrible" host callback table mutex lockup
problem. The most for-certain way to discover this is to generate a
core from your running fileserver at the time (on Solaris I use
gcore, but you could also kill -SEGV it instead of restarting),
attach a debugger to the core, and see where the threads are
sitting. If you've compiled your OpenAFS distribution with
--enable-debug (which you should), and you examine the stack traces of
some of the threads, you may see a lot of them here:
=>[5] CallPreamble(acall = ???, activecall = ???, tconn = ???, ahostp
= ???) (optimized), at 0x8082178 (line ~315) in "afsfileprocs.c"
(dbx) list
315 H_LOCK;
316 retry:
317 tclient = h_FindClient_r(*tconn);
318 thost = tclient->host;
319 if (tclient->prfail == 1) { /* couldn't get the CPS */
...
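As a rough sketch of the core-and-inspect step described above, the shell functions below assume gcore and gdb are installed (dbx on Solaris works along the same lines). The function names, the /tmp core path, and the idea of counting frames with grep are illustrative assumptions, not anything shipped with OpenAFS.

```shell
#!/bin/sh
# Count the lines in a saved stack dump that mention CallPreamble.
# The dump is assumed to be the text output of a debugger's
# "all threads backtrace" command, one frame per line.
count_callpreamble() {
    # grep -c prints the number of matching lines; "|| true" keeps a
    # count of zero from looking like a failure to the caller.
    grep -c 'CallPreamble' "$1" || true
}

# Hypothetical capture step (only runs when you call it): snapshot the
# live fileserver with gcore, then dump every thread's stack with gdb.
#   $1 = fileserver pid, $2 = path to the fileserver binary,
#   $3 = file to write the stack dump to
capture_stacks() {
    pid=$1; binary=$2; out=$3
    gcore -o /tmp/fileserver.core "$pid" &&
        gdb -batch -ex 'thread apply all bt' \
            "$binary" "/tmp/fileserver.core.$pid" > "$out"
}
```

You would run capture_stacks with the fileserver's pid, its binary, and an output file, then pass that file to count_callpreamble. A count close to the number of fileserver threads suggests most of them are stuck waiting on the host lock.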
If this is the case...well...there's no for-sure way around it right
now, though some people, IIRC, have been working on some code changes
to avoid it. Some steps you can take, though, to mitigate the
problem involve making sure all your clients respond promptly on
their AFS callback ports (7001/udp). With all of the packet manglers
out on the network (host-based firewalls, overanxious network
administrators, etc.) you may find things "in the way" of the AFS
fileservers contacting their clients on the callback port. One of
the things that can cause this type of "lockup" is requests to these
clients timing out or taking a long time. If things have been
working fine for a while and now they don't, network topology or
firewall changes like this could be a culprit.
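For spot-checking a single suspect client, a minimal sketch along these lines may help. It assumes cmdebug is on your PATH and that an unreachable callback port shows up as "server or network not responding" in its output; the function name and the CMDEBUG override variable are hypothetical.

```shell
#!/bin/sh
# Return 0 if the client's cache manager answers on its callback port,
# 1 if cmdebug reports that it is not responding.
# CMDEBUG may be set to an alternate path to the cmdebug binary.
callback_port_ok() {
    if "${CMDEBUG:-cmdebug}" -servers "$1" -cache 2>&1 |
        grep -q 'server or network not responding'; then
        return 1
    fi
    return 0
}
```

Something like `callback_port_ok some-client || echo "some-client is not answering"` would then flag one host at a time, where some-client is a placeholder for the address you want to test.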
I've attached a script that I periodically run to see how many "bad"
clients are using my fileservers, so that I may try to track them
down and swat at them...
-----
#!/usr/local/bin/perl
# Report AFS clients that do not answer on their callback port.
use strict;
use warnings;
use Socket;

$| = 1;    # unbuffered output

# Return the unique client IPs with connections to the given fileserver.
sub getclients {
    my $server = shift @_;
    my %ips;
    print STDERR "getting connections for $server\n";
    open(RXDEBUG, "/usr/afsws/etc/rxdebug -allconnections $server|")
        || die "cannot exec rxdebug\n";
    while (<RXDEBUG>) {
        if ( /Connection from host ([^, ]+)/ ) {
            $ips{$1} = 1;
        }
    }
    close RXDEBUG;
    return keys(%ips);
}

# Return 1 if the client's cache manager answers cmdebug, 0 otherwise.
sub checkcmdebug {
    my $client = shift @_;
    my $ok = 1;
    print STDERR "checking $client\n";
    open(CMDEBUG, "/usr/afsws/bin/cmdebug -cache $client 2>&1|")
        || die "cannot exec cmdebug\n";
    while (<CMDEBUG>) {
        if ( /server or network not responding/ ) {
            $ok = 0;
            last;
        }
    }
    close CMDEBUG;
    return $ok;
}

my %clients;
# modify this to run getclients on all of your AFS servers...
foreach my $y ( "ifs1", "ifs2", "hfs1", "hfs2", "bfs1", "hfs11", "hfs12" ) {
    foreach my $x ( getclients($y . ".afs.umbc.edu") ) {
        $clients{$x}++;
    }
}

# Print each unresponsive client with its reverse-DNS name, if any.
foreach my $x ( keys(%clients) ) {
    if ( !checkcmdebug($x) ) {
        my $iaddr = inet_aton($x);
        my $name  = defined($iaddr) ? gethostbyaddr($iaddr, AF_INET) : undef;
        print "$x(" . ( defined($name) ? $name : "unknown" ) . ")\n";
    }
}