[OpenAFS] Here is a strange...

JoelTompkins@BC.com JoelTompkins@BC.com
Fri, 28 Jun 2002 15:06:59 -0600


This happened with Transarc AFS, so hopefully there are still a few
Transarc wonks out there...

I was called back from lunch today because our e-commerce sites were
down. When we tried to log in to the boxes (IBM 7046-B50s running AIX 4.3.3,
Rlevel 9), even as root from the console, they would hang without a
prompt after the root password was entered.

At the same time, I was getting calls about AFS users getting errors
about waiting for a busy volume (or volumes?). We have three AFS servers: one
read/write and two read-only replicas. Being up against the wall, I
hoped that doing a bos restart -bosserver on the AFS servers would
clear up the AFS problem at least.

I did a bos restart -bosserver on the read/write server, and it came right
back. I tried a bos restart -bosserver on one of the read-only
replica servers, and it hung. I did a bos restart -bosserver on the
other read-only replica server, and it came right back. I went back to
the first read-only replica server and hit ctrl-c to get the command
prompt back. ps -ef showed the AFS processes still running out there, and
they had been for a long time. I tried bos restart -bosserver again,
and it hung again. I figured the only way to get control of AFS back
fast was to reboot the server, so I hit ctrl-c and ran shutdown -Fr.

Lo and behold, everything cleared up on all of the e-commerce boxes
(AFS also cleared up). It was as if AFS was somehow causing problems on a
whole pile of servers because of this one hung server.

One more detail: we have a once-a-minute script running on the
read/write server that looks for changes (web site updates) to release
to the read-only replicas. Instances of it started to pile up about a half
hour to 45 minutes before the sites went down. Something was obviously
making them wait on the release (seeing the same busy volume status?), so
they were sitting out there waiting. After the reboot, all of the releases
came through at once, with rc=0. (We're changing the script now so that it
checks for another instance of itself and mails us if there's something
hanging out there. We always thought our script would pass or fail with
rc=0 or rc=1, not wait out there...) I don't know if the problem started
before the first release hang, or if a pile-up of scripts requesting
releases caused it.
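
For what it's worth, here is a rough sketch of the kind of guard we have in mind for the release script. It is not our actual script: the function name run_exclusive, the lock path, and the timeout are all made up for illustration, and it assumes a Unix box with Python 3 available. The idea is to take a non-blocking lock so a second instance exits at once instead of piling up, and to put a time limit on the release command so it cannot wait forever.

```python
import fcntl
import subprocess


def run_exclusive(cmd, lock_path, timeout_s):
    """Run cmd unless another instance holds lock_path; return a status string."""
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if another run holds it.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return "already-running"
        try:
            # Give up on the command after timeout_s seconds instead of waiting.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return "timed-out"
        # The lock is released when the file is closed on leaving the with block.
        return "ok" if result.returncode == 0 else "failed"
```

A cron-driven wrapper could call something like run_exclusive(["vos", "release", "web.site"], "/tmp/release.lock", 300), where the volume name, lock path, and 300-second limit are placeholders, and send mail whenever the result is anything other than "ok".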

One last detail: this cell services servers in a DMZ for e-commerce, but
it also services internal subnets for other servers. Only the DMZ
servers were affected. The AFS servers are all in the DMZ, which is
something currently under discussion.

Anybody got a clue??? AFS logs said nothing, syslog said nothing, and
the errlog was cleaned out on the reboot (I'm wondering about that
too...).

I never thought an AFS server problem could hang whole racks of other
servers, so this has people looking askance at AFS now. It has been fine
for a year and a half....

Thanks In Advance!!!

Joel Tompkins
Senior Information Systems Engineer
Boise Cascade Corporation
208-384-6415
joeltompkins@bc.com
"If you want the world to turn, you have to crank it." - Mike Riley