[OpenAFS] the best distributed file system
Matt Cocker
matt@cs.auckland.ac.nz
Wed, 19 Jun 2002 11:45:04 +1200
You may not be alone. We have seen problems similar to what you describe
in two situations. The first is when we are running routed in listening
mode and restart routed while people are accessing volumes on this
server. A reboot (a reset, actually) is generally required, because the FS
process is stuck in the "D" (uninterruptible sleep) state.
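For anyone who wants to check for this before reaching for the reset button, here is a rough sketch (not something we run in production) that lists processes in the "D" state on a Linux server by reading /proc/<pid>/stat. The function names are made up for illustration, and it assumes the standard Linux procfs layout.

```python
import os

def parse_stat(stat_text):
    """Split one /proc/<pid>/stat record into (pid, command, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    # The single-letter process state is the first field after the name.
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state

def uninterruptible_processes(proc_root="/proc"):
    """Return (pid, command) pairs for processes currently in state "D"."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style procfs, nothing to report
    found = []
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the record was unreadable
        if state == "D":
            found.append((pid, comm))
    return found

if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

A process that shows up here only briefly is normal (it is just waiting on I/O); it is the one that stays in the list across repeated runs that is wedged.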
The second time we saw this problem was when we had a RAID problem.
One of the RAID disks would fail but would not be taken offline. The FS
process would try to write to the logical disk, generate a SCSI
aborting error, and go "D". No one with volumes on
that server can then access them, and the FS cannot be killed (a reset and
RAID disk rebuild are required). Unfortunately, if the machine is in this
state for too long, the DB servers also need AFS restarted to get read/write
access to disks. (We have had 12 of these failures due to a bad batch of
disks; all 60 disks are now being replaced by Dell.) Note: the Linux
clients have a really annoying habit at this point: they cannot
access any directory that has a sub-directory which is a mount point
for a volume from a dead server (the timeout seems to be very, very long;
the NT client seems to be affected only if you access the dead volume).
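To find out which directories are affected on a client without hanging your own shell, something along these lines might help. It is only a sketch: it stats each path in a child process and gives up after a timeout. The name path_responds and the 5 second default are my own choices, and I am assuming that abandoning a stuck child is acceptable (if the child is itself in uninterruptible sleep, the kill will not take effect until the server comes back).

```python
import subprocess
import sys

def path_responds(path, timeout_seconds=5.0):
    """Stat a path in a child process.

    Returns True if the stat succeeded, False if it failed promptly,
    and None if the child did not finish within the timeout.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return child.wait(timeout=timeout_seconds) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: do not wait for the child afterwards, since a
        # process blocked on a dead file server may not die when signalled.
        child.kill()
        return None

if __name__ == "__main__":
    for candidate in sys.argv[1:]:
        print(candidate, path_responds(candidate))
```

Running it over the directories that contain mount points should show None for the ones that lead to the dead server and True for the rest.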
Another problem we have is that we have two locations, with a DB server
and an FS at each location. It all works fine until the link between
the two sites goes down, at which point no one has write access to their
volumes (even at the site with the present RW controller) until the link
comes back.
cheers
matt
system disk failure it will active FS process
>
> Maybe/maybe not... One negative of our problem is that no one else
> sees it. At least the CoW problem is known, just not fixed.
>
> In our case, we occasionally see the file or vol server just stop
> handling requests... No indication of the problem. Rarely we have it
> segfault, but have not switched it with the lwp based fileserver yet
> to get a backtrace.
>
>