[OpenAFS] the best distributed file system

Matt Cocker matt@cs.auckland.ac.nz
Wed, 19 Jun 2002 11:45:04 +1200


You may not be alone. We have seen problems similar to what you describe
in two situations. The first is when we are running routed in listening
mode and restart routed while people are accessing volumes on this
server. A reboot (a reset, actually) is generally required, because the
FS process ends up in the "D" state (uninterruptible sleep, as shown by
ps).
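
For anyone who wants to check for this on a Linux server, here is a
rough sketch (not something taken from our setup) that lists processes
currently in the "D" state by reading /proc/<pid>/stat. The function
names parse_stat_line and find_d_state are made up for illustration.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the text of a /proc/<pid>/stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' instead of on whitespace.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state


def find_d_state(proc_root="/proc"):
    """Return a list of (pid, comm) for processes in state 'D'."""
    stuck = []
    if not os.path.isdir(proc_root):
        return stuck
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the line was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

A process that shows up here only briefly is normal (it is just waiting
on I/O); one that stays listed across repeated runs is the case
described above.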

The second time we saw this problem was when we had a RAID problem.
One of the RAID disks would fail but would not be taken offline. The FS
process would try to write to the logical disk and generate a SCSI
aborting error, and the FS process would go "D". At that point no one
with volumes on that server can access them, and the FS cannot be killed
(a reset and a RAID disk rebuild are required). Unfortunately, if the
machine is in this state for too long, the DB servers also need AFS
restarted to get read/write access to disks. (We have had 12 of these
failures due to a bad batch of disks; all 60 disks are now being
replaced by Dell.) Note: the Linux clients have a really annoying habit
at this point, as they cannot access any directory that has a
sub-directory which is a mount point for a volume from a dead server
(the timeout seems to be very, very long; the NT client seems to be
affected only if you access the dead volume).

Another problem we have is that we have two locations, with a DB server
and an FS at each location. It all works fine until the link between
the two sites goes down, at which point no one has write access to their
volumes (even at the site with the present RW controller) until the link
comes back.
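
One possible way to picture this, assuming write access needs a strict
majority of the DB servers to be reachable (an assumption about the
cause, not something confirmed here, and AFS's actual voting has extra
rules that this sketch does not model): with two DB servers, a link
failure leaves each site with one of two votes, so neither side has a
majority. The names has_quorum and writable_sites are hypothetical.

```python
def has_quorum(reachable, total):
    """True if the reachable servers form a strict majority of all servers."""
    return 2 * reachable > total


def writable_sites(partitions):
    """Map each site to whether it would keep write access.

    partitions maps a site name to the number of DB servers it can
    still reach (including its own) after the link goes down.
    """
    total = sum(partitions.values())
    return {site: has_quorum(n, total) for site, n in partitions.items()}


if __name__ == "__main__":
    # Two sites, one DB server each, link down: nobody has a majority.
    print(writable_sites({"site_a": 1, "site_b": 1}))
    # Three DB servers split 2/1: only the two-server side has a majority.
    print(writable_sites({"site_a": 2, "site_b": 1}))
```

Under that assumption, a third DB server at one of the sites would let
that site keep a majority when the link drops.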


cheers

matt

  system disk failure it will active FS process
 >
 > Maybe/maybe not... One negative of our problem is that no one else
 > sees it. At least the CoW problem is known, just not fixed.
 >
 > In our case, we occasionally see the file or vol server just stop
 > handling requests... No indication of the problem. Rarely we have it
 > segfault, but have not switched it with the lwp based fileserver yet
 > to get a backtrace.
 >
 >