[OpenAFS] AFS file inconsistency one character wrong

Marcus Watts mdw@umich.edu
Wed, 17 Jan 2007 22:13:32 -0500

> Message-ID: <459E8D6E.8010202@gmx.net>
> From: Duc Bao Ta <xiedebao@gmx.net>
> To: openafs-info@openafs.org
> Subject: [OpenAFS] AFS file inconsistency one character wrong
> Sender: openafs-info-admin@openafs.org
> Errors-To: openafs-info-admin@openafs.org
> Hallo,
> I have a serious problem with AFS. I am using Debian sarge Linux 2.6.15 
> with openafs 1.4.1 with openafs-client from backport. We have a cluster 
> of ~20 computers and three fileservers. All clients have identical 
> installation (only hardware differs from group to group).
> The problem is that on one computer there are two files (so far) that look
> different on that computer than on other identical computers.
> The first file was only named wrong, one letter was capital instead of 
> non-capital. The second file was the same inconsistency, but this time 
> it was a character in the file.
> I can copy e.g. from my machine this file in my local home directory and 
> then via ssh to the "faulty" computer I can copy the same file again to 
> my local home. "diff" tells me they are different and a hex output 
> revealed that there is one difference.
> I am really worried, because I cannot trust that machine anymore!?
> Can anyone comment?

Like others have said, almost surely hardware.  Others have said memory
(very likely), or disk (less likely--but possible).  It's also possible
for a faulty network to do this, or bad motherboard logic.  Most modern
network protocols have at least basic CRC logic in them and will discard
bad frames.  Most disk drive hardware since the 60's has ECC and CRC,
and will retry reads that fail.
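
If you still have both copies of the file, it's worth looking at exactly
which bits differ.  Here is a rough Python sketch (the function names are
mine, not from any AFS tool) that lists each differing byte and says
whether the difference is a single flipped bit:

```python
def diff_bytes(a: bytes, b: bytes):
    """Return (offset, byte_a, byte_b, single_bit) for each differing position.

    Only the common prefix is compared; a length mismatch is reported
    separately by compare_files().
    """
    out = []
    for off, (x, y) in enumerate(zip(a, b)):
        if x != y:
            delta = x ^ y
            # A nonzero value with exactly one bit set is a power of two.
            single_bit = (delta & (delta - 1)) == 0
            out.append((off, x, y, single_bit))
    return out


def compare_files(path_a, path_b):
    """Print every differing byte between two files (paths are placeholders)."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    if len(a) != len(b):
        print(f"lengths differ: {len(a)} vs {len(b)}")
    for off, x, y, single in diff_bytes(a, b):
        kind = "single-bit flip" if single else "multi-bit change"
        print(f"offset {off:#x}: {x:#04x} vs {y:#04x} (xor {x ^ y:#04x}, {kind})")
```

Upper and lower case ASCII letters differ only in bit 0x20, so a capital
letter turning up in place of a lowercase one is exactly what one flipped
bit looks like.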

Things to check:
memory - do you have ECC?
	You want ECC on servers or machines
	that process data you care about.
	Most consumer machines do NOT have ECC, and the
	sales critters at ye olde neighborhood computer store
	will tell you ECC isn't necessary.
	Don't buy servers at ye olde neighborhood computer store.
	Also, run memtest86.
bios settings
	check cpu speed, bus speeds, etc.  Usually there's
	several for memory, probably including enabling ECC.
	ECC is less useful if it can't signal uncorrectable errors.
	Usually these are very confusing.  Getting it wrong may result
	in a computer that either works fine, works most of the time,
	fails quickly, or in extreme circumstances, catches fire and
	self-destructs.  The latter is unlikely and with good quality
	hardware impossible.  Since you apparently have multiple
	examples of "identical" hardware, checking bios settings for
	consistency should be easy.
temperature and cooling
	high internal temperatures will usually make logic marginal.
	There's usually a max temperature above which the
	memory (and other logic) won't work.  You probably
	aren't intentionally doing this, but you should still
	check ambient temperature, air flow, correct fan operation,
	internal case temperature, etc.  Most fancier server
	class machines include internal hardware to monitor
	temperature and fans and with the correct software
	will page somebody when these go out of spec.
dmesg output.  Is your disk controller unhappy with the world?
	Does the system config match what the os expects?
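
For the dmesg check, something like the following Python sketch can
pick out lines worth a second look.  The keyword list is only my guess at
what a sick disk controller or memory subsystem might log; adjust it for
your drivers.

```python
import re
import subprocess

# Assumed keyword list, not an official set of kernel messages.
SUSPECT = re.compile(
    r"error|fail|timeout|reset|parity|ecc|machine check", re.IGNORECASE
)


def suspicious_lines(log_text: str) -> list[str]:
    """Return the lines of log_text that match any suspect keyword."""
    return [line for line in log_text.splitlines() if SUSPECT.search(line)]


def scan_dmesg() -> list[str]:
    """Run dmesg and filter its output (may need root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return suspicious_lines(result.stdout)
```

Call scan_dmesg() on the suspect machine and on a known good one and
compare; with "identical" hardware the output should look much the same.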

Needless to say, if you don't trust a server, don't put it
into production.

I once encountered an Apple ][ that had an interesting floppy disk
failure mode.  It had gone into permanent "write-only" mode.
It would even write on write-protected floppies.  This was not
a good thing.

When I first started working with AFS at umich, they had a "proteon ring"
for the campus backbone.  Part of it was apparently being run out
of spec -- there were runs between central & north campus where
the cable runs exceeded the maker's spec.  It still worked, mostly.
But it had something called "data sensitivity", which is to say, certain
rare bit-patterns in packets would be altered in transit and dropped
by the recipient.  For rx, if this happened, that meant the connection
was hosed.  Packets sent to see if the connection was alive always
worked, and attempts to send the bad packet always failed.
This led to a perpetual stalemate.
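
A toy model of that stalemate, in Python.  This is not rx's actual
logic; the bad pattern, the function names, and the retry loop are all
made up for illustration.

```python
BAD_PATTERN = b"\xaa\x55"  # hypothetical bit pattern the link mangles


def link_delivers(packet: bytes) -> bool:
    """Toy link: a packet containing BAD_PATTERN is corrupted and dropped."""
    return BAD_PATTERN not in packet


def send_with_retries(data: bytes, max_rounds: int = 5):
    """Try to send data; after each loss, probe the peer with a keepalive.

    Returns (status, log) where status is "delivered", "dead", or "stalled".
    """
    log = []
    for _ in range(max_rounds):
        if link_delivers(data):
            log.append("data: delivered")
            return "delivered", log
        log.append("data: lost")
        if link_delivers(b"ping"):
            # The peer answers, so the connection is kept and we retry.
            log.append("keepalive: ok")
        else:
            log.append("keepalive: lost")
            return "dead", log
    return "stalled", log
```

A payload containing the bad pattern never gets through, but the
keepalive always does, so the sender never gives up and never makes
progress: send_with_retries(b"x\xaa\x55y") ends up "stalled" no matter how
many rounds it is given.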

					-Marcus Watts