[OpenAFS-devel] Re: [OpenAFS] Volume root corruptions - anybody seen those? Explained and corrected

Tue, 12 Aug 2008 16:39:26 +0200

--------------050709000404040805050904
Content-Type: text/plain; charset="ISO-8859-15"; format=flowed
Content-Transfer-Encoding: 7bit

In June I reported volumes corrupted by the salvager. Hartmut Reuter 
fixed the problem a while ago suspecting a big/little endian problem 
when setting the length of a salvaged directory (-salvagedir option). 
This wasn't the case, but the error was in that area and his patch 
fixes it!

For completeness, here's a complementary patch that fixes the 
SplitInt64 macro which is the culprit, on all 64-bit machines:

Here's a little refresh:

> 
> on a few (1/1000) volumes the length of the volume root directory would 
> be flagged as something astronomically big, to be corrected to 6144, 
> 8192 or other more modest multiples of 2k. And sure enough, doing the 
> real salvage the root would then be logically truncated (not to say 
> completely messed up) and plenty of orphaned files would appear.
> 
> When you converted that "BIG" number to hexadecimal you got the peculiar 
> pattern 0x180000001800, or 0x080000000800, i.e. always the "final" 
> multiple of 2K ored into that same multiple shifted left by 32 bits!
> 

And it happens like this:

On 64-bit machines SplitInt64 shifts the operand 32 bits to the right 
in order to isolate the high-order 32-bit word. If the operand itself 
is a 32-Bit integer (as is the case with "Length()" in vol-salvage.c), 
the behaviour is undefined according to the C standard! What gcc makes 
out of it is careless or at least unhelpful in my humble opinion, but 
who am I to criticize: it shifts the operand by 32 bits, forgetting 
that on Intel the shift counter is taken modulo 32 (in 32-bit mode), 
hence the shift is a no-op. On SPARC the same thing happens, but here 
it is (odd enough) the Sun compiler who shifts by modulo 32. Probably 
I am alone in regretting that given that both got worried to the point 
of issuing a warning, neither chose the solution closest to the 
underlying mathematical principle, anything resembling 32 divisions by 
2...

Anyway, the corrupted vnode length field which repeats the length in 
two 32-bit word positions is thus explained, and the (attached) simple 
solution is to type-cast the operand to afs_int64 before the shift. 
(There are others: the construct ((i >> 16) >> 16) may fool some 
compilers in doing the right thing... and surprise people looking at 
the code).

Bcc'ed to openafs-bugs.


-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985       Fax: +41 22 767 7155

--------------050709000404040805050904
Content-Type: text/plain; name="p_splitInt_really64"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="p_splitInt_really64"

--- openafs/src/config/stds.h.o0	2005-05-08 08:01:12.000000000 +0200
+++ openafs/src/config/stds.h	2008-08-12 13:51:29.000000000 +0200
@@ -67,7 +67,7 @@
 #define NonZeroInt64(a)                (a)
 #define Int64ToInt32(a)    (a) & 0xFFFFFFFFL
 #define FillInt64(t,h,l) (t) = (h); (t) <<= 32; (t) |= (l);
-#define SplitInt64(t,h,l) (h) = (t) >> 32; (l) = (t) & 0xFFFFFFFF;
+#define SplitInt64(t,h,l) (h) = ((afs_int64)t) >> 32; (l) = (t) & 0xFFFFFFFF;
 #else /* AFS_64BIT_ENV */
 typedef long afs_int32;
 typedef unsigned long afs_uint32;

--------------050709000404040805050904--