[OpenAFS-devel] Re: [OpenAFS] Volume root corruptions - anybody seen those? Explained
and corrected
Rainer Toebbicke
rtb@pclella.cern.ch
Tue, 12 Aug 2008 16:39:26 +0200
--------------050709000404040805050904
Content-Type: text/plain; charset="ISO-8859-15"; format=flowed
Content-Transfer-Encoding: 7bit
In June I reported volumes corrupted by the salvager. Hartmut Reuter
fixed the problem a while ago suspecting a big/little endian problem
when setting the length of a salvaged directory (-salvagedir option).
This wasn't the case, but the error was in that area and his patch
fixes it!
For completeness, here's a complementary patch that fixes the
SplitInt64 macro which is the culprit, on all 64-bit machines:
Here's a little refresh:
>
> on a few (1/1000) volumes the length of the volume root directory would
> be flagged as something astronomically big, to be corrected to 6144,
> 8192 or other more modest multiples of 2k. And sure enough, doing the
> real salvage the root would then be logically truncated (not to say
> completely messed up) and plenty of orphaned files would appear.
>
> When you converted that "BIG" number to hexadecimal you got the peculiar
> pattern 0x180000001800, or 0x080000000800, i.e. always the "final"
> multiple of 2K ored into that same multiple shifted left by 32 bits!
>
And it happens like this:
On 64-bit machines SplitInt64 shifts the operand 32 bits to the right
in order to isolate the high-order 32-bit word. If the operand itself
is a 32-Bit integer (as is the case with "Length()" in vol-salvage.c),
the behaviour is undefined according to the C standard! What gcc makes
out of it is careless or at least unhelpful in my humble opinion, but
who am I to criticize: it shifts the operand by 32 bits, forgetting
that on Intel the shift counter is taken modulo 32 (in 32-bit mode),
hence the shift is a no-op. On SPARC the same thing happens, but here
it is (odd enough) the Sun compiler who shifts by modulo 32. Probably
I am alone in regretting that given that both got worried to the point
of issuing a warning, neither chose the solution closest to the
underlying mathematical principle, anything resembling 32 divisions by
2...
Anyway, the corrupted vnode length field which repeats the length in
two 32-bit word positions is thus explained, and the (attached) simple
solution is to type-cast the operand to afs_int64 before the shift.
(There are others: the construct ((i >> 16) >> 16) may fool some
compilers in doing the right thing... and surprise people looking at
the code).
Bcc'ed to openafs-bugs.
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985 Fax: +41 22 767 7155
--------------050709000404040805050904
Content-Type: text/plain; name="p_splitInt_really64"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="p_splitInt_really64"
--- openafs/src/config/stds.h.o0 2005-05-08 08:01:12.000000000 +0200
+++ openafs/src/config/stds.h 2008-08-12 13:51:29.000000000 +0200
@@ -67,7 +67,7 @@
#define NonZeroInt64(a) (a)
#define Int64ToInt32(a) (a) & 0xFFFFFFFFL
#define FillInt64(t,h,l) (t) = (h); (t) <<= 32; (t) |= (l);
-#define SplitInt64(t,h,l) (h) = (t) >> 32; (l) = (t) & 0xFFFFFFFF;
+#define SplitInt64(t,h,l) (h) = ((afs_int64)t) >> 32; (l) = (t) & 0xFFFFFFFF;
#else /* AFS_64BIT_ENV */
typedef long afs_int32;
typedef unsigned long afs_uint32;
--------------050709000404040805050904--