[OpenAFS] Problems with fsck on Solaris 9

Douglas E. Engert deengert@anl.gov
Fri, 12 Nov 2004 07:33:13 -0600


Stephen Joyce wrote:
> Doug (and anyone else w/ knowledge wrt Solaris),
> 
> I've still got fsck problems on solaris 9... I downloaded 1.2.13, but it
> doesn't appear to have Doug's patches or the solaris interleave patch
> applied... so I retrieved the patches from CVS, applied them, and built
> from source (without problems).
> 
> However I'm still getting the following error:
> ----Open AFS (R) openafs 1.2.13 fsck----
> ** /dev/rdsk/c1t0d0s3
> BAD SUPER BLOCK: VALUES IN SUPER BLOCK DISAGREE WITH THOSE IN FIRST ALTERNATE
> USE AN ALTERNATE SUPER-BLOCK TO SUPPLY NEEDED INFORMATION;
> eg. fsck [-F ufs] -o b=# [special ...]
> where # is the alternate super block. SEE fsck_ufs(1M).

We have not seen this here, but it sounds like your partitions are much larger than ours.

It looks like the code reads the main superblock and an alternate superblock,
then resets, in the alternate block, the selected fields from the main block
that it knows will be different. It then does a memcmp at vfsck/setup.c:755 to
see whether the two are identical. This happens very early in the process.

So the problem could be a real inconsistency on disk, the wrong block being read,
or a selected field that was not reset before the compare.
I would suspect the last.

Can you make some changes to src/vfsck/setup.c to dump the
two blocks, or at least print the offsets at which they differ?
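
For example (just a sketch, not the actual setup.c code; "sblk" and "asblk"
here stand in for whatever buffers setup.c keeps the primary and alternate
superblocks in), a loop like this dropped in just before the memcmp would
show exactly where the two copies disagree:

    #include <stdio.h>

    /* Sketch only: print every byte offset where the primary and
     * first-alternate superblock copies differ. */
    static void
    dump_sb_diff(const unsigned char *sblk, const unsigned char *asblk,
                 size_t sbsize)
    {
        size_t i;

        for (i = 0; i < sbsize; i++) {
            if (sblk[i] != asblk[i])
                printf("superblocks differ at offset %lu: 0x%02x vs 0x%02x\n",
                       (unsigned long)i, sblk[i], asblk[i]);
        }
    }

Knowing the offsets would tell us which fs_* fields are involved, and whether
it is a field the reset code should have copied over.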

Something else to try: run the AFS and Sun fsck under truss to see
which blocks they are reading. That is how I found the other bug.
Be aware, though, that the Sun fsck may corrupt any AFS volumes on the disk.

Something else to try is creating a smaller partition. There might be
a problem converting a block number to a byte offset when the result
needs more than 32 bits.
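
To illustrate what I mean (a toy example, not the OpenAFS code; the 8K
fragment size and the fragment number are assumptions): a fragment near the
end of a 1.3TB array gives a byte offset that needs about 41 bits, so anything
held in a 32-bit daddr_t gets silently truncated:

    #include <stdio.h>

    int
    main(void)
    {
        long long frag   = 160000000;      /* ~160 million 8K fragments is about 1.3TB */
        long long offset = frag * 8192LL;  /* byte offset of that fragment */

        printf("true byte offset:     %lld\n", offset);
        printf("truncated to 32 bits: %u\n", (unsigned int)offset);
        return 0;
    }

If a truncated value like that ends up being passed to the read routine, fsck
would be comparing the superblock against whatever happens to live at the
wrong offset, which would explain the message you are seeing.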


> 
> This is the same error I was getting previously.  I have newfs'ed all of
> the partitions, and on reboot the system pronounced the partitions OK.
> However, after creating a single new volume, subsequent reboots exhibit the
> same fsck error.
> 
> Interestingly, if I refrain from mounting the drives at boot-time, mount
> them manually, and restart the fileserver, the partition looks OK and
> the data intact.  Running solaris' /usr/lib/fs/ufs/fsck on one of the
> (empty) partitions -- yes, I know it destroys any data present --
> pronounces the filesystem clean.
> 
> Assuming that there's nothing unique(*) about my circumstances, and my
> hardware is not failing in subtle ways, it seems that the disk is
> actually OK and openafs' fsck is still confused.  Or is it possible I'm
> overlooking some other change?
> 
> Any help is appreciated.
> 
> 
>>uname -v
> 
> Generic_117171-11
> 
> (*) My /vicepX partitions are on an external promise raid array.  The total
> disk size is 1.3TB, divided into (7) 200GB partitions.  No errors are
> apparent and it appears to function normally when used as a plain UFS disk.
> 
> Cheers,
> Stephen
> 
> If voting could really change things, it would be illegal.
> 
> On Fri, 5 Nov 2004, Douglas E. Engert wrote:
> 
> 
>>I sent in a bug report and patch on 11/2; see bug 15927.
>>Basically it adds a prototype for bread and bwrite to fsck.h.
>>
>>You may also need the patch to src/vfsck/setup.c added to CVS in August to
>>get it to compile on Solaris 9 if your sys/fs/ufs_fs.h has been updated.
>>
>>
>>
>>Stephen Joyce wrote:
>>
>>>Thanks for working on this.  Is there a solution yet?  I have a development
>>>machine (solaris 9, openafs 1.2.11) which I patched last night (before
>>>reading the archives--doh!) and it appears to have the same, or a similar,
>>>problem (it was fine before applying the newest patches):
>>>
>>>
>>>The system is coming up.  Please wait.
>>>The /vicepa file system (/dev/rdsk/c1t0d0s0) is being checked.
>>>----Open AFS (R) openafs 1.2.11 fsck----
>>>/dev/rdsk/c1t0d0s0: /dev/rdsk/c1t0d0s0: BAD SUPER BLOCK: VALUES IN SUPER BLOCK DISAGREE WITH THOSE IN FIRST ALTERNATE
>>>
>>>/dev/rdsk/c1t0d0s0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
>>>
>>>WARNING - Unable to repair one or more of the following filesystem(s):
>>>        /dev/rdsk/c1t0d0s0
>>>Run fsck manually (fsck filesystem...).
>>>Exit the shell when done to continue the boot process.
>>>
>>>
>>>(using an alternate superblock doesn't work either).
>>>
>>>While a fix for fsck would be great, if anyone knows exactly which patch
>>>to back out, please let me know.  If not, I'll be glad to start trying the
>>>likely suspects--I just don't want to duplicate effort.
>>>
>>>Cheers,
>>>Stephen
>>>
>>>
>>>On Tue, 2 Nov 2004, Douglas E. Engert wrote:
>>>
>>>
>>>
>>>>I think I found the cause of the fsck problem. I am concerned that if
>>>>the fsck is run against the cooked device rather than the raw
>>>>device, it could actually cause damage, rather than doing nothing
>>>>and failing.
>>>>
>>>>Solaris 9 in ufs_fs.h changes fsbtodb:
>>>>
>>>> #ifdef KERNEL
>>>> #define fsbtodb(fs, b)  (((daddr_t)(b)) << (fs)->fs_fsbtodb)
>>>> #else /* KERNEL */
>>>> #define fsbtodb(fs, b)  (((diskaddr_t)(b)) << (fs)->fs_fsbtodb)
>>>> #endif /* KERNEL */
>>>>
>>>>Previous versions had:
>>>>
>>>> #define fsbtodb(fs, b)  ((b) << (fs)->fs_fsbtodb)
>>>>
>>>>Note the cast to diskaddr_t, which is a long long.
>>>>
>>>>vfsck/setup.c uses this in calls to bread in src/utilities.c,
>>>>but bread expects a daddr_t, which is a long.
>>>>
>>>>Hence the mismatch. There is no common declaration of bread
>>>>for the compiler to catch it.
>>>>
>>>>This causes the read to fail with the wrong address and wrong length,
>>>>and fsck does not do anything useful.
>>>>
>>>>The mismatch needs to be fixed. A related problem is that the Solaris
>>>>fsck is using large file support, but the AFS vfsck is not.
>>>>
>>>>This was found using truss on an empty file system, running
>>>>the Solaris fsck and the AFS vfsck.
>>>>
>>>>I will be looking at a fix later today.
>>>>
>>>>Derrick J Brashear wrote:
>>>>
>>>>
>>>>>On Sun, 31 Oct 2004, Brian Sebby wrote:
>>>>>
>>>>>
>>>>>
>>>>>># fsck /vicepa
>>>>>>----Open AFS (R) openafs 1.2.11 fsck----
>>>>>>** /dev/rdsk/c0t9d0s0
>>>>>>
>>>>>>CANNOT READ: BLK 0
>>>>>>CONTINUE? [yn] y
>>>>>
>>>>>
>>>>>fsck the cooked device (/dev/dsk/c0t9d0s0). You may need to use a
>>>>>wrapper or to patch vfsck to do it.
>>>>
>>>>That appears to cover up the problem, as it will still read the wrong
>>>>block, just with an arbitrary length.  When using the raw device, the
>>>>length has to be a multiple of the block size; it was not, because the
>>>>length was wrong, and that is what caused the failure.
>>>>
>>>>Running it this way could cause damage later if blocks were written
>>>>to the wrong locations.
>>>>
>>>>
>>>>
>>>>>You should have mentioned this was the error the other night; it would
>>>>>have jogged my memory.
>>>>>
>>>>
>>
> 
> 
> 
> 

-- 

  Douglas E. Engert  <DEEngert@anl.gov>
  Argonne National Laboratory
  9700 South Cass Avenue
  Argonne, Illinois  60439
  (630) 252-5444