[OpenAFS-devel] How to create inconsistency in the volserver and my mind.

Jeffrey Hutzelman jhutz@cmu.edu
Thu, 17 Mar 2005 13:08:02 -0500


On Thursday, March 17, 2005 12:33:12 PM +0100 Harald Barth 
<haba@pdc.kth.se> wrote:

>
>> I suppose it's possible you could construct something that does this
>> using the convert-RO-to-RW functionality that is in very recent
>> servers.  But I'd have to think about it for a lot longer to convince
>> myself that this would actually be stable.
>
> Yes. Something like that would be nice.
>
>> Those aren't error messages; they're log messages.  They are normal.
>> The -overwrite switch doesn't mean the volume already exists; it tells
>> vos what to do _if_ the volume already exists.  The way it tells that
>> is by trying to create the volume and looking at the error code.
>
> The problem is that they look dangerous to the non-suspecting sysadmin.
> "Abort, abort - all brace for impact" ;-)

The non-suspecting sysadmin needs to get out of the habit of assuming that 
any output produced by any program must be a horrible fatal error.  Solve 
that problem, and then we can talk about whether the messages are 
meaningful enough.
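
The try-to-create-then-look-at-the-error-code behaviour described in the
quoted text can be sketched as a toy model. Everything here is hypothetical:
ToyVolserver, restore() and the log strings are illustrative stand-ins, not
the OpenAFS RPC interface, and the mode names are modelled on the
"-overwrite full" seen in this thread. The point is only that an "aborted"
line in the server log is a normal part of discovering that the volume
already exists.

```python
class ToyVolserver:
    """Hypothetical in-memory stand-in for a volume server (not the OpenAFS API)."""

    def __init__(self):
        self.volumes = {}   # volume name -> bytes restored so far
        self.log = []       # what a server-side log might record

    def create_volume(self, name):
        if name in self.volumes:
            # The server aborts the call; the caller learns "already exists".
            self.log.append(f"CreateVolume: {name} already exists; aborted")
            raise FileExistsError(name)
        self.volumes[name] = b""
        self.log.append(f"CreateVolume: {name} created")

    def delete_volume(self, name):
        del self.volumes[name]
        self.log.append(f"Delete: {name} deleted")


def restore(server, name, dump, overwrite="abort"):
    """Restore `dump` into `name`; `overwrite` only matters if the volume exists."""
    if overwrite not in ("abort", "full", "incremental"):
        raise ValueError(f"unknown overwrite mode: {overwrite}")
    try:
        # Existence is discovered by attempting the create, not by asking first.
        server.create_volume(name)
    except FileExistsError:
        if overwrite == "abort":
            raise
        if overwrite == "full":
            server.delete_volume(name)
            server.create_volume(name)
        # "incremental": keep the existing contents and add to them.
    server.volumes[name] += dump


if __name__ == "__main__":
    server = ToyVolserver()
    restore(server, "dah.test.flopp", b"first")
    restore(server, "dah.test.flopp", b"second", overwrite="full")
    for line in server.log:
        print(line)
```

In this sketch the second restore succeeds even though the log contains an
"aborted" line, which is the situation the quoted log messages describe.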


>> > Tue Mar 15 11:05:19 2005 1 Volser: Delete: volume 537057012 deleted
>> > Tue Mar 15 11:05:19 2005 1 Volser: CreateVolume: volume 537057012 (dah.test.flopp) created
>> > Tue Mar 15 11:05:19 2005 1 Volser: RestoreVolume: Error reading header file for dump; aborted
>> >
>> > And this is the log from the broken -overwrite full which results in
>> > the vl-volser inconsistency.
>
>> Yeah, that makes sense.  The error is referring to the volserver's
>> inability to read the dump header over the wire, which is not
>> surprising, since in your example vos will never send one.
>
> And here, aborted actually means it fell over.

No, it means the volserver aborted the RPC, just like the first case. 
Before, the operation it was aborting was CreateVolume; in this example, 
it's RestoreVolume.  Really, people who want to know the result of a 
command they ran with vos should look at the output of vos, not the 
contents of the volserver log.
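
As a minimal sketch of judging a command by its own result rather than by a
server log, the helper below runs a command and reports success from its
exit status and captured output. The name run_and_check is hypothetical, and
the sketch assumes the command signals failure with a non-zero exit status;
the demo runs a harmless Python one-liner, not vos.

```python
import subprocess
import sys


def run_and_check(argv):
    """Run argv; return (succeeded, combined stdout and stderr text)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    # Assumption: a non-zero exit status means the command failed.
    return proc.returncode == 0, proc.stdout + proc.stderr


if __name__ == "__main__":
    # Stand-in command; substitute the real command line you actually ran.
    ok, output = run_and_check([sys.executable, "-c", "print('done')"])
    print("succeeded" if ok else "failed")
    print(output.strip())
```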


>> > cysteine# tail -1 BosLog
>> > Mon Mar 14 18:15:08 2005: fs:vol exited on signal 6
>
>> What version and platform?
>
> We are running OpenAFS 1.3.77, built 2005-01-18, on i386 RH9.
>
> Seems to be the threaded beast:

Well, then, that kills my theory that it's the 25-day bug, which only 
affects LWP processes, and apparently only on fairly new Linux.



> cysteine# ldd  /usr/openafs/libexec/openafs/volserver
>         libpthread.so.0 => /lib/i686/libpthread.so.0 (0x4001e000)
>         libresolv.so.2 => /lib/libresolv.so.2 (0x4006f000)
>         libc.so.6 => /lib/i686/libc.so.6 (0x40081000)
>         /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
>
> What is the current status about OpenAFS and Linux threads? I know the
> thread situation on Linux sucks in general, just tell me your best
> practice, ok? :-)

Ok.  My best practice is to run fileservers on SPARC Solaris, thereby 
avoiding the Linux threads mess, the horrible kludge that is the namei 
fileserver, and all sorts of other problems that the rest of you have seen.
:-)

Really, I can't tell you much about OpenAFS and Linux threads.  Maybe 
Derrick can field that one.


-- Jeff