[OpenAFS-devel] How to create inconsistency in the volserver and my mind.

Harald Barth haba@pdc.kth.se
Thu, 17 Mar 2005 12:33:12 +0100 (MET)


> I suppose it's possible you could construct something that does this using 
> the convert-RO-to-RW functionality that is in very recent servers.  But I'd 
> have to think about it for a lot longer to convince myself that this would 
> actually be stable.

Yes. Something like that would be nice.

> Those aren't error messages; they're log messages.  They are normal.  The 
> -overwrite switch doesn't mean the volume already exists; it tells vos what 
> to do _if_ the volume already exists.  The way it tells that is by trying 
> to create the volume and looking at the error code.

The problem is that they look dangerous to the unsuspecting sysadmin.
"Abort, abort - all brace for impact" ;-)
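
The try-first behaviour described in the quote above can be sketched in
a few lines of Python. This is only an illustration of the pattern, not
vos or volserver code: the dictionary standing in for the volume store,
the VolumeExists exception and both function names are hypothetical.

```python
class VolumeExists(Exception):
    """Hypothetical error raised when the volume id is already in use."""


def create_volume(store, vol_id, name):
    # Stand-in for the create call: fails if the id is taken.
    if vol_id in store:
        raise VolumeExists(vol_id)
    store[vol_id] = name


def restore(store, vol_id, name, overwrite=False):
    """Try to create first; only react to 'already exists' afterwards."""
    log = []
    try:
        create_volume(store, vol_id, name)
        log.append("created")
    except VolumeExists:
        if not overwrite:
            raise
        # The overwrite switch only matters on this path.
        del store[vol_id]
        log.append("deleted")
        create_volume(store, vol_id, name)
        log.append("created")
    return log
```

With an existing volume and overwrite=True the sketch logs a delete
followed by a create, which is the same order as the Delete and
CreateVolume lines in the volserver log quoted below.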

> > Tue Mar 15 11:05:19 2005 1 Volser: Delete: volume 537057012 deleted
> > Tue Mar 15 11:05:19 2005 1 Volser: CreateVolume: volume 537057012 (dah.test.flopp) created
> > Tue Mar 15 11:05:19 2005 1 Volser: RestoreVolume: Error reading header file for dump; aborted
> >
> > And this is the log from the broken -overwrite full which results in
> > the vl-volser inconsistency.

> Yeah, that makes sense.  The error is referring to the volserver's 
> inability to read the dump header over the wire, which is not surprising 
> since in your example, vos will never send one.

And here, aborted actually means it fell over.

> > Failed to get info about server's -2098337598 address(es) from vlserver
> > (err=0)

> -2098337598 is 0x82EDE8C2 or 130.237.232.194, houting.pdc.kth.se

Which has been in the vlserver for a long time. 
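
The conversion quoted above (signed 32-bit number to hex to dotted quad)
can be reproduced with a short Python sketch. The function name is made
up for illustration; it assumes the number printed by vos is the address
in network byte order read as a signed 32-bit integer, as the quoted
values suggest.

```python
import ipaddress


def signed32_to_ipv4(value):
    """Reinterpret a signed 32-bit integer as an IPv4 address."""
    unsigned = value & 0xFFFFFFFF  # undo the sign: two's complement to unsigned
    return unsigned, str(ipaddress.IPv4Address(unsigned))


unsigned, dotted = signed32_to_ipv4(-2098337598)
print(hex(unsigned), dotted)
```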

> You'll note the message in question says (err=0).  This message actually 
> shouldn't be printed at all in that case, but the conditional was 
> inadvertently removed between src/volser/vsprocs.c versions 1.15 and 1.16, 
> in DELTA no-copy-libafs-builds-20021015.  What this delta has to do with 
> changing the way errors are reported in vsprocs, I do not know.

Ooopsi.


> > cysteine# tail -1 BosLog
> > Mon Mar 14 18:15:08 2005: fs:vol exited on signal 6

> What version and platform? 

We are running OpenAFS 1.3.77, built 2005-01-18, on i386 RH9.

Seems to be the threaded beast:

cysteine# ldd  /usr/openafs/libexec/openafs/volserver 
        libpthread.so.0 => /lib/i686/libpthread.so.0 (0x4001e000)
        libresolv.so.2 => /lib/libresolv.so.2 (0x4006f000)
        libc.so.6 => /lib/i686/libc.so.6 (0x40081000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

What is the current status of OpenAFS and Linux threads? I know the
thread situation on Linux sucks in general, just tell me your best
practice, ok? :-)

> How long was it running before it exited?

From Thu Mar 10 20:41:11 to Mon Mar 14 18:15:08.

It exited at either the first vos backup or the first vos dump
operation from our backup scripts, which are invoked at 18:15:00.
The scripts seem to need about 8 secs to ask TSM what is
already backed up.

> Actually, signal 6 is SIGIOT, which generally means an abort.
> It's possible an abort message was written, but went out to the beginning 
> of the log file instead of the end (stdout and stderr don't share a file 
> position)

Nope, did not find anything useful anywhere else in the file either :-(
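
The effect described in the quote (a message landing at the beginning of
the log file) can be demonstrated with a small Python sketch. It assumes
the two streams are separate opens of the same file without O_APPEND;
the function name and the log text are invented for the demonstration
and say nothing about how the server actually opens its log.

```python
import os
import tempfile


def demo_independent_offsets():
    """Write through two separate opens of one file and return the result."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    out = os.open(path, os.O_WRONLY)  # plays the role of stdout
    err = os.open(path, os.O_WRONLY)  # plays the role of stderr, own offset
    try:
        os.write(out, b"normal log line 1\nnormal log line 2\n")
        # err has never been written to, so its offset is still 0 and
        # this overwrites the start of the file instead of appending.
        os.write(err, b"ABORT")
    finally:
        os.close(out)
        os.close(err)
    with open(path, "rb") as f:
        data = f.read()
    os.unlink(path)
    return data


print(demo_independent_offsets())
```

The abort text replaces the first bytes of the file rather than
appearing after the last log line, which is why looking only at the
tail of a log can miss it.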

Harald.