[OpenAFS] Strange problem in 2 volumes that do not "vos move" or "vos restore"

Sat, 2 Sep 2017 17:35:39 +0100

On Sat, Sep 02, 2017 at 10:40:40AM -0400, Jeffrey Altman wrote:
> On 9/2/2017 5:38 AM, Jose M Calhariz wrote:
> > 
> > Hi,
> > 
> > I am one of the maintainers of an OpenAFS cell.  The cell runs on
> > Debian 8.x with OpenAFS 1.6.15-1 from backports.  The clients I use
> > for maintenance run Debian 8.x and Debian 9.x, i.e. OpenAFS 1.6.15-1
> > and 1.6.20-2 respectively.
> > 
> > I have 2 volumes that refuse to complete a maintenance "vos move".
> > I started to investigate one of the volumes.  To my surprise, a "vos
> > restore" of its dump also fails from a Debian 9.x client.  Then I
> > tried a "bos salvage", but it did not find a problem.  Then I upgraded
> > the fileserver to Debian 9.x and repeated the "bos salvage" and the
> > "vos restore", but the problem is still there.
> > 
> > Below are what I think are the relevant logs.  The SalsrvLog on the
> > destination of a vos move is too big, so I can send it compressed by
> > private email.  I am open to doing more tests and providing more logs.
> > The problematic volumes are preventing a shutdown of the
> > fileservers for much-needed maintenance.
> 
> In the process of assisting several AuriStorFS licensees convert their
> cells, volumes have been encountered that cannot be moved, released, nor
> dumped. The on-disk data structures for the volume were corrupted which
> resulted in failures to create a complete dump stream or data corruption
> within the dump stream.
> 
> I suggest that you focus your attention on the source server.  Any
> volume on the destination server that was partially created can be
> zapped since the volume was never deleted on the source.
> 
> First bump the log level of the volserver to 125 to increase the logging
> to the VolserLog.  Then execute "vos dump" to force the generation of a
> dump stream.  If it fails, then hopefully the problem that triggered the
> failure will be recorded in the VolserLog.  Sadly the OpenAFS logging
> for dump errors is very incomplete so the failures are often silent.
>
> If the "vos dump" command succeeds, then the contents of the stream must
> be faulty.  The VolserLog of the volserver to which a "vos restore" is
> executed might at a log level of 125 contain a clue.

Hi, I have started the volserver with a log level of 125 on both the
source and the target fileservers.  The "vos dump" runs without errors;
it is the "vos restore" to another volume and server that fails.

tail -f VolserLog
Sat Sep  2 17:16:38 2017 Starting AFS Volserver 2.0 (/usr/lib/openafs/davolserver -p 16 -udpsize 16777216 -d 125)
Sat Sep  2 17:18:46 2017 1 Volser: CreateVolume: volume 538193049 (test) created
Sat Sep  2 17:19:00 2017 1 Volser: ReadVnodes: Error writing vnode index: Invalid argument; restore aborted
Sat Sep  2 17:19:00 2017 Scheduling salvage for volume 538193049 on part /vicepa over FSSYNC
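
At log level 125 the VolserLog can grow quickly, so it may help to pull
out only the lines that matter.  A minimal sketch, assuming the log
format shown in the excerpt above; the function name volser_errors and
the default path /var/log/openafs/VolserLog are assumptions (adjust the
path to wherever the local install writes its logs):

```shell
#!/bin/sh
# Sketch: print VolserLog lines that report an aborted restore or a
# salvage scheduled by the volserver.  The two patterns are copied from
# the log excerpt above; the default log path is an assumption.
volser_errors() {
    logfile="${1:-/var/log/openafs/VolserLog}"
    grep -E 'restore aborted|Scheduling salvage' "$logfile"
}
```

Calling volser_errors with no argument reads the assumed default path;
passing a file name reads that file instead.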

> 
> At sites that have experienced these problems in the past AuriStor staff
> has either been forced to manually edit the on-disk data structures to
> correct the flaw or the customer has restored the volume from a backup.
> In many cases, the affected volumes had not been modified by end users
> for many years.  If your organization wishes professional assistance,
> feel free to contact me off list.

As the users are not complaining of errors, I think I can use user
tools to copy the data into a new volume.  But the data in these
volumes is not uniform, and some garbage in it may pass unnoticed.

I am first interested in chasing the possible bug, as this may happen
again in the future.  In short: "vos dump" and "bos salvage" are OK,
"vos restore" and "vos move" fail.
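
For anyone wanting to repeat the dump/restore test, the sequence can be
sketched roughly as follows.  This is only an illustration: the server
names, partition, source volume name and dump file path are hypothetical
placeholders, and the run/reproduce helpers are invented for this
sketch.  Only the scratch volume name "test" comes from the log above.

```shell
#!/bin/sh
# Sketch of the reproduction: take a full dump of the volume on the
# source server, then restore that dump under a scratch name on another
# server.  All default values are hypothetical placeholders.
SRC_SERVER="${SRC_SERVER:-fs1.example.org}"
DST_SERVER="${DST_SERVER:-fs2.example.org}"
PARTITION="${PARTITION:-a}"
VOLUME="${VOLUME:-problem.volume}"
DUMPFILE="${DUMPFILE:-/tmp/problem.volume.dump}"

# With DRY_RUN=1 (the default in this sketch) commands are only printed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reproduce() {
    # Full dump (-time 0); in the case reported here this step succeeds.
    run vos dump -id "$VOLUME" -time 0 -file "$DUMPFILE" \
        -server "$SRC_SERVER" -partition "$PARTITION" || return 1
    # Restore into a new volume; in the case reported here this fails.
    run vos restore -server "$DST_SERVER" -partition "$PARTITION" \
        -name test -file "$DUMPFILE"
}
```

Running reproduce as-is only prints the two commands; setting DRY_RUN=0
after filling in real values would execute them.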

What more can I do to pinpoint the problem?  What logs do you need?

> 
> Good luck.
> 
> Jeffrey Altman
> 

Kind regards
Jose M Calhariz

-- 
So Nhonho is a horizontal adult!!!

-- Chaves