[OpenAFS] DAFS Salvager failure

Jeffrey Altman jaltman@secure-endpoints.com
Fri, 19 Oct 2012 12:03:51 -0400



If you have core files from dasalvager and dafileserver then the
processes have terminated abnormally.   If you have an OpenAFS support
provider I suggest you contact them with a support request.

Note that this mailing list is likely to be very quiet over the next
24 to 48 hours as the core developers are in transit due to the end
of the European AFS and Kerberos Conference.

If you do not have a support provider, please open a ticket in OpenAFS
RT by sending mail to openafs-bugs@openafs.org.  Please include in the
report stack traces obtained from the core files.  They will provide the
first clue as to what is failing, since nothing is evident in the log
files.  Be sure to also look at the *.old log files.
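
As a rough sketch of how the stack traces might be pulled out of a core
file with gdb: the helper name and the paths in the usage note are
placeholders (assumptions, not taken from this thread), so substitute
the binaries and core files from your own installation.

```shell
#!/bin/sh
# Sketch only: print a full backtrace of every thread in a core file.
# GDB may be overridden in the environment; it defaults to plain "gdb".
GDB=${GDB:-gdb}

# backtrace BINARY CORE
#   BINARY must be the exact executable that produced CORE, otherwise
#   the symbols will not line up.
backtrace() {
    "$GDB" -batch -ex "thread apply all bt full" "$1" "$2"
}
```

For example, `backtrace /path/to/dafileserver /path/to/core > dafileserver.bt.txt 2>&1`
(both paths hypothetical) writes a trace that can be attached to the RT
ticket.  Debugging symbols for the 1.6.1 build make the traces far more
useful.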

Jeffrey Altman


On Thursday, October 18, 2012 10:40:31 PM, Jack Neely wrote:
> Folks,
>
> One of our AFS file servers crashed this afternoon.  OpenAFS 1.6.1 on
> RHEL 6 with kernel 2.6.32-279.9.1.el6.x86_64.  It looks like the
> salvager hung and eventually the dafileserver stopped responding to
> clients.
>
> We've rebooted, fsck'd the ext4 partitions, and finally ran the
> dasalvager -force by hand to attempt to correctly salvage the server.
> In all cases once the dafs instance starts up, it serves requests, it
> dispatches a volume salvage or 4, all the salvager processes get stuck
> and we start all over again.  We've salvaged the server multiple times
> at this point -- our next hope is that we can restart the file server
> with the traditional file server process.  (BTW, 2 and 3 GiB cores from
> dafileserver and dasalvager abound.)
>
> SalsrvLog messages are usually along the following lines:
>
> 10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
> 10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit
> 'FSSYNC'; attempting reconnect to server
> 10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
> 10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit
> 'FSSYNC'; attempting reconnect to server
> 10/18/2012 17:55:11 SYNC_ask: too many / too latent fatal protocol
> errors on circuit 'FSSYNC'; giving up (tries 1 timeout 1350597266)
> 10/18/2012 17:55:11 FSYNC_askfs: internal FSSYNC protocol error 2
> 10/18/2012 17:55:11 AskOffline:  request for fileserver to take volume
> offline failed; trying again...
> 10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
> 10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit
> 'FSSYNC'; attempting reconnect to server
> 10/18/2012 17:55:11 SYNC_ask: too many / too latent fatal protocol
> errors on circuit 'FSSYNC'; giving up (tries 1 timeout 1350597265)
> 10/18/2012 17:55:11 FSYNC_askfs: internal FSSYNC protocol error 2
> 10/18/2012 17:55:11 AskOffline:  request for fileserver to take volume
> offline failed; trying again...
> 10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
>
> or
>
> 10/18/2012 22:20:49 dispatching child to salvage volume 540007729...
> 10/18/2012 22:19:33 SYNC_ask: No response on circuit 'FSSYNC'
> 10/18/2012 22:19:33 SYNC_ask: protocol communications failure on circuit
> 'FSSYNC'; attempting reconnect to server
>
> and from FileLog (this looks like I'm restoring from backups)
>
> Thu Oct 18 22:25:30 2012 FSYNC_com:  invalid protocol version
> (2574739029)
> Thu Oct 18 22:25:30 2012 FSYNC_com:  invalid protocol version
> (3774863615)
> Thu Oct 18 22:25:30 2012 FSYNC_com:  invalid protocol version
> (944130375)
> Thu Oct 18 22:25:30 2012 Volume 539458481 now offline, must be salvaged.
> Thu Oct 18 22:25:30 2012 Scheduling salvage for volume 539458481 on part
> /vicepb over SALVSYNC
> Thu Oct 18 22:25:31 2012 nUsers == 0, but header not on LRU
> Thu Oct 18 22:25:31 2012 SYNC_getCom:  error receiving command
> Thu Oct 18 22:25:31 2012 Scheduling salvage for volume 539894230 on part
> /vicepb over SALVSYNC
> Thu Oct 18 22:25:31 2012 FSYNC_com:  read failed; dropping connection
> (cnt=103291)
> Thu Oct 18 22:25:37 2012 FSYNC_com:  invalid protocol version
> (2023862981)
>
> I've checked, all my binaries are from my 1.6.1 build.  What's going on?
>
> Jack Neely
>

