[OpenAFS] DAFS Salvager failure

Jack Neely jjneely@pams.ncsu.edu
Thu, 18 Oct 2012 22:40:31 -0400


Folks,

One of our AFS file servers crashed this afternoon.  OpenAFS 1.6.1 on
RHEL 6 with kernel 2.6.32-279.9.1.el6.x86_64.  It looks like the
salvager hung and eventually the dafileserver stopped responding to
clients.

We've rebooted, fsck'd the ext4 partitions, and finally ran
dasalvager -force by hand to attempt to correctly salvage the server.
In every case, once the dafs instance starts up it serves requests and
dispatches a volume salvage or four; then all the salvager processes
get stuck and we start all over again.  We've salvaged the server
multiple times at this point -- our next hope is that we can restart
the file server with the traditional file server process.  (BTW, 2 and
3 GiB cores from dafileserver and dasalvager abound.)

SalsrvLog messages are usually along the following lines:

10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit
'FSSYNC'; attempting reconnect to server
10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit
'FSSYNC'; attempting reconnect to server
10/18/2012 17:55:11 SYNC_ask: too many / too latent fatal protocol
errors on circuit 'FSSYNC'; giving up (tries 1 timeout 1350597266)
10/18/2012 17:55:11 FSYNC_askfs: internal FSSYNC protocol error 2
10/18/2012 17:55:11 AskOffline:  request for fileserver to take volume
offline failed; trying again...
10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit
'FSSYNC'; attempting reconnect to server
10/18/2012 17:55:11 SYNC_ask: too many / too latent fatal protocol
errors on circuit 'FSSYNC'; giving up (tries 1 timeout 1350597265)
10/18/2012 17:55:11 FSYNC_askfs: internal FSSYNC protocol error 2
10/18/2012 17:55:11 AskOffline:  request for fileserver to take volume
offline failed; trying again...
10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'

or

10/18/2012 22:20:49 dispatching child to salvage volume 540007729...
10/18/2012 22:19:33 SYNC_ask: No response on circuit 'FSSYNC'
10/18/2012 22:19:33 SYNC_ask: protocol communications failure on circuit
'FSSYNC'; attempting reconnect to server
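
In case it helps anyone compare notes, here is a minimal Python sketch
for summarizing a log like the SalsrvLog excerpts above.  It assumes the
MM/DD/YYYY HH:MM:SS timestamp format shown here and that any line
without a timestamp is a wrapped continuation of the previous entry.
The unwrap and tally names are made up for illustration; nothing here
is part of OpenAFS.

```python
import re
from collections import Counter

# Timestamp format as it appears in the excerpts: MM/DD/YYYY HH:MM:SS
STAMP = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} ")


def unwrap(text):
    """Rejoin entries that were wrapped onto a second line."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if STAMP.match(line) or not entries:
            entries.append(line.strip())
        else:
            # No timestamp: treat as a continuation of the previous entry.
            entries[-1] += " " + line.strip()
    return entries


def tally(text):
    """Count entries by the text before the first colon after the timestamp."""
    counts = Counter()
    for entry in unwrap(text):
        body = STAMP.sub("", entry, count=1)
        source = body.split(":", 1)[0].strip()
        counts[source] += 1
    return counts


SAMPLE = """\
10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit
'FSSYNC'; attempting reconnect to server
10/18/2012 17:55:11 FSYNC_askfs: internal FSSYNC protocol error 2
10/18/2012 17:55:11 AskOffline:  request for fileserver to take volume
offline failed; trying again...
"""

if __name__ == "__main__":
    for source, n in tally(SAMPLE).most_common():
        print(source, n)
```

Entries with no colon after the timestamp (such as the "dispatching
child" line) are counted under their full text, which is crude but
keeps the sketch short.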

and from FileLog (this looks like I'm restoring from backups)

Thu Oct 18 22:25:30 2012 FSYNC_com:  invalid protocol version
(2574739029)
Thu Oct 18 22:25:30 2012 FSYNC_com:  invalid protocol version
(3774863615)
Thu Oct 18 22:25:30 2012 FSYNC_com:  invalid protocol version
(944130375)
Thu Oct 18 22:25:30 2012 Volume 539458481 now offline, must be salvaged.
Thu Oct 18 22:25:30 2012 Scheduling salvage for volume 539458481 on part
/vicepb over SALVSYNC
Thu Oct 18 22:25:31 2012 nUsers == 0, but header not on LRU
Thu Oct 18 22:25:31 2012 SYNC_getCom:  error receiving command
Thu Oct 18 22:25:31 2012 Scheduling salvage for volume 539894230 on part
/vicepb over SALVSYNC
Thu Oct 18 22:25:31 2012 FSYNC_com:  read failed; dropping connection
(cnt=103291)
Thu Oct 18 22:25:37 2012 FSYNC_com:  invalid protocol version
(2023862981)
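
The "invalid protocol version" values differ on every line.  A small
Python sketch like the one below pulls them out of a FileLog excerpt
and prints them in hex, which may make it easier to judge whether they
look like arbitrary bytes.  It assumes only the message wording shown
above; the bad_versions name is hypothetical, and the sketch makes no
claim about what a valid version value would be.

```python
import re

# The number may sit on the next line because of wrapping, so allow
# any whitespace (including a newline) before the parenthesis.
VERSION = re.compile(r"invalid protocol version\s*\((\d+)\)")


def bad_versions(text):
    """Return the rejected version numbers in the order they appear."""
    return [int(v) for v in VERSION.findall(text)]


SAMPLE = """\
Thu Oct 18 22:25:30 2012 FSYNC_com:  invalid protocol version
(2574739029)
Thu Oct 18 22:25:30 2012 FSYNC_com:  invalid protocol version
(3774863615)
Thu Oct 18 22:25:30 2012 FSYNC_com:  invalid protocol version
(944130375)
"""

if __name__ == "__main__":
    for v in bad_versions(SAMPLE):
        print(f"{v:>10}  0x{v:08x}")
```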

I've checked: all my binaries are from my 1.6.1 build.  What's going on?

Jack Neely

-- 
Jack Neely <jjneely@ncsu.edu>
Linux Czar, OIT Campus Linux Services
Office of Information Technology, NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4  EA6B 213B 765F 3B6A 5B89