[OpenAFS] DAFS Salvager failure

Jack Neely jjneely@ncsu.edu
Fri, 19 Oct 2012 13:44:11 -0400


Thanks Jeffrey!

I've created [rt.central.org #131372] with a follow-up.

At this point this one server is running the traditional fileserver.
There were 3 volumes that would not come online -- they even caused the
traditional salvager to crash.  We restored those from tape and,
finally, the server is up and running.

The FSSYNC errors were the only thing in the log messages that seemed to
correlate with the dasalvager getting stuck.  Well, and the core files.
The backtraces indicate the dafileserver called osi_Panic from the
FSSYNC-related functions.
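
For anyone wanting to check the same correlation in their own logs, here
is a minimal sketch that tallies FSSYNC-related entries by the routine
that logged them.  It assumes the "MM/DD/YYYY HH:MM:SS" timestamp prefix
seen in the SalsrvLog excerpts quoted below; the function names
(entries, tally_fssync) are made up for illustration and are not part of
OpenAFS.

```python
import re
from collections import Counter

# Assumed entry prefix, taken from the SalsrvLog excerpts: "MM/DD/YYYY HH:MM:SS "
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) (.*)$")


def entries(lines):
    """Rejoin wrapped lines into (timestamp, message) pairs.

    A line without a timestamp prefix is treated as a continuation of the
    previous entry (mail clients wrap long log lines).
    """
    out = []
    for raw in lines:
        line = raw.rstrip("\n")
        m = STAMP.match(line)
        if m:
            out.append([m.group(1), m.group(2)])
        elif out and line.strip():
            out[-1][1] += " " + line.strip()
    return [tuple(e) for e in out]


def tally_fssync(lines):
    """Count entries that mention FSSYNC, keyed by the text before the first colon."""
    counts = Counter()
    for _, msg in entries(lines):
        if "FSSYNC" in msg:
            routine = msg.split(":", 1)[0]
            counts[routine] += 1
    return counts
```

Feeding it an open log file, e.g. tally_fssync(open(path)) where path
points at your SalsrvLog, gives a per-routine count that can be compared
against the times the salvager processes stopped making progress.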

Jack

On Fri, Oct 19, 2012 at 12:03:51PM -0400, Jeffrey Altman wrote:
> If you have core files from dasalvager and dafileserver then the
> processes have terminated abnormally.   If you have an OpenAFS support
> provider I suggest you contact them with a support request.
> 
> Note that this mailing list is likely to be very quiet over the next
> 24 to 48 hours as the core developers are in transit due to the end
> of the European AFS and Kerberos Conference.
> 
> If you do not have a support provider, please open a ticket in OpenAFS
> RT by sending mail to openafs-bugs@openafs.org.  Please include in the
> report stack traces obtained from the core files.  They will provide the
> first clue as to what is failing since nothing is evident in the log 
> files.
> Be sure to also look at the *.old log files.
> 
> Jeffrey Altman
> 
> 
> On Thursday, October 18, 2012 10:40:31 PM, Jack Neely wrote:
> > Folks,
> >
> > One of our AFS file servers crashed this afternoon.  OpenAFS 1.6.1 on
> > RHEL 6 with kernel 2.6.32-279.9.1.el6.x86_64.  It looks like the
> > salvager hung and eventually the dafileserver stopped responding to
> > clients.
> >
> > We've rebooted, fsck'd the ext4 partitions, and finally ran the
> > dasalvager -force by hand to attempt to correctly salvage the server.
> > In all cases once the dafs instance starts up, it serves requests, it
> > dispatches a volume salvage or 4, all the salvager processes get stuck
> > and we start all over again.  We've salvaged the server multiple times
> > at this point -- our next hope is that we can restart the file server
> > with the traditional file server process.  (BTW, 2 and 3 GiB cores from
> > dafileserver and dasalvager abound.)
> >
> > SalsrvLog messages are usually along the following lines:
> >
> > 10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
> > 10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit
> > 'FSSYNC'; attempting reconnect to server
> > 10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
> > 10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit
> > 'FSSYNC'; attempting reconnect to server
> > 10/18/2012 17:55:11 SYNC_ask: too many / too latent fatal protocol
> > errors on circuit 'FSSYNC'; giving up (tries 1 timeout 1350597266)
> > 10/18/2012 17:55:11 FSYNC_askfs: internal FSSYNC protocol error 2
> > 10/18/2012 17:55:11 AskOffline:  request for fileserver to take volume
> > offline failed; trying again...
> > 10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
> > 10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit
> > 'FSSYNC'; attempting reconnect to server
> > 10/18/2012 17:55:11 SYNC_ask: too many / too latent fatal protocol
> > errors on circuit 'FSSYNC'; giving up (tries 1 timeout 1350597265)
> > 10/18/2012 17:55:11 FSYNC_askfs: internal FSSYNC protocol error 2
> > 10/18/2012 17:55:11 AskOffline:  request for fileserver to take volume
> > offline failed; trying again...
> > 10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
> >
> > or
> >
> > 10/18/2012 22:20:49 dispatching child to salvage volume 540007729...
> > 10/18/2012 22:19:33 SYNC_ask: No response on circuit 'FSSYNC'
> > 10/18/2012 22:19:33 SYNC_ask: protocol communications failure on circuit
> > 'FSSYNC'; attempting reconnect to server
> >
> > and from FileLog (this looks like I'm restoring from backups)
> >
> > Thu Oct 18 22:25:30 2012 FSYNC_com:  invalid protocol version
> > (2574739029)
> > Thu Oct 18 22:25:30 2012 FSYNC_com:  invalid protocol version
> > (3774863615)
> > Thu Oct 18 22:25:30 2012 FSYNC_com:  invalid protocol version
> > (944130375)
> > Thu Oct 18 22:25:30 2012 Volume 539458481 now offline, must be salvaged.
> > Thu Oct 18 22:25:30 2012 Scheduling salvage for volume 539458481 on part
> > /vicepb over SALVSYNC
> > Thu Oct 18 22:25:31 2012 nUsers == 0, but header not on LRU
> > Thu Oct 18 22:25:31 2012 SYNC_getCom:  error receiving command
> > Thu Oct 18 22:25:31 2012 Scheduling salvage for volume 539894230 on part
> > /vicepb over SALVSYNC
> > Thu Oct 18 22:25:31 2012 FSYNC_com:  read failed; dropping connection
> > (cnt=103291)
> > Thu Oct 18 22:25:37 2012 FSYNC_com:  invalid protocol version
> > (2023862981)
> >
> > I've checked, all my binaries are from my 1.6.1 build.  What's going on?
> >
> > Jack Neely
> >
> 



-- 
Jack Neely <jjneely@ncsu.edu>
Linux Czar, OIT Campus Linux Services
Office of Information Technology, NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4  EA6B 213B 765F 3B6A 5B89