[OpenAFS] Weird volserver problem

Brian Sebby sebby@anl.gov
Mon, 30 Jul 2007 18:13:40 -0500


Thanks for the information.  We applied the patch to our server code and are
now running that on our fileservers.

I have done some more research, and have been able to trace the problem to
a specific activity - doing vos dumps.  Unfortunately, I don't have any
debugging information, but I would like to know if anyone else has seen
problems like this.

On Saturday, one of our systems began to do remote vos dumps of all of our
AFS data for disaster recovery purposes.  It has been doing this for almost
a year now, but this was the first weekend that it was doing that since
I migrated our AFS servers to the new servers that I set up.  Before, we
had been running the 1.2.11 code (with the big ticket patch) on our
fileservers.  We never saw any problems during the vos dumps against those
servers.  It was not until we started doing the dumps against the new 1.4.4
servers that the vos problems emerged.

As I mentioned, the symptoms of the problem is that all vos operations - 
vos examine, vos release, etc, were extremely slow.  It would take several
minutes for any of the commands to return, even for volumes that weren't
being backed up at that time (but which were on the same server).  This
problem seemed to only affect the volserver - I was able to read and write
files from the affected volumes with no problems.

Unfortunately, I killed the remote backup job prior to installing the
volserver clone patch, so I'm not sure if that would have fixed the problem.
I'm hesitant to start the remote job again since I don't want to break my
AFS servers.

We also run a number of vos dump jobs on the new servers that dump the
local volumes to the local disk every night.  During the times these
dumps are running, I have seen a number of messages like the following 
in the VolserLog:

Mon Jul 30 07:04:39 2007 trans 18756 on volume 1818562471 is older than 1350 seconds
Mon Jul 30 07:05:09 2007 trans 18756 on volume 1818562471 is older than 1380 seconds
Mon Jul 30 07:05:39 2007 trans 18756 on volume 1818562471 is older than 1410 seconds

During the remote backup, I also saw the following errors.  I have not seen
these during the time the local vos dumps were being performed:

Sat Jul 28 19:39:46 2007 trans 1861 on volume 1818558958 has timed out
Sat Jul 28 19:39:46 2007 trans 1851 on volume 537038355 has timed out
Sat Jul 28 19:41:16 2007 trans 1875 on volume 536872291 has timed out

Is this something new in 1.4.4?  As I said, we never saw this problem with
our old 1.2.11 fileservers.

Thanks again to Derrick and others for help; I really appreciate it.


Brian

On Sat, Jul 28, 2007 at 05:18:43PM -0400, Derrick J Brashear wrote:
> you probably want the volserver clone locking patch in (i'm guessing)
> src/vol/clone.c since 1.4.4
> 
> On Sat, 28 Jul 2007, Brian Sebby wrote:
> 
> >We're having a strange problem that just started happening this afternoon
> >on one of our fileservers that appears to be related to the volserver.
> >
> >We have a number of jobs that perform vos release commands, and today we
> >started getting error messages from them indicating that they were timing
> >out, etc.  Trying to run various "vos" commands takes forever, and although
> >they eventually return the information, they sit there for several minutes
> >before they succeed.
> >
> >I'm seeing a number of messages like this in the VolserLog file:
> >
> >Sat Jul 28 16:02:11 2007 trans 60 on volume 1818569609 has been idle for 
> >more than 570 seconds
> >Sat Jul 28 16:02:11 2007 trans 55 on volume 1818569660 has been idle for 
> >more than 600 seconds
> >Sat Jul 28 16:02:11 2007 trans 55 on volume 1818569660 has timed out
> >Sat Jul 28 16:02:41 2007 trans 60 on volume 1818569609 has been idle for 
> >more than 600 seconds
> >Sat Jul 28 16:02:41 2007 trans 60 on volume 1818569609 has timed out
> >
> >and
> >
> >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, 
> >error
> >-01
> >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, 
> >error
> >-01
> >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, 
> >error
> >-01
> >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, 
> >error
> >-01
> >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, 
> >error
> >-01
> >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, 
> >error
> >-01
> >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, 
> >error
> >-01
> >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, 
> >error
> >-01
> >
> >These volumes are on SAN storage, using ZFS as the backend fileserver.
> >We're running the 1.4.4 namei fileserver on Solaris with the -nofsync 
> >patch.
> >
> >Here are the bos parameters we're using:
> >
> >Instance fs, (type is fs) currently running normally.
> >   Auxiliary status is: file server running.
> >   Process last started at Sat Jul 28 15:50:38 2007 (3 proc starts)
> >   Last exit at Sat Jul 28 15:50:38 2007
> >   Command 1 is '/usr/afs/bin/fileserver -nojumbo -nofsync'
> >   Command 2 is '/usr/afs/bin/volserver -nojumbo -nofsync'
> >   Command 3 is '/usr/afs/bin/salvager'
> >
> >Any help would be greatly appreciated.
> >
> >
> >Brian
> >
> >-- 
> >Brian Sebby  (sebby@anl.gov)  |  Unix and Operation Services
> >Phone: +1 630.252.9935        |  Computing and Information Systems
> >Fax:   +1 630.252.4601        |  Argonne National Laboratory
> >_______________________________________________
> >OpenAFS-info mailing list
> >OpenAFS-info@openafs.org
> >https://lists.openafs.org/mailman/listinfo/openafs-info
> >
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info

-- 
Brian Sebby  (sebby@anl.gov)  |  Unix and Operation Services
Phone: +1 630.252.9935        |  Computing and Information Systems
Fax:   +1 630.252.4601        |  Argonne National Laboratory