[OpenAFS] Weird volserver problem

Brian Sebby sebby@anl.gov
Sat, 28 Jul 2007 16:10:56 -0500


We're having a strange problem that started this afternoon on one of our
fileservers, and it appears to be related to the volserver.

We have a number of jobs that run "vos release" commands, and today we
started getting error messages from them, including timeouts.  Various
"vos" commands now take a very long time to run: they eventually return
the information, but only after sitting there for several minutes.

I'm seeing a number of messages like this in the VolserLog file:

Sat Jul 28 16:02:11 2007 trans 60 on volume 1818569609 has been idle for more than 570 seconds
Sat Jul 28 16:02:11 2007 trans 55 on volume 1818569660 has been idle for more than 600 seconds
Sat Jul 28 16:02:11 2007 trans 55 on volume 1818569660 has timed out
Sat Jul 28 16:02:41 2007 trans 60 on volume 1818569609 has been idle for more than 600 seconds
Sat Jul 28 16:02:41 2007 trans 60 on volume 1818569609 has timed out

and 

Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, error -01
Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, error -01
Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, error -01
Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, error -01
Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, error -01
Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, error -01
Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, error -01
Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, error -01
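
For anyone wanting to see at a glance which transactions are stalling,
the idle and timeout lines can be grouped per transaction.  This is only
a rough Python sketch, not anything that ships with OpenAFS: the function
name summarize_volserlog is made up for illustration, and it assumes the
VolserLog lines look exactly like the ones quoted above.

```python
import re

# Patterns assume the VolserLog message format quoted in this post.
IDLE_RE = re.compile(
    r"trans (\d+) on volume (\d+) has been idle for more than (\d+) seconds"
)
TIMEOUT_RE = re.compile(r"trans (\d+) on volume (\d+) has timed out")


def summarize_volserlog(lines):
    """Map (trans, volume) -> {"max_idle": seconds, "timed_out": bool}.

    Hypothetical helper: it only looks at the idle and timed-out
    messages and ignores every other log line.
    """
    summary = {}
    for line in lines:
        idle = IDLE_RE.search(line)
        if idle:
            trans, volume, secs = (int(g) for g in idle.groups())
            entry = summary.setdefault(
                (trans, volume), {"max_idle": 0, "timed_out": False}
            )
            entry["max_idle"] = max(entry["max_idle"], secs)
            continue
        timed_out = TIMEOUT_RE.search(line)
        if timed_out:
            trans, volume = (int(g) for g in timed_out.groups())
            entry = summary.setdefault(
                (trans, volume), {"max_idle": 0, "timed_out": False}
            )
            entry["timed_out"] = True
    return summary


# The five idle/timeout lines quoted earlier in this message.
SAMPLE = """\
Sat Jul 28 16:02:11 2007 trans 60 on volume 1818569609 has been idle for more than 570 seconds
Sat Jul 28 16:02:11 2007 trans 55 on volume 1818569660 has been idle for more than 600 seconds
Sat Jul 28 16:02:11 2007 trans 55 on volume 1818569660 has timed out
Sat Jul 28 16:02:41 2007 trans 60 on volume 1818569609 has been idle for more than 600 seconds
Sat Jul 28 16:02:41 2007 trans 60 on volume 1818569609 has timed out
"""

for (trans, volume), info in sorted(summarize_volserlog(SAMPLE.splitlines()).items()):
    print(trans, volume, info["max_idle"], info["timed_out"])
```

To run it against a live log, pass an open file object instead of
SAMPLE.splitlines(); the location of VolserLog depends on how the server
was installed, so no path is hard-coded here.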

These volumes are on SAN storage, using ZFS as the backend filesystem.
We're running the 1.4.4 namei fileserver on Solaris with the -nofsync patch.

Here are the bos parameters we're using:

Instance fs, (type is fs) currently running normally.
    Auxiliary status is: file server running.
    Process last started at Sat Jul 28 15:50:38 2007 (3 proc starts)
    Last exit at Sat Jul 28 15:50:38 2007
    Command 1 is '/usr/afs/bin/fileserver -nojumbo -nofsync'
    Command 2 is '/usr/afs/bin/volserver -nojumbo -nofsync'
    Command 3 is '/usr/afs/bin/salvager'

Any help would be greatly appreciated.


Brian

-- 
Brian Sebby  (sebby@anl.gov)  |  Unix and Operation Services
Phone: +1 630.252.9935        |  Computing and Information Systems
Fax:   +1 630.252.4601        |  Argonne National Laboratory