[OpenAFS-Doc] updated fileserver.pod
Jason Edgecombe
jason@rampaginggeek.com
Wed, 29 Aug 2007 22:19:32 -0400
Hi,
Here is my patch for the fileserver.pod page. I added a blurb about
volumes and /vicepX/AlwaysAttach.
Jason
? doc/man-pages/readpod
? doc/man-pages/pod1/vos_convertrotorw.pod
? doc/man-pages/pod1/vos_copy.pod
? doc/man-pages/pod8/read_tape.pod
Index: doc/man-pages/pod8/fileserver.pod
===================================================================
RCS file: /cvs/openafs/doc/man-pages/pod8/fileserver.pod,v
retrieving revision 1.7
diff -u -r1.7 fileserver.pod
--- doc/man-pages/pod8/fileserver.pod 12 Jun 2007 03:49:56 -0000 1.7
+++ doc/man-pages/pod8/fileserver.pod 30 Aug 2007 02:16:31 -0000
@@ -17,7 +17,7 @@
S<<< [B<-cb> <I<number of call backs>>] >>> [B<-banner>] [B<-novbc>]
S<<< [B<-implicit> <I<admin mode bits: rlidwka>>] >>> [B<-readonly>]
S<<< [B<-hr> <I<number of hours between refreshing the host cps>>] >>>
- [B<-busyat> <I<< redirect clients when queue > n >>>]
+ S<<< [B<-busyat> <I<< redirect clients when queue > n >>>] >>>
[B<-nobusy>] S<<< [B<-rxpck> <I<number of rx extra packets>>] >>>
[B<-rxdbg>] [B<-rxdbge>] S<<< [B<-rxmaxmtu> <I<bytes>>] >>>
S<<< [B<-rxbind> <I<address to bind the Rx socket to>>] >>>
@@ -48,9 +48,9 @@
The File Server creates the F</usr/afs/logs/FileLog> log file as it
initializes, if the file does not already exist. It does not write a
-detailed trace by default, but use the B<-d> option to increase the amount
-of detail. Use the B<bos getlog> command to display the contents of the
-log file.
+detailed trace by default, but the B<-d> option may be used to
+increase the amount of detail. Use the B<bos getlog> command to
+display the contents of the log file.
The command's arguments enable the administrator to control many aspects
of the File Server's performance, as detailed in L<OPTIONS>. By default
@@ -68,7 +68,7 @@
The maximum number of lightweight processes (LWPs) the File Server uses to
handle requests for data; corresponds to the B<-p> argument. The File
-Server always uses a minimum of 32 KB for these processes.
+Server always uses a minimum of 32 KB of memory for these processes.
=item *
@@ -168,6 +168,16 @@
that it can take that long for changed group memberships to become
effective. To change this frequency, use the B<-hr> argument.
+The File Server stores volumes in partitions. A partition is a
+filesystem or directory on the server machine named F</vicepX> or
+F</vicepXX>, where X is "a" through "z" and XX is "aa" through "zz".
+The File Server expects each F</vicepXX> directory to be on a
+dedicated filesystem, and it only attaches a F</vicepXX> directory
+that is a mount point for another filesystem, unless the file
+F</vicepXX/AlwaysAttach> exists. The data in the partition is stored
+in a special format that can only be accessed using OpenAFS commands
+or an OpenAFS client.
+
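+For example, to make the File Server attach a directory that is not a
+mount point (the directory name here is only illustrative), create an
+F<AlwaysAttach> file in it:
+
+ % mkdir /vicepb
+ % touch /vicepb/AlwaysAttach
+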
The File Server generates the following message when a partition is nearly
full:
@@ -178,12 +188,12 @@
=head1 CAUTIONS
-Do not use the B<-k> and -w arguments, which are intended for use by the
-AFS Development group only. Changing them from their default values can
-result in unpredictable File Server behavior. In any case, on many
-operating systems the File Server uses native threads rather than the LWP
-threads, so using the B<-k> argument to set the number of LWP threads has
-no effect.
+Do not use the B<-k> and B<-w> arguments, which are intended for use
+by the AFS Development group only. Changing them from their default
+values can result in unpredictable File Server behavior. In any case,
+on many operating systems the File Server uses native threads rather
+than the LWP threads, so using the B<-k> argument to set the number of
+LWP threads has no effect.
Do not specify both the B<-spare> and B<-pctspare> arguments. Doing so
causes the File Server to exit, leaving an error message in the
@@ -398,6 +408,163 @@
-cmd "/usr/afs/bin/fileserver -pctspare 10 \
-L" /usr/afs/bin/volserver /usr/afs/bin/salvager
+
+=head1 TROUBLESHOOTING
+
+Sending process signals to the File Server process can change its
+behavior in the following ways:
+
+ Process        Signal  OS       Result
+ -------------------------------------------------------------------
+ File Server    XCPU    Unix     Prints a list of client IP
+                                 addresses.
+
+ File Server    USR2    Windows  Prints a list of client IP
+                                 addresses.
+
+ File Server    POLL    HP-UX    Prints a list of client IP
+                                 addresses.
+
+ Any server     TSTP    Any      Increases the debug level to the
+                                 next power of 5 (1, 5, 25, 125,
+                                 and so on); this has the same
+                                 effect as the -d command-line
+                                 option.
+
+ Any server     HUP     Any      Resets the debug level to 0.
+
+ File Server    TERM    Any      Runs minor instrumentation over
+                                 the list of descriptors.
+
+ Other servers  TERM    Any      Causes the process to quit.
+
+ File Server    QUIT    Any      Causes the File Server to quit;
+                                 the BOS Server uses this signal
+                                 to shut it down.
+
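+For example, on a Unix server you can trigger the client IP address
+listing from the command line (substitute the File Server's real
+PID):
+
+ % kill -XCPU <pid of file server process>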
+
+The basic metric of whether an AFS File Server is doing well is its
+blocked connection count, which can be found by running the following
+command:
+
+ % /usr/afsws/etc/rxdebug <server> | grep waiting_for | wc -l
+
+Each line of B<rxdebug> output that contains the text "waiting_for"
+represents a blocked connection.
+
+If the blocked connection count is ever above 0, the server is having
+trouble replying to clients in a timely fashion. If it gets above 10
+or so, users will see noticeable slowness. The total number of
+connections is mostly irrelevant; it increases essentially
+monotonically for as long as the server has been running and drops
+back to zero when the server is restarted.
+
+The most common cause of a rising blocked connection count is some
+process somewhere performing an abnormal number of accesses to that
+server and its volumes. If multiple servers have blocked
+connections, the most likely explanation is a volume replicated
+between those servers that is absorbing an abnormally high access
+rate.
+
+To get an access count on all the volumes on a server, run:
+
+ % vos listvol <server> -long
+
+and save the output in a file. The results will look like a bunch of
+B<vos examine> output for each volume on the server. Look for lines
+like:
+
+ 40065 accesses in the past day (i.e., vnode references)
+
+and look for volumes with an abnormally high number of accesses.
+Anything over 10,000 is fairly high, although some core
+infrastructure volumes like root.cell and other volumes close to the
+root of the cell will routinely get that many hits. Anything over
+100,000 is generally abnormally high. The count resets about once a
+day.
+
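+One quick way to pull out the highest access counts is (a sketch,
+assuming the B<vos listvol> output was saved to a file named
+F<listvol.out>):
+
+ % grep accesses listvol.out | sort -n | tail
+
+Then search F<listvol.out> for the largest counts to see which
+volumes they belong to.
+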
+Another approach that can help narrow down the possibilities when
+multiple servers are having trouble is to find all of the replicated
+volumes on one of those servers. Run:
+
+ % vos listvldb -server <server>
+
+where I<server> is one of the servers having problems, to refresh the
+VLDB cache, and then run:
+
+ % vos listvldb -server <server> -partition <partition>
+
+to get a list of all volumes on that server and partition, including
+every other server with replicas.
+
+Once the volume causing the problem has been identified, the best way
+to deal with it is to move that volume to another server with a low
+load. Often the volume name alone is enough to tell what's going on,
+either by scanning the cluster for scripts run by that user (if it's
+a user volume) or for processes using that program (if it's a
+non-user volume).
+
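+A sketch of such a move using B<vos move> (fill in the volume name
+and the source and destination sites for your cell):
+
+ % vos move <volume> <fromserver> <frompartition> <toserver> <topartition>
+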
+If you still need additional information about who's hitting that
+server, sometimes you can guess from the failed callbacks recorded in
+the F</usr/afs/logs/FileLog> log file on the server, or from the
+output of:
+
+ % /usr/afsws/etc/rxdebug <server> -rxstats
+
+but the best way is to turn on debugging output from the File Server.
+(Warning: this generates a B<lot> of output into F<FileLog> on the
+server.) To do this, log on to the server, find the PID of the
+B<fileserver> process, and run:
+
+ % kill -TSTP <pid of file server process>
+
+This raises the debugging level so that you start seeing what clients
+are actually doing on the server. You can repeat this up to three
+more times to get even more output if needed. To reset the debugging
+level back to normal, use the following command (it will I<not>
+terminate the File Server):
+
+ % kill -HUP <pid of file server process>
+
+The debugging setting on the File Server should be reset back to
+normal when debugging is no longer needed; otherwise, the server may
+well fill its disk with debugging output.
+
+The lines of the debugging output that are most useful for debugging
+load problems are:
+
+ SRXAFS_FetchStatus, Fid = 2003828163.77154.82248, Host 171.64.15.76
+ SRXAFS_FetchData, Fid = 2003828163.77154.82248
+
+(The example above is partly truncated to highlight the interesting
+information.) The Fid identifies the volume and the vnode within the
+volume; the volume is the first long number. So, for example, this
+was:
+
+ % vos examine 2003828163
+ pubsw.matlab61 2003828163 RW 1040060 K On-line
+ afssvr5.Stanford.EDU /vicepa
+ RWrite 2003828163 ROnly 2003828164 Backup 2003828165
+ MaxQuota 3000000 K
+ Creation Mon Aug 6 16:40:55 2001
+ Last Update Tue Jul 30 19:00:25 2002
+ 86181 accesses in the past day (i.e., vnode references)
+
+ RWrite: 2003828163 ROnly: 2003828164 Backup: 2003828165
+ number of sites -> 3
+ server afssvr5.Stanford.EDU partition /vicepa RW Site
+ server afssvr11.Stanford.EDU partition /vicepd RO Site
+ server afssvr5.Stanford.EDU partition /vicepa RO Site
+
+and from the Host information one can tell what system is accessing that
+volume.
+
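+For example, one way to map that host address back to a machine name
+(the address is the one from the sample log line above):
+
+ % host 171.64.15.76
+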
+Note that the output of L<vos_examine(1)> also includes the access
+count, so once the problem has been identified, B<vos examine> can be
+used to see whether the access count is still increasing. Also
+remember that you can run B<vos examine> on the read-only replica
+(e.g., C<pubsw.matlab61.readonly>) to see the access counts on all of
+the servers that house it, for example:
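+
+ % vos examine pubsw.matlab61.readonly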
+
=head1 PRIVILEGE REQUIRED
The issuer must be logged in as the superuser C<root> on a file server
@@ -413,7 +580,8 @@
L<bos_getlog(8)>,
L<fs_setacl(1)>,
L<salvager(8)>,
-L<volserver(8)>
+L<volserver(8)>,
+L<vos_examine(1)>
=head1 COPYRIGHT