[OpenAFS-Doc] Process signals -> updated fileserver man page - 2nd draft

Jason Edgecombe jason@rampaginggeek.com
Sat, 25 Aug 2007 17:32:34 -0400


This is a multi-part message in MIME format.
--------------030608020808090109060805
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Jeffrey Altman wrote:
> My guess is that 'lvldbs' and 'shortvldb' are Stanford specific scripts.
>  As such they shouldn't be referenced in a man page.
>
>   
You're right. Here is the second draft with that paragraph removed and 
including Andrew Deason's correction.

Thanks,
Jason

--------------030608020808090109060805
Content-Type: text/plain;
 name="fs-diff.txt"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="fs-diff.txt"

? doc/man-pages/readpod
? doc/man-pages/pod1/vos_convertrotorw.pod
? doc/man-pages/pod8/read_tape.pod
Index: doc/man-pages/pod8/fileserver.pod
===================================================================
RCS file: /cvs/openafs/doc/man-pages/pod8/fileserver.pod,v
retrieving revision 1.7
diff -u -r1.7 fileserver.pod
--- doc/man-pages/pod8/fileserver.pod	12 Jun 2007 03:49:56 -0000	1.7
+++ doc/man-pages/pod8/fileserver.pod	25 Aug 2007 21:32:13 -0000
@@ -17,7 +17,7 @@
     S<<< [B<-cb> <I<number of call backs>>] >>> [B<-banner>] [B<-novbc>]
     S<<< [B<-implicit> <I<admin mode bits: rlidwka>>] >>> [B<-readonly>]
     S<<< [B<-hr> <I<number of hours between refreshing the host cps>>] >>>
-    [B<-busyat> <I<< redirect clients when queue > n >>>]
+    S<<< [B<-busyat> <I<< redirect clients when queue > n >>>] >>>
     [B<-nobusy>] S<<< [B<-rxpck> <I<number of rx extra packets>>] >>>
     [B<-rxdbg>] [B<-rxdbge>] S<<< [B<-rxmaxmtu> <I<bytes>>] >>>
     S<<< [B<-rxbind> <I<address to bind the Rx socket to>>] >>>
@@ -48,9 +48,9 @@
 
 The File Server creates the F</usr/afs/logs/FileLog> log file as it
 initializes, if the file does not already exist. It does not write a
-detailed trace by default, but use the B<-d> option to increase the amount
-of detail. Use the B<bos getlog> command to display the contents of the
-log file.
+detailed trace by default, but the B<-d> option may be used to
+increase the amount of detail. Use the B<bos getlog> command to
+display the contents of the log file.
 
 The command's arguments enable the administrator to control many aspects
 of the File Server's performance, as detailed in L<OPTIONS>.  By default
@@ -68,7 +68,7 @@
 
 The maximum number of lightweight processes (LWPs) the File Server uses to
 handle requests for data; corresponds to the B<-p> argument. The File
-Server always uses a minimum of 32 KB for these processes.
+Server always uses a minimum of 32 KB of memory for these processes.
 
 =item *
 
@@ -178,12 +178,12 @@
 
 =head1 CAUTIONS
 
-Do not use the B<-k> and -w arguments, which are intended for use by the
-AFS Development group only. Changing them from their default values can
-result in unpredictable File Server behavior.  In any case, on many
-operating systems the File Server uses native threads rather than the LWP
-threads, so using the B<-k> argument to set the number of LWP threads has
-no effect.
+Do not use the B<-k> and B<-w> arguments, which are intended for use
+by the AFS Development group only. Changing them from their default
+values can result in unpredictable File Server behavior.  In any case,
+on many operating systems the File Server uses native threads rather
+than the LWP threads, so using the B<-k> argument to set the number of
+LWP threads has no effect.
 
 Do not specify both the B<-spare> and B<-pctspare> arguments. Doing so
 causes the File Server to exit, leaving an error message in the
@@ -398,6 +398,163 @@
                 -cmd "/usr/afs/bin/fileserver -pctspare 10 \
                 -L" /usr/afs/bin/volserver /usr/afs/bin/salvager
 
+
+=head1 TROUBLESHOOTING
+
+Sending process signals to the File Server Process can change its
+behavior in the following ways:
+
+
+   Process          Signal       OS     Result
+   ---------------------------------------------------------------------
+
+   File Server      XCPU        Unix    Prints a list of client IP
+                                        Addresses.
+
+   File Server      USR2      Windows   Prints a list of client IP
+                                        Addresses.
+
+   File Server      POLL        HPUX    Prints a list of client IP
+                                        Addresses.
+
+   Any Server       TSTP        Any     Raises Debug level to the next
+                                        power of 5 -- 1,5,25,125, etc.
+                                        This has the same effect as the
+                                        S<<< B<-debug> <I<XXX>> >>>
+                                        command-line option.
+
+   Any Server       HUP         Any     Resets Debug level to 0
+
+   File Server      TERM        Any     Run minor instrumentation over
+                                        the list of descriptors.
+
+   Other Servers    TERM        Any     Causes the process to quit.
+
+   File Server      QUIT        Any     Causes the File Server to Quit.
+                                        Bos Server knows this.
+
+
+The basic metric of whether an AFS file server is doing well is its
+blocked connection count, which can be found by running the following
+command:
+
+   /usr/afsws/etc/rxdebug <I<server>> | grep waiting_for | wc -l
+
+Each line returned by C<rxdebug> that contains the text "waiting_for"
+represents a blocked connection.
+
+If the blocked connection count is ever above 0, the server is having
+problems replying to clients in a timely fashion.  If it gets above
+10, roughly, users will notice the slowness. The total number of
+connections is a mostly irrelevant number that rises essentially
+monotonically for as long as the server has been running and then
+goes back down to zero when it's restarted.
+
+The most common cause of blocked connections rising on a server is
+some process somewhere performing an abnormal number of accesses to
+that server and its volumes.  If multiple servers have a blocked
+connection count, the most likely explanation is that there is a
+volume replicated between those servers that is absorbing an
+abnormally high access rate.
+
+To get an access count on all the volumes on a server, run:
+
+   vos listvol <I<server>> -long
+
+and save the output in a file.  The results will look like a bunch of
+B<vos examine> output for each volume on the server.  Look for lines
+like:
+
+   40065 accesses in the past day (i.e., vnode references)
+
+and look for volumes with an abnormally high number of accesses.
+Anything over 10,000 is fairly high, but some core infrastructure
+volumes like root.cell and other volumes close to the root of the cell
+will have that many hits routinely.  Anything over 100,000 is
+generally abnormally high.  The count resets about once a day.
+
+Another approach that can be used to narrow the possibilities for a
+replicated volume, when multiple servers are having trouble, is to
+find all replicated volumes for that server.  Run:
+
+   % vos listvldb -server <I<server>>
+
+where <I<server>> is one of the servers having problems, to refresh the
+VLDB cache, and then run:
+
+   % vos listvldb -server <I<server>> -part <I<partition>>
+
+to get a list of all volumes on that server and partition, including
+every other server with replicas.
+
+Once the volume causing the problem has been identified, the best way to
+deal with the problem is to move that volume to another server with a low
+load.  Often the volume name will be enough information to tell what's
+going on, by scanning the cluster for scripts run by that user (if it's
+a user volume) or using that program (if it's a non-user volume).
+
+If you still need additional information about who's hitting that
+server, sometimes you can guess at that information from the failed
+callbacks in the F<FileLog> log in F</usr/afs/logs> on the server, or
+from the output of:
+
+   /usr/afsws/etc/rxdebug <I<server>> -rxstats
+
+but the best way is to turn on debugging output from the file server.
+(Warning: This generates a *lot* of output into FileLog on the AFS
+server.)  To do this, log on to the AFS server, find the PID of the
+fileserver process, and do:
+
+    kill -TSTP <I<pid of file server process>>
+
+This will raise the debugging level so that you'll start seeing what
+people are actually doing on the server.  You can do this up to three
+more times to get even more output if needed.  To reset the debugging
+level back to normal, use the following command (it will NOT
+terminate the file server):
+
+    kill -HUP <I<pid of file server process>>
+
+The debugging setting on the File Server should be reset back to
+normal when debugging is no longer needed; otherwise the AFS
+server may well fill its disks with debugging output.
+
+The lines of the debugging output that are most useful for debugging
+load problems are:
+
+    SAFS_FetchStatus,  Fid = 2003828163.77154.82248, Host 171.64.15.76
+    SRXAFS_FetchData, Fid = 2003828163.77154.82248
+
+(The example above is partly truncated to highlight the interesting
+information.)  The Fid identifies the volume and vnode within the
+volume; the volume ID is the first long number.  So, for example, this
+was:
+
+   % vos examine 2003828163
+   pubsw.matlab61                   2003828163 RW    1040060 K  On-line
+       afssvr5.Stanford.EDU /vicepa 
+       RWrite 2003828163 ROnly 2003828164 Backup 2003828165 
+       MaxQuota    3000000 K 
+       Creation    Mon Aug  6 16:40:55 2001
+       Last Update Tue Jul 30 19:00:25 2002
+       86181 accesses in the past day (i.e., vnode references)
+
+       RWrite: 2003828163    ROnly: 2003828164    Backup: 2003828165
+       number of sites -> 3
+          server afssvr5.Stanford.EDU partition /vicepa RW Site 
+          server afssvr11.Stanford.EDU partition /vicepd RO Site 
+          server afssvr5.Stanford.EDU partition /vicepa RO Site 
+
+and from the Host information one can tell what system is accessing that
+volume.
+
+Note that the output of L<vos_examine(1)> also includes the access count,
+so once the problem has been identified, B<vos examine> can be used to
+see if the access count is still increasing.  Also remember that you
+can run B<vos examine> on the read-only replica (e.g.,
+pubsw.matlab61.readonly) to see the access counts on the read-only
+replica on all of the servers that it's located on.
+
 =head1 PRIVILEGE REQUIRED
 
 The issuer must be logged in as the superuser C<root> on a file server
@@ -413,7 +570,8 @@
 L<bos_getlog(8)>,
 L<fs_setacl(1)>,
 L<salvager(8)>,
-L<volserver(8)>
+L<volserver(8)>,
+L<vos_examine(1)>
 
 =head1 COPYRIGHT
 

--------------030608020808090109060805--
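
For anyone who wants to try the two checks from the TROUBLESHOOTING text
in the patch above, here is a rough sketch of them as shell filters. It is
not part of the patch. The function names count_blocked and busy_volumes
are made up for illustration, and the default threshold of 100000 is simply
the "abnormally high" figure quoted in the text. The second filter assumes
that each volume in the "vos listvol -long" output starts with a header
line of the form "name ID RW|RO|BK ...", as in the vos examine sample.

```shell
#!/bin/sh
# Sketch only: both filters read saved command output on stdin, so they
# can be tried without access to a live AFS cell.

# Count blocked connections, i.e. the lines of rxdebug output that
# contain the text "waiting_for".
count_blocked() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps that from looking like a failure to the caller.
    grep -c 'waiting_for' || true
}

# Print "accesses volume-name" for every volume whose daily access count
# exceeds a threshold (hypothetical default: 100000).
# Assumption: the header line of each volume has the volume name in
# field 1, a numeric volume ID in field 2 and the type in field 3.
busy_volumes() {
    threshold="${1:-100000}"
    awk -v limit="$threshold" '
        # Remember the name from the most recent volume header line.
        $2 ~ /^[0-9]+$/ && $3 ~ /^(RW|RO|BK)$/ { name = $1 }
        # Report the access count when it is above the limit.
        /accesses in the past day/ && $1 + 0 > limit { print $1, name }
    '
}
```

With these defined in the current shell, the pipelines from the text
become, for example:

   /usr/afsws/etc/rxdebug <server> | count_blocked
   vos listvol <server> -long | busy_volumes 10000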