[OpenAFS-Doc] Process signals -> updated fileserver man page
- 2nd draft
Jason Edgecombe
jason@rampaginggeek.com
Sat, 25 Aug 2007 17:32:34 -0400
Jeffrey Altman wrote:
> My guess is that 'lvldbs' and 'shortvldb' are Stanford specific scripts.
> As such they shouldn't be referenced in a man page.
>
>
You're right. Here is the second draft with that paragraph removed and
Andrew Deason's correction included.
Thanks,
Jason
? doc/man-pages/readpod
? doc/man-pages/pod1/vos_convertrotorw.pod
? doc/man-pages/pod8/read_tape.pod
Index: doc/man-pages/pod8/fileserver.pod
===================================================================
RCS file: /cvs/openafs/doc/man-pages/pod8/fileserver.pod,v
retrieving revision 1.7
diff -u -r1.7 fileserver.pod
--- doc/man-pages/pod8/fileserver.pod 12 Jun 2007 03:49:56 -0000 1.7
+++ doc/man-pages/pod8/fileserver.pod 25 Aug 2007 21:32:13 -0000
@@ -17,7 +17,7 @@
S<<< [B<-cb> <I<number of call backs>>] >>> [B<-banner>] [B<-novbc>]
S<<< [B<-implicit> <I<admin mode bits: rlidwka>>] >>> [B<-readonly>]
S<<< [B<-hr> <I<number of hours between refreshing the host cps>>] >>>
- [B<-busyat> <I<< redirect clients when queue > n >>>]
+ S<<< [B<-busyat> <I<< redirect clients when queue > n >>>] >>>
[B<-nobusy>] S<<< [B<-rxpck> <I<number of rx extra packets>>] >>>
[B<-rxdbg>] [B<-rxdbge>] S<<< [B<-rxmaxmtu> <I<bytes>>] >>>
S<<< [B<-rxbind> <I<address to bind the Rx socket to>>] >>>
@@ -48,9 +48,9 @@
The File Server creates the F</usr/afs/logs/FileLog> log file as it
initializes, if the file does not already exist. It does not write a
-detailed trace by default, but use the B<-d> option to increase the amount
-of detail. Use the B<bos getlog> command to display the contents of the
-log file.
+detailed trace by default, but the B<-d> option may be used to
+increase the amount of detail. Use the B<bos getlog> command to
+display the contents of the log file.
The command's arguments enable the administrator to control many aspects
of the File Server's performance, as detailed in L<OPTIONS>. By default
@@ -68,7 +68,7 @@
The maximum number of lightweight processes (LWPs) the File Server uses to
handle requests for data; corresponds to the B<-p> argument. The File
-Server always uses a minimum of 32 KB for these processes.
+Server always uses a minimum of 32 KB of memory for these processes.
=item *
@@ -178,12 +178,12 @@
=head1 CAUTIONS
-Do not use the B<-k> and -w arguments, which are intended for use by the
-AFS Development group only. Changing them from their default values can
-result in unpredictable File Server behavior. In any case, on many
-operating systems the File Server uses native threads rather than the LWP
-threads, so using the B<-k> argument to set the number of LWP threads has
-no effect.
+Do not use the B<-k> and B<-w> arguments, which are intended for use
+by the AFS Development group only. Changing them from their default
+values can result in unpredictable File Server behavior. In any case,
+on many operating systems the File Server uses native threads rather
+than the LWP threads, so using the B<-k> argument to set the number of
+LWP threads has no effect.
Do not specify both the B<-spare> and B<-pctspare> arguments. Doing so
causes the File Server to exit, leaving an error message in the
@@ -398,6 +398,163 @@
-cmd "/usr/afs/bin/fileserver -pctspare 10 \
-L" /usr/afs/bin/volserver /usr/afs/bin/salvager
+
+=head1 TROUBLESHOOTING
+
+Sending process signals to the File Server and other server processes
+can change their behavior in the following ways:
+
+ Process         Signal  OS       Result
+ -------------------------------------------------------------------
+
+ File Server     XCPU    Unix     Prints a list of client IP
+                                  addresses.
+
+ File Server     USR2    Windows  Prints a list of client IP
+                                  addresses.
+
+ File Server     POLL    HPUX     Prints a list of client IP
+                                  addresses.
+
+ Any server      TSTP    Any      Increases the debug level in powers
+                                  of 5: 1, 5, 25, 125, and so on.
+                                  This has the same effect as the
+                                  -d <debug level> command-line
+                                  option.
+
+ Any server      HUP     Any      Resets the debug level to 0.
+
+ File Server     TERM    Any      Runs minor instrumentation over the
+                                  list of descriptors.
+
+ Other servers   TERM    Any      Causes the process to quit.
+
+ File Server     QUIT    Any      Causes the File Server to quit;
+                                  the BOS Server knows to use this
+                                  signal when stopping it.
+
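+For example, on a Unix system the list of client IP addresses can be
+requested from a running File Server by sending it XCPU (a sketch;
+substitute the actual PID of the fileserver process):
+
+ % kill -XCPU <pid of file server process>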
+
+The basic metric of whether an AFS file server is doing well is its
+blocked connection count, which can be found by running the following
+command:
+
+ % /usr/afsws/etc/rxdebug <server> | grep waiting_for | wc -l
+
+Each line returned by B<rxdebug> that contains the text "waiting_for"
+represents a blocked connection.
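+
+To check the blocked connection count on several file servers at once,
+a small shell loop may help (a sketch; F<servers.txt> is an assumed
+plain-text list of file server host names, one per line):
+
+ % for s in `cat servers.txt`; do \
+     printf '%s: ' "$s"; \
+     /usr/afsws/etc/rxdebug $s | grep waiting_for | wc -l; \
+   done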
+
+If the blocked connection count is ever above 0, the server is having
+problems replying to clients in a timely fashion. If it gets above 10
+or so, users will see noticeable slowness. The total number of
+connections, by contrast, is mostly irrelevant; it increases more or
+less monotonically for as long as the server has been running and
+drops back to zero when the server is restarted.
+
+The most common cause of a rising blocked connection count on a server
+is some process somewhere performing an abnormal number of accesses to
+that server and its volumes. If multiple servers show blocked
+connections, the most likely explanation is that a volume replicated
+between those servers is absorbing an abnormally high access rate.
+
+To get an access count on all the volumes on a server, run:
+
+ % vos listvol <server> -long
+
+and save the output in a file. The results resemble B<vos examine>
+output for each volume on the server. Look for lines like:
+
+ 40065 accesses in the past day (i.e., vnode references)
+
+and look for volumes with an abnormally high number of accesses.
+Anything over 10,000 is fairly high, but some core infrastructure
+volumes like root.cell and other volumes close to the root of the
+cell will have that many hits routinely. Anything over 100,000 is
+generally abnormally high. The count resets about once a day.
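+
+To surface the busiest volumes quickly, the saved B<vos listvol>
+output can be reduced to a sorted list of access counts and volume
+names (a sketch; the file name F<listvol.out> is arbitrary and the
+field positions assume the standard listing format, as in the example
+output shown later in this section):
+
+ % awk '/On-line/ {vol=$1} /accesses in the past day/ {print $1, vol}' \
+     listvol.out | sort -rn | head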
+
+Another approach that can narrow down the possibilities for a
+replicated volume, when multiple servers are having trouble, is to
+find all replicated volumes for that server. Run:
+
+ % vos listvldb -server <server>
+
+where <I<server>> is one of the servers having problems, to refresh
+the VLDB cache, and then run:
+
+ % vos listvldb -server <server> -part <partition>
+
+to get a list of all volumes on that server and partition, including
+every other server with replicas.
+
+Once the volume causing the problem has been identified, the best way
+to deal with it is to move that volume to another server with a low
+load. Often the volume name alone is enough to tell what is going on,
+either by scanning the cluster for scripts run by that user (if it is
+a user volume) or for jobs using that program (if it is a non-user
+volume).
+
+If you still need additional information about who's hitting that
+server, sometimes you can guess at it from the failed callbacks
+recorded in the F</usr/afs/logs/FileLog> log on the server, or from
+the output of:
+
+ % /usr/afsws/etc/rxdebug <server> -rxstats
+
+but the best way is to turn on debugging output from the File Server.
+(Warning: this generates a I<lot> of output in F<FileLog> on the AFS
+server.) To do this, log on to the AFS server, find the PID of the
+fileserver process, and run:
+
+ % kill -TSTP <pid of file server process>
+
+This raises the debugging level so that you start seeing what people
+are actually doing on the server. You can do this up to three more
+times to get even more output if needed. To reset the debugging level
+back to normal, use the following command (it will NOT terminate the
+File Server):
+
+ % kill -HUP <pid of file server process>
+
+The debugging setting on the File Server should be reset back to
+normal when debugging is no longer needed; otherwise, the AFS server
+may well fill its disks with debugging output.
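+
+While the debugging level is raised, the new output can be followed
+in real time; for example (assuming the default log location
+documented above):
+
+ % tail -f /usr/afs/logs/FileLog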
+
+The lines of the debugging output that are most useful for debugging
+load problems are:
+
+ SAFS_FetchStatus, Fid = 2003828163.77154.82248, Host 171.64.15.76
+ SRXAFS_FetchData, Fid = 2003828163.77154.82248
+
+(The example above is partly truncated to highlight the interesting
+information.) The Fid identifies the volume and the vnode within the
+volume; the volume ID is the first long number. So, for example, this
+was:
+
+ % vos examine 2003828163
+ pubsw.matlab61 2003828163 RW 1040060 K On-line
+ afssvr5.Stanford.EDU /vicepa
+ RWrite 2003828163 ROnly 2003828164 Backup 2003828165
+ MaxQuota 3000000 K
+ Creation Mon Aug 6 16:40:55 2001
+ Last Update Tue Jul 30 19:00:25 2002
+ 86181 accesses in the past day (i.e., vnode references)
+
+ RWrite: 2003828163 ROnly: 2003828164 Backup: 2003828165
+ number of sites -> 3
+ server afssvr5.Stanford.EDU partition /vicepa RW Site
+ server afssvr11.Stanford.EDU partition /vicepd RO Site
+ server afssvr5.Stanford.EDU partition /vicepa RO Site
+
+and from the Host information one can tell what system is accessing that
+volume.
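+
+As a rough way to see which volumes dominate the debugging output,
+the volume IDs in the Fid fields can be counted (a sketch; it assumes
+log lines in the format shown above and the default log location):
+
+ % grep 'Fid = ' /usr/afs/logs/FileLog | \
+     sed 's/.*Fid = \([0-9]*\)\..*/\1/' | sort | uniq -c | sort -rn | head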
+
+Note that the output of L<vos_examine(1)> also includes the access
+count, so once the problem has been identified, B<vos examine> can be
+used to see whether the access count is still increasing. Also
+remember that you can run B<vos examine> on the read-only replica,
+e.g. pubsw.matlab61.readonly, to see the access counts on the
+read-only replica on all of the servers it is located on.
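+
+For example:
+
+ % vos examine pubsw.matlab61.readonly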
+
=head1 PRIVILEGE REQUIRED
The issuer must be logged in as the superuser C<root> on a file server
@@ -413,7 +570,8 @@
L<bos_getlog(8)>,
L<fs_setacl(1)>,
L<salvager(8)>,
-L<volserver(8)>
+L<volserver(8)>,
+L<vos_examine(1)>
=head1 COPYRIGHT