[OpenAFS] when clients come up faster than servers

Jeffrey Hutzelman jhutz@cmu.edu
Mon, 17 Nov 2003 20:20:21 -0500


On Monday, November 17, 2003 14:34:47 -0600 steve rader 
<rader@ginseng.hep.wisc.edu> wrote:

>
> When my power fails, most of my (linux) clients come up before
> my servers, and so I have to manually reboot clients to get
> their afs going.
>
> Is there some way to handle this problem gracefully?  It seems
> that a reasonable solution would be to delay starting afs (on
> clients) until afs service (server-wise) is up and running.
> Has any one implemented such a thing?  Other ideas?

As a matter of fact, yes.  Long ago we had a machine (a sun3_35 box, I 
think) which was both a fileserver and an AFS client.  In those days, disk 
was expensive and everything was slow.  The result was that this machine 
needed to wait until its own fileserver had started before starting the 
cache manager; otherwise it would time itself out and proceed with the 
system startup process minus the contents of most of /usr (which came from 
AFS).

The result was a tool called 'fsping', which could do more or less exactly 
what you describe.  It's fairly old, but it still builds, and I've attached 
a copy.  Particularly note the '-w' switch, which will cause it to exit as 
soon as the fileserver responds.
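
As an illustration of how that could be wired into a client's startup, here is a minimal sketch that gates the AFS client on fsping's exit status (the attached source exits 0 once the server has answered and non-zero otherwise). The FSPING variable, the function name, the host afs1.example.edu, the default retry and delay values, and the init script path are all placeholders of mine, not part of fsping; only the -q, -w, -r and -d switches come from the tool.

```shell
#!/bin/sh
# Sketch: hold off starting the AFS client until a fileserver answers.
# FSPING and the function name are placeholders, not part of fsping.
FSPING=${FSPING:-fsping}

# wait_for_fileserver SERVER [TRIES] [DELAY]
# Returns fsping's exit status: 0 once the server has answered,
# non-zero if every try failed.
wait_for_fileserver() {
    "$FSPING" -q -w -r "${2:-30}" -d "${3:-10}" "$1"
}

# Example use in a startup script (commented out so the sketch is inert;
# the host name and init script path are hypothetical):
# if wait_for_fileserver afs1.example.edu; then
#     /etc/init.d/afs start
# else
#     echo "fileserver did not answer; not starting AFS" >&2
# fi
```

With the defaults above the client would wait roughly five minutes (30 tries, 10 seconds apart) before giving up.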

Note that all this really does is allow you to wait until a particular 
fileserver is up, or the retry count expires.  If that fileserver happens 
to fail to come up after the power failure, you have a problem.
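
One way to soften that, sketched here under my own assumptions, is to probe several fileservers in turn and proceed as soon as any one of them answers. The function name, the FSPING, TRIES and DELAY variables, and whatever host names you pass in are hypothetical; fsping itself only ever looks at the single server it is given.

```shell
#!/bin/sh
# Sketch: succeed as soon as any one of several fileservers answers.
# FSPING, TRIES and DELAY are placeholder settings, not part of fsping.
FSPING=${FSPING:-fsping}
TRIES=${TRIES:-6}
DELAY=${DELAY:-10}

# wait_for_any_fileserver SERVER...
# Probes each server in turn; prints the name of the first one that
# answers and returns 0, or returns 1 if none of them did.
wait_for_any_fileserver() {
    for server in "$@"; do
        if "$FSPING" -q -w -r "$TRIES" -d "$DELAY" "$server"; then
            echo "$server"
            return 0
        fi
    done
    return 1
}
```

The servers are probed one after another, so a dead server early in the list costs up to TRIES times DELAY seconds before the next one is tried.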

-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
   Sr. Research Systems Programmer
   School of Computer Science - Research Computing Facility
   Carnegie Mellon University - Pittsburgh, PA

[Attachment: fsping.c]

/* fsping - Ping an AFS fileserver
 *
 * usage: fsping [-d delay] [-r retry] [-qvw] server [port]
 *   -d   Set delay between retries (default 10 sec)
 *   -q   Set quiet mode (no results)
 *   -r   Set max number of tries (default 1)
 *   -v   Turn on fsprobe debugging output
 *   -w   Wait for server startup only
 */

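#include <sys/types.h>
#include <afs/fsprobe.h>
#include <netinet/in.h>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>   /* strtol(), exit() */
#include <strings.h>  /* bcopy() */
#include <unistd.h>   /* getopt() */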

extern struct hostent *hostutil_GetHostByName();
extern int optind, opterr;
extern char *optarg;

static int waiting = 0, quiet = 0;
static int retry = 0, ping_count = 0, ok_count = 0;
static char *server;


/* complete() - Print status and exit */
static void complete()
{
  if (!quiet)
    {
      printf("Server %s is %s.\n", server, ok_count ? "alive" : "dead");
      if (ok_count)
	printf("Server answered %d out of %d probes\n",
	       ok_count, ping_count);
    }
  fsprobe_Cleanup(1);
  rx_Finalize();
  exit (!ok_count);
}


/* usage() - print usage message and exit */
static void usage(char *progname)
{
  fprintf(stderr, "usage: %s [-d delay] [-r retry] [-qvw] server [port]\n",
	  progname);
  fprintf(stderr, "  -d   Set delay between retries (default 10 sec)\n");
  fprintf(stderr, "  -q   Set quiet mode (no results)\n");
  fprintf(stderr, "  -r   Set max number of tries (default 1)\n");
  fprintf(stderr, "  -v   Turn on fsprobe debugging output\n");
  fprintf(stderr, "  -w   Wait for server startup only\n");
  exit(-1);
}


/* fsHandler - handler routine passed to fsprobe
 * We don't actually do anything with the data returned
 * by the probe, so this doesn't have to do very much.   */
static int fsHandler(void)
{
  ping_count = fsprobe_Results.probeNum;
  if (!*fsprobe_Results.probeOK)
    {
      ok_count++;
      if (waiting) complete();
    }
  if (ping_count == retry) complete();
  return 0;
}


int main(int argc, char **argv)
{
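  int delay = 10, verbose = 0, c;
  int port = 7000;  /* AFS fileserver */
  struct sockaddr_in server_addr;
  struct timeval tv;
  struct hostent *he;
  char *x;

  /* Set the default on the file-scope 'retry' that fsHandler() reads;
   * a local 'retry' here would shadow it and the count would never expire. */
  retry = 1;
  opterr = 0;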
  while ((c = getopt(argc, argv, "d:qr:vw")) >= 0) switch (c)
    {
    case 'q': quiet   = 1; continue;
    case 'v': verbose = 1; continue;
    case 'w': waiting = 1; continue;
    case 'd':
      delay = strtol(optarg, &x, 10);
      if (x == optarg) usage(argv[0]);
      continue;
    case 'r':
      retry = strtol(optarg, &x, 10);
      if (x == optarg) usage(argv[0]);
      continue;
    default: usage(argv[0]);
    }
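  /* Accept "server" or "server port", as the usage message advertises */
  if (optind != argc - 1 && optind != argc - 2)
    usage(argv[0]);
  server = argv[optind];
  if (optind == argc - 2)
    {
      port = strtol(argv[optind + 1], &x, 10);
      if (x == argv[optind + 1]) usage(argv[0]);
    }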

  server_addr.sin_family = AF_INET;
  server_addr.sin_port   = htons(port);
  if (!(he = hostutil_GetHostByName(server)))
    {
      fprintf(stderr, "%s: Can't get host info for %s\n", argv[0], server);
      exit(-1);
    }
  bcopy(he->h_addr, &server_addr.sin_addr.s_addr, 4);
  c = fsprobe_Init(1, &server_addr, delay, fsHandler, verbose);
  if (c)
    {
      fprintf(stderr, "%s: fsprobe_Init failed (%d)\n", argv[0], c);
      fsprobe_Cleanup(1);
      exit(-1);
    }

  for (;;)
    {
      tv.tv_sec = 3600;
      tv.tv_usec = 0;
      if (IOMGR_Select(0, 0, 0, 0, &tv)) break;
    }
  complete();
}

[Attachment: fsping compilation command]

# This assumes AFS is installed in /usr/local
gcc -g -I/usr/local/include -o fsping fsping.c -L/usr/local/lib/afs \
  -L/usr/local/lib -lfsprobe -lvolser -lvldb -lubik -lauth -lcmd -lrxkad \
  -ldes -lcom_err -lkauth -lafsint -lrx -llwp -lsys /usr/local/lib/afs/util.a \
  -lresolv
