[OpenAFS-devel] Ubik threading analysis

Jeffrey Hutzelman jhutz@cmu.edu
Wed, 16 Feb 2011 19:32:27 -0500


A while ago, Jeff Altman asked me to do an analysis of the Ubik code with 
repsect to threading, to try to determine how ready the code is to be moved 
from an LWP-based environment to one using preemptive POSIX threads, and 
what it will take to get it the rest of the way there.  Now that the work 
is complete, I'm posting the complete analysis and my recommendations for 
discussion and as a backdrop for changes I expect to see proposed in the 
near future.

This work was funded by Your File System, Inc.

-- Jeff



INTRODUCTION

   This document describes the structure of Ubik, with an eye toward
   thread-safety and concurrency.  It begins by discussing the elements,
   usage, and locking considerations of major shared data structures.  This
   is followed by a discussion of each major subsystem.  The emphasis is on
   deficiencies in thread-safety and the changes needed to correct them.

   Most of this document focuses on the user-mode Ubik server code, which
   comprises the bulk of the Ubik library code and also of the complexity.
   A separate section near the end is dedicated to Ubik client code, which
   itself is intended primarily for use in user-mode applications.  The
   OpenAFS cache manager includes its own mechanisms for accessing
   replicated services, including the Ubik-managed VLDB, and for tracking
   the servers that provide them.

   Occasionally, issues related to concurrency, performance, and modularity
   (particularly as regards support for supporting multiple independent
   databases in a single server process) are also discussed; however, these
   discussions are peripheral and are included only to record data obtained
   while examining the question of thread-safety and as a basis for future
   work.  Similarly, throughout this document are detailed discussions of
   some parts of the code and certain data structures which, while not
   always directly relevant to thread-safety, were produced in the course
   of analyzing the Ubik library code.  These are included to provide
   additional background and as a basis for future work in documentation of
   Ubik library internals and interfaces.


MAJOR DATA STRUCTURES

   This section calls out the major shared data structures used in the Ubik
   server code and details what they are used for and how shared access to
   them is protected.  Much of this text is probably suitable as a starting
   point for expanding on the documentation found in comments in ubik.p.h.

   struct ubik_dbase - database

      This structure contains nearly everything Ubik knows about an open
      database (for an explanation of "nearly", see the section below on
      globals), including database-wide parameters and state and a list of
      active transactions.

      The following structure elements are set only when the database
      structure is being initialized, after which they are never modified:

         + pathName - The base path used to construct database filenames
         + physical later method pointers (read, write, truncate, sync,
           stat, open, setlabel, getlabel, getnfiles), used to manipulate
           the physical database files on disk.  These currently always
           point to the methods defined in phys.c, and in fact other
           modules contain knowledge of the internal implementation of that
           module.

      The primary synchronization mechanism used to protect this structure
      is the database lock (versionLock), which is manipulated via the
      DBHOLD() and DBRELE() macros.  This lock protects the following
      structure elements, as well as a number of globals as described
      elsewhere in this document:

         + activeTrans - list of active transactions
         + version - database version (*)
         + tidCounter - transaction ID of latest transaction (*)
         + writeTidCounter - transaction ID of latest write transaction (*)
         + flags (*)
         + readers

      (*) These fields are currently protected by the database lock, except
      that the beacon thread does not hold that lock when examining them,
      or the global 'ubik_epochTime', which holds the corresponding epoch.
      The recommendation is for these fields, along with ubik_epochTime, to
      be additionally protected by a second lock, allowing the beacon
      thread to examine them without holding the database lock; see the
      section on the BEACON package for details.

      The condition variable 'version_cond' is used to signal to that the
      database version may have changed; it is broadcast in udisk_commit(),
      in SDISK_SendFile(), and from the recovery thread; it is monitored by
      ubik_WaitVersion(), which can be called by an application to wait for
      a database version change (this is not currently used in OpenAFS).
      This CV is associated with the database lock.  When LWP is used, this
      condition is signalled on &ubik_dbase->version.

      The condition variable 'flags_cond' is used by udisk_end() to signal
      that DBWRITING flag has been cleared.  This wakes threads waiting in
      ubik.c:BeginTrans() to begin a new transaction.  This CV is
      associated with the database lock.  When LWP is used, this condition
      is signalled on &ubik_dbase->flags.

      When LWP is used, signals are also sometimes sent on &urecovery_state
      and on &ubik_amSyncSite.  However, nothing ever waits on these, and
      no corresponding condition variables exist when pthreads is used.

      The 'cachedVersion' field holds the database version represented by
      the application's cache; this is compared against the current
      database version to discover if the cache is up to date.  Both
      cachedVersion and the application cache itself are protected by
      cache_lock, an AFS reader/writer lock.

      When the application wishes to use the cache, it calls
      ubik_CacheUpdate(), which verifies that the cache is current,
      updating it if necessary (under a write lock), and then returns with
      the application cache read-locked.  The application can then rely on
      the cache's contents not to change until it ends the transaction by
      calling ubik_EndTrans() or ubik_AbortTrans().  When one of these
      functions is called, the read lock is dropped, and a write lock may
      be acquired temporarily to update both cachedVersion and the cache
      (in the case of committing a write transaction) or to clear
      cachedVersion, indicating the cache is invalid (in the case of
      aborting a transaction).

      In the case of ubik_EndTrans() on a write transaction, the cache is
      held write-locked until udisk_commit() has completed; this is also
      done in SDISK_Commit(), which insures that the contents of the
      on-disk database and page cache cannot change while an active
      transaction holds the cache lock.  Note that both the recovery thread
      and SDISK_SendFile() may replace the database contents wholesale;
      however, these operations first abort all outstanding transactions.

      Prior to the introduction of ubik_BeginTransReadAnyWrite(), an
      application could choose to manage its own cache, updating it during
      write transactions and when, at the beginning of a read transaction,
      it was discovered to be out of date (due to a write done by another
      server).  Under this model, the application uses Ubik transaction
      locks to protect not only the database but also its cache.  However,
      the use of ubik_BeginTransReadAnyWrite() allows some transactions to
      proceed and read the database without acquiring any locks, which
      makes it unsafe to depend on Ubik transaction locks to protect the
      application cache.  Applications which make use of this mode must set
      ubik_SyncWriterCacheProc to point to a procedure to be called to
      update the cache during ubik_EndTrans(), after udisk_commit() has
      succeeded and before the cache_lock is released.  These applications
      must then refrain from ever updating the cache other than during that
      procedure or during the callback passed to ubik_CheckCache().


   struct ubik_trans - active transaction

      This structure represents the state of an active transaction,
      including transaction metadata and state and a list of all changes
      made as part of the transaction.

      Transactions are organized into a linked list referenced by the
      corresponding database structure, protected by the database lock.
      Each transaction contains an upward pointer to the database,
      a transaction type, and a transaction ID, all of which are immutable
      for as long as the transaction exists.

      A transaction created via the REMOTE interface exists until ended by
      a call to SDISK_Abort() or SDISK_ReleaseLocks() or aborted in
      urecovery_CheckTid() due to a transaction ID mismatch or creation of
      a new transaction (note that in this last case, cleanup will be
      deferred if SDISK_Lock() is blocked on this transaction).  In these
      cases, ubik_currentTrans is cleared at the same time, under the
      database lock, and further REMOTE calls will fail gracefully until
      a new transaction is started.  Thus, REMOTE transactions may be
      considered valid only for as long as the database lock is held.

      Transactions created via ubik_BeginTrans(), ubik_BeginTransReadAny(),
      or ubik_BeginTransReadAnyWrite() exist until ended by a call to
      ubik_AbortTrans() or ubik_EndTrans().  They are never destroyed from
      within the UBIK package.  Code which discovers a transaction by
      traversing the database's active transaction list may access the
      transaction only as long as the database lock is held, since once it
      is released, an application thread may end the transaction.

      Each transaction contains the following fields, which are protected
      by the database lock:

         + locktype - type of lock held by this transaction
         + minCommitTime - unused
         + seekFile - file number of database file pointer
         + seekPos - position of database file pointer
         + flags

      File writes are represented by 'iovec_info', an XDR vector structure
      (type iovec_wrt, a vector of struct ubik_iovec) containing the file,
      position, and length of each write, and 'iovec_data', an XDR opaque
      containing the corresponding data.  These vectors and their contents
      are protected by the database lock.

      File truncations are represented by a linked list of struct
      ubik_trunc, each of which contains a file number and the new size of
      that file.  This list, and the records contained in it, are protected
      by the database lock.

      While transactions are currently protected by the database lock, it
      may be desirable in the future to introduce a per-transaction lock,
      in order to facilitate increased concurrency.


   struct ubik_server - server status

      This structure is used to track the set of peer servers, including
      state related to elections and recovery.  The global 'ubik_servers'
      points to a linked list of these structures, which is constructed
      during initialization.  Once the list has been constructed, no
      entries are ever added or removed; in addition, the following fields
      are not changed after startup:

         + addr[0] - server's primary address
         + magic - true if this is the magic server
         + isClone - true if this server is a clone

      The following fields are used by the BEACON package to track beacons
      to this server and their results.  Currently, they are sometimes
      accessed under the database lock, and sometimes without benefit of
      any lock.  They should probably be protected by the same lock as
      globals private to the BEACON package.

         + lastVoteTime - last time this server voted yes for us
         + lastBeaconSent - last time a beacon was sent to this server
         + lastVote - true if its late vote response was yes
         + up - true if this server is believed to be up
         + beaconSinceDown - true if a beacon has successfully been sent to
           this server since the last time it was down

      The following fields are used by the RECOVERY package to track the
      state of each server.  They are also examined and modified by the
      various ContactQuorum_* functions.  Currently, they are sometimes
      accessed under the database lock, and sometimes without benefit of
      any lock.  They should be fully protected by the database lock.

         + version - server's database version
         + currentDB - true if server's database is up-to-date

      The addr[] array is used to track all known addresses for the remote
      server; this list is used when attempting to contact a down server
      via an alternate address, and to identify the source of incoming
      server-to-server RPCs.  The first element of this array is always the
      primary address and does not change after initialization; other
      elements are updated during initialization and when a remote server
      calls DISK_UpdateInterfaceAddr().

      The system normally maintains an active Rx connection to the VOTE and
      DISK services on each server; pointers to these are kept in the
      'vote_rxcid' and 'disk_rxcid' fields.

      Currently, the addr[] array, 'vote_rxcid', and 'disk_rxcid' are not
      protected by any lock.  Because these are shared among a number of
      subsystems and it is usually desirable to avoid blocking on the
      database lock when making RPCs, it is recommended to introduce a new
      lock (the "server address lock") to protect these fields and the
      globals 'ubikSecClass' and 'ubikSecIndex', which contain the security
      classes used for setting up such connections.  The new lock would
      need to be held in the following cases:
      + Before beginning any RPC using 'vote_rxcid' or 'disk_rxcid', for
        the time it takes to call rx_GetConnection() or rx_NewCall()
      + For the duration of ubikGetPrimaryInterfaceAddr()
      + In ubeacon_Interact(), when building the set of connections to be
        used for multi_VOTE_Beacon().
      + In updateUbikNetworkAddress(), first while constructing the set of
        connections to be used for multi_DISK_UpdateInterfaceAddr(), and
        then again each time a server's list of addresses is updated.
        However, the lock must NOT be held while waiting for the RPC to
        complete, to avoid deadlock when multiple servers are starting at
        the same time.
      + For the duration of SDISK_UpdateInterfaceAddr()
      + For the duration of ubeacon_ReinitServer(), with the optional
        exception of the call to afsconf_UpToDate().
      + In src/ubik/recovery.c:DoProbe(), first while constructing the set
        of connections to be used for multi_DISK_Probe() and, if the probe
        is successful, again while destroying the old connections for the
        server under test and replacing them with new ones.  See the
        section on the RECOVERY package for additional notes.


GLOBAL VARIABLES

   The library has an unfortunate number of global variables which are not
   part of any data structure.  Nonetheless, some of these are
   database-specific; see MULTI-DATABASE SUPPORT, below.

   Globals private to a particular subsystem are discussed in the
   description of that subsystem.  This section discusses global variables
   which are shared and/or part of external interfaces.

   The following globals are used to convey configuration parameters from
   the application to the Ubik library, and are intended to be set before
   the first call to ubik_ServerInit().  They are not modified by Ubik, and
   are not protected by locks:

      + ubik_debugFlag - print debug messages (unused)
      + ubikPrimaryAddrOnly - use only primary interface
      + ubik_nBuffers - number of DISK package page cache buffers
      + ubik_SRXSecurityProc - callback to create server security class
      + ubik_SRXSecurityRock - argument for ubik_SRXSecurityProc
      + ubik_CRXSecurityProc - callback to create client security class
      + ubik_CRXSecurityRock - argument for ubik_CRXSecurityProc
      + ubik_SyncWriterCacheProc - callback to update application cache
      + ubik_CheckRXSecurityProc - callback to check if caller is OK
      + ubik_CheckRXSecurityRock - argument for ubik_CheckRXSecurityProc

   The following globals contain values which are computed during Ubik
   initialization and not modified thereafter.  They are not protected by
   any lock, which is safe so long as suitable memory barriers exist across
   creation of threads and Rx services (see the notes on initialization in
   the UBIK package section).  Note that some of these values are
   database-specific and properly belong in the ubik_dbase structure:

      + ubik_servers - linked list of servers (see above)
      + amIClone - true if this server is a clone
      + ubik_callPortal - UDP port used for server-to-server RPCs
      + ubik_quorum - number of servers to make a quorum
      + ubik_dbase - pointer to the (only) database (see above)
      + ubik_host[] - array of local addresses
      + ubik_sc[] - security classes for Ubik RPC services

   The following globals are currently protected by the database lock,
   though in most cases there are existing references which do not hold the
   lock.  Later sections of this document contain recommendations in each
   case to establish new locks to protect these or to correct cases where
   they are accessed without the correct lock.

      + ubik_currentTrans - pointer to REMOTE's current transaction
      + ubik_epochTime - epoch for transaction IDs
      + urecovery_state - recovery state bits
      + ubik_dbVersion - sync site's database version
      + ubik_dbTid - sync site's transaction ID
      + ubikSecIndex - security class index for making Ubik RPCs
      + ubikSecClass - security class for making Ubik RPCs

   The global 'ubik_amSyncSite' belongs to the BEACON package, and should
   be protected by the same lock as that package's private globals.  In
   fact, it should probably be made private; the only references outside
   BEACON are in SVOTE_Debug() and SVOTE_DebugOld() and can easily be moved
   into ubeacon_Debug().


MAJOR SUBSYSTEMS

   This section describes each of the major Ubik server subsystems.  For
   each subsystem, a brief overview is provided, along with a description
   of that subsystem's handling of synchronization (or lack thereof).
   Subsystems are described beginning with the lowest layer and working
   toward the highest.

   PHYS - Physical I/O

      This package is responsible for handling I/O to database files.  It
      maintains a private, global array of MAXFDCACHE(4) fdcache
      structures, each representing an open file descriptor on a database
      file, either presently in use or idle and available for use.  Entries
      are indexed by database file ID, and each contain a file descriptor
      number and a refcount.  There is also a global 'inited' flag, and
      a buffer used to construct pathnames when opening database files.

      The static function uphys_open() takes a database pointer and fileid
      and attempts to open the corresponding file.  On success, it returns
      a file descriptor; on failure, it returns the same as open(2) (with
      errno set).  This prefers to reuse an already-open fd whose cache
      refcount is 0; failing that, it opens a new fd and attempts to enter
      it in the cache.  It prefers first to use an unused cache entry, then
      to reclaim an idle one whose refcount is 0; if neither is possible,
      the newly-opened fd is returned without caching.  Note that the
      refcount on an active entry is always exactly 1; the same entry
      cannot be referenced more than once.

      The function uphys_close() releases a file descriptor opened by
      uphys_open().  If the file descriptor was cached, the cache entry is
      dereferenced but the file is left open for future use; 0 is returned.
      If the file descriptor is not found in the cache, then the return
      value is that of an immediate call to close(2).  This function also
      handles cleanup when a cached fd is closed after the file to which it
      refers is invalidated by uphys_invalidate().  This function is not
      static, but may as well be -- it is used only by other functions in
      the PHYS package.

      Functions in this package are called via method pointers in the
      database structure, which are set during initialization and not
      changed thereafter.  These functions must be called with the database
      lock held; this protects the file descriptor cache and other globals
      mentioned above, and prevents conflicting access to database files.

      The requirement that functions in this package be called with the
      database lock held means that concurrent access to database files is
      not possible.  This situation could be improved by the addition of
      a new global lock, which would be held for most of the duration of
      uphys_open(), uphys_close(), and uphys_invalidate(), but need _not_
      be held during actual file accesses.  This would eliminate the
      requirement that calls to this package be made with the database lock
      held, allowing multiple calls to be active concurrently.  However, it
      would still be necessary for higher layers to employ appropriate
      synchronization as needed to maintain database consistency.


   DISK - Logical Database I/O and Transaction Management

      This package is responsible for logical database file I/O, logging,
      and transaction management.  It maintains a private cache of database
      file page buffers.  This entire cache, including buffer space, is
      allocated and initialized the first time a transaction is started;
      the global initd flag keeps track of whether or not this has been
      done.  The size of this cache defaults to 20 pages, but may be
      overridden by setting the global 'ubik_nBuffers' (all OpenAFS Ubik
      services do this); this global is not protected by any lock, but is
      read exactly once, when the cache is initialized.

      Except for udisk_Debug() (see below), all udisk_* interfaces require
      a pointer either to the database or to an active transaction, and
      must be called with the database lock held.  This protects the DISK
      package's accesses of the active transaction list and associated
      state variables as well as its calls into the physical access layer,
      including insuring that operations requiring multiple calls into that
      layer are performed atomically.

      In addition, the database lock protects the page cache hash table and
      LRU queue, the buffer control structures, buffer contents, the
      truncation record free list (freeTruncList), and global statistics.
      However, it may be desirable in the future to introduce a separate
      lock to protect these structures, in order to facilitate multiple
      database support.

      udisk_LogOpcode, udisk_LogEnd, udisk_LogTruncate, udisk_LogWriteData

         These are internal interfaces for writing to the transaction log.
         They each construct a log record and then directly call the
         physical layer to append it to the log; the log file is _not_
         cached by the page buffer cache, and these functions do not touch
         the cache.  Each of these must be atomic with respect to other
         writes to the log file, which is accomplished by insuring they are
         called only with the database locked.

      DInit, DRead, DTrunc, DRelease, DFlush, DAbort, DSync, DNew

         These are internal interfaces for accessing the page buffer cache.
         They must be called with the database lock held.

         DRead() and DNew() return a pointer to the page data buffer, not
         to the buffer control structure; this pointer is later passed to
         DRelease() to release the buffer.  This mechanism depends on the
         fact that buffer control structures are allocated in the same
         order as the actual page data buffers.

         DRead() guarantees that a read transaction will always see a clean
         copy of the requested page, without any modifications made by an
         in-progress write transaction.  However, this is true only if the
         next layer insures that once DRead() is called with write intent
         on a page, that no further calls to DRead() are made for that page
         until the buffer is written and DRelease() is called with the
         dirty flag.  Presently, this is the case because udisk_read() and
         udisk_write(), which are the only callers of DRead(), are both
         called with the database lock held.

      udisk_read, udisk_truncate, udisk_write, udisk_begin, udisk_commit,
      udisk_abort, udisk_end, udisk_Invalidate

         These are external interfaces provided by the DISK package to
         higher layers.  Presently, the locking requirements of these
         functions and the lower-layer functions they call are satisfied by
         the simple rule that these must always be called with the database
         lock.  In the future, support for multiple databases and/or
         a greater level of concurrency may require relaxing this rule
         and/or acquiring additional locks within these functions.

         In udisk_commit(), when committing the first write transaction
         after becoming sync site, the database is relabelled.  In this
         case, the UBIK_RECLABELDB recovery state bit should be set before
         propagating the label change to other servers, rather than after.
         This is because because ContactQuorum_DISK_Setversion() will
         release the database lock, at which point the recovery state may
         be cleared by urecovery_ResetState() running in another thread.
         It is important that a relabelling which occurs before recovery
         state is cleared not result in the UBIK_RECLABELDB recovery state
         bit being set after; otherwise, the server may fail to correctly
         relabel the database the next time it becomes sync site.

      udisk_Debug

         The udisk_Debug() interface is called as part of preparing the
         response to Ubik debugging requests.  Unlike other udisk_* APIs,
         this interface is not called on a specific database and does not
         require the database lock to be held.  It reports the version data
         of the single database identified by the global 'ubik_dbase' and
         traverses the buffer cache collecting statistics, both without
         benefit of holding any locks.  This may produce inconsistent data
         but does not normally risk damaging data structures or accessing
         invalid or uninitialized memory (but see below), and avoiding
         locking may be of greater benefit than guaranteeing the return of
         consistent data.

         The global 'ubik_dbase' pointer is set when the first (only)
         database is initialized, and the buffer cache data structures are
         allocated the first time a transaction is started, either locally
         or via the REMOTE package.  Once set, the pointers used by
         udisk_Debug() are never changed, and the data structures they
         point to remain valid and are never freed.  The buffer cache is
         traversed by iterating over the array of buffer cache elements
         (this is done in other places as well), not by following a linked
         list or other data structure that may reordered, and so should not
         be dangerous.

         At a minimum, then, it should be arranged that udisk_Debug()
         cannot be called before the database structure and page cache have
         been initialized.  This means that both of these actions must be
         taken before any incoming RPCs are accepted.  Practically, this
         means that DInit() should be renamed udisk_init(), and called from
         ubik_ServerInitCommon() during initialization.

         Full correctness demands that udisk_Debug actually hold the
         database lock while copying the database version and examining the
         page cache.  However, it may be desirable to sacrifice this
         correctness in order to reduce interaction between the debugging
         interfaces and the rest of the system.

      Helper functions:
         + unthread - remove transaction from activeTrans
         + Dlru, Dmru - move page to front/end of LRUQ
         + MatchBuffer - check if a buffer's dbase/file/page match
         + FixupBucket - move page to the right bucket
         + newslot - find and allocate an unused buffer
         + DedupBuffer - throw away stale copies after page is flushed
         + GetTrunc, PutTrunc - trunc record allocation
         + FindTrunc - find active trunc record for file
         + DoTruncs - apply truncations


   LOCK - Database Transaction Locking

      This package is responsible for managing database locks held as part
      of a transaction.  It supports a single read/write lock, which is
      represented by the global variable 'rwlock', a struct Lock.  The
      global 'rwlockinit' records whether Lock_Init() has been called on
      this lock, and is protected by the database lock.  Properly, this
      lock should be treated as belonging to the database.

      ulock_getLock() expects to be called with the database lock held; it
      depends on this to protect access to fields within the transaction
      structure as well as to the global 'rwlockinit'.  However, note that
      the database lock must be temporarily released while obtaining the
      read/write lock, and so ulock_getLock() should not be called in the
      middle of a critical section.  The 'await' parameter may in theory be
      set to 0 to effect a non-blocking lock; however, this depends on
      (incorrect) knowledge of the internals of struct Lock.  Fortunately,
      existing Ubik code always sets await=1.  This feature should be fixed
      or removed.

      ulock_relLock() releases the read/write lock.  It expects to be
      called with the database lock held.  Unlock ulock_getLock(), this
      function does not release the database lock.

      ulock_Debug() reports the state of the read/write lock, for debugging
      purposes.  It depends on knowledge of the internals of struct Lock,
      and accesses both the Lock structure and the global rwlockinit
      without benefit of holding any lock.

      It would probably be best to eliminate 'rwlockinit' in favor of
      always initializing the global 'rwlock' during startup; ulock_Debug()
      would then have no need to examine the flag.  Further, ulock_Debug()
      should at a minimum use the appropriate locks when looking inside
      struct Lock; ideally, knowledge of that structure's internals should
      be abstracted into appropriate interfaces in src/lwp/lock.[ch].


   VOTE - Voting Logic and VOTE_* RPCs

      This package is responsible for keeping track of who is currently
      sync site and deciding how to vote and whether this server should try
      to become sync site.  It provides the VOTE RPC package, which is
      primarily responsible for processing vote requests from peer servers
      (VOTE_Beacon), but also provides debugging interfaces and an
      interface (VOTE_GetSyncSite) to allow clients to discover the current
      sync site.

      Voting decisions are made using state tracked in a set of global
      variables, listed below.  Currently there is no locking protecting
      these variables; LWP-based Ubik depends on the cooperative nature of
      LWP and on the fact that routines which examine and manipulate them
      do not block.  Introduction of a new "vote lock" is recommended to
      protect the following globals:

      + ubik_lastYesTime - last time VOTE_Beacon voted "yes"
      + lastYesHost - host voted for at ubik_lastYesTime
      + lastYesClaim - time lastYesHost started its voting cycle
      + lastYesState - whether lastYesHost claimed to be sync site
      + lowestHost - lowest host seen
      + lowestTime - last time lowestHost was updated/heard from
      + syncHost - sync host, or 0 if none known
      + syncTime - last time syncHost was updated
      + ubik_dbVersion - sync site's database version
      + ubik_dbTid - sync site's transaction ID

      The vote lock should be initialized in uvote_Init() and held for the
      remainder of that function and also during of uvote_ShouldIRun(),
      uvote_GetSyncSite().  It should also be held for the bulk of
      SVOTE_Beacon() -- specifically, it should be acquired just before
      beginning the lowest-host calculation, and released just before
      calling urecovery_CheckTid(), in the event of a yes vote or else just
      before returning.

      In addition, the vote lock would protect the globals 'ubik_dbVersion'
      and 'ubik_dbTid', which represent this server's knowledge of the
      state of the database on the sync site.  This variable is set in two
      places in the REMOTE package, which change the local and sync site
      database versions at the same time.  It is also examined in one place
      in the REMOTE package, and compared to the database version in the
      RECOVERY package.  It is imperative that when the local and remote
      versions are changed at the same time, these changes become visible
      to the RECOVERY package at the same time.  Thus, both the changes and
      the comparison must be done under both locks.  For this purpose,
      three new functions should be provided by the VOTE package:

      + A function to safely set 'ubik_dbVersion'
      + A function to compare 'ubik_dbVersion' to another value
      + A function to check whether this is sync site and compare
        'ubik_dbVersion' to another value, without releasing the vote lock
        in between (to be used in recovery).

      Existing accesses to 'ubik_dbVersion' in REMOTE and RECOVERY, all of
      which are already made under the database lock, would be replaced
      with calls to the new accessors.

      The result of these changes will require in several places that the
      vote lock be acquired while the database lock is already held.
      Therefore, the vote lock MUST NOT be held while acquiring the
      database lock.

      SVOTE_Beacon() currently calls urecovery_CheckTid() without benefit
      of the database lock, which must be held when calling that function.
      This is fairly simple to fix; however, the vote lock must be released
      first.  One simple approach would be to move the call to
      urecovery_CheckTid() to after the point where the vote lock will be
      released, wrapping it in DBHOLD/DBRELE, and conditionalizing the
      whole thing on 'vote' being nonzero.

      Separately, it may be worth examining whether it is appropriate to
      call urecovery_CheckTid() only when voting yes, and not whenever
      a beacon is received from a server claiming to be sync site.

      As with the debug functions in other packages, SVOTE_Debug() and
      SVOTE_DebugOld() access global variables belong to this and other
      modules without benefit of the appropriate locks.  There is also very
      little abstraction here.  It is probably desirable to introduce Debug
      interfaces for the VOTE and RECOVERY packages, to be called from
      these functions.  Full correctness requires...

      + Holding the vote lock while accessing the VOTE package globals
        described above.
      + Holding the database lock while accessing the database flags and
        tidCounter, ubik_epochTime, ubik_currentTrans, and urecovery_state


   BEACON - elections, tracking whether this is sync site

      This package is responsible for running elections, collecting votes,
      and deciding whether and for how long this is the sync site.  It also
      contains routines responsible for setting up the list of servers and
      acquiring alternate addresses for other servers during startup.

      For this purpose, it maintains a number of global variables and
      ubik_server structure fields.  Currently, these variables and fields
      are sometimes accessed under the database lock, and sometimes without
      benefit of any lock.  In order to allow this package to operate with
      a minimum of disruption due to database accesses, it is recommended
      that a new lock be established to protect these.  It is particularly
      desirable to prevent loss of sync due to the potentially long-lived
      RPCs the RECOVERY package must make under the database lock when
      transferring full database copies between servers; for this reason,
      it is important that both the BEACON thread and SVOTE_Beacon() be
      able to operate without acquiring the database lock.  The global
      variables and ubik_server structure fields to be protected by the new
      lock are the following:

      + ubik_amSyncSite - true if this is the sync site
      + syncSiteUntil - time until which this is the sync site
      + ts->lastVoteTime - last time this server voted yes for us
      + ts->lastBeaconSent - last time a beacon was sent to this server
      + ts->lastVote - true if its late vote response was yes
      + ts->up - true if this server is believed to be up
      + ts->beaconSinceDown - true if a beacon has successfully been sent
        to this server since the last time it was down

      The new lock would need to be held in the following cases:
      + In ubeacon_AmSyncSite(), from the first time 'ubik_amSyncSite' is
        examined until after the potential call to urecovery_ResetState().
      + In ubeacon_Interact(), when building the list of servers to which
        to send beacons (to protect access to the 'up' flag).  Note that
        the server address lock must also be held during this loop, and
        must be acquired _after_ the beacon lock.
      + In ubeacon_Interact(), when updating status associated with
        a server after receiving its beacon response.
      + In ubeacon_Interact(), when updating globals after an election.
        However, it must be released before calling urecovery_ResetState().
      + In updateUbikNetworkAddress(), when marking a server down.
      + In the various ContactQuorum_* functions, when checking whether
        a server is up before calling it, and again when marking a server
        down after a call fails.
      + In ubik_EndTrans, when examining a server's 'beaconSinceDown' flag
        and 'lastBeaconSent' timestamp to determine whether it has recently
        gone down, requiring a brief hold until it either comes back up or
        times out.

      The following global variables are set during initialization and do
      not change thereafter:

      + nServers - number of voting servers
      + amIMagic - true if this server gets the extra half vote
      + ubik_singleServer - true if this is the only server

      Currently, all initialization of this module is done _after_ the Rx
      services have been created and an Rx server thread started.  This is
      done to avoid a situation in which all servers are blocked in
      updateUbikNetworkAddress(), which is responsible for exchanging
      addresses with peer servers, while none of them is prepared to
      handled the DISK_UpdateInterfaceAddr call made by that function.

      When using LWP, this early initialization of Rx is not a problem,
      because the various initialization routines don't actually block until
      updateUbikNetworkAddress() waits for RPCs to complete.  However, when
      using pthreads, it is a problem, because it means that incoming RPCs
      may be processed when initialization is not complete.

      To address this problem and insure orderly initialization, it is
      recommended to change the initialization sequence so that most work,
      including the bulk of BEACON package initialization, happens before
      Ubik's Rx services have been created.  The only work that should be
      deferred until after that point is updateUbikNetworkAddress(),
      followed by starting the long-running threads belonging to the BEACON
      and RECOVERY packages.  This requires updateUbikNetworkAddress() to
      be exported (and perhaps to undergo a name change).

      The major part of the work of this package is done in the function
      ubeacon_Interact(), which is the body of a long-running thread known
      as the beacon thread.  This thread is responsible for periodically
      sending "beacons" to other servers to request votes (when the sending
      server is the sync site, these also include how long it will remain
      so, the database version, and the active transaction ID).  Beacons
      are sent only from servers that are already sync site or believe they
      are the best candidate, as determined by uvote_ShouldIRun().

      On the sync site, the beacon thread is responsible for insuring that
      new votes are collected frequently enough to avoid loss of quorum.
      This means it must not block for an extended time; therefore, it must
      not acquire the database lock, which in other threads may sometimes
      be held during long-running operations (most notably, database file
      transfers done under this lock by the recovery thread).  Instead,
      a new lock is proposed (the "version lock"), which is used to allow
      the beacon thread to examine (but not modify) certain fields without
      holding the database lock.

      The version lock should be acquired _in addition to_ the database
      lock when modifying 'ubik_epochTime' or the database 'version',
      'flags', 'tidCounter', and 'writeTidCounter' fields; such
      modifications occur at the following locations:
      + in ubik.c:BeginTrans(), when beginning a new transaction
      + in udisk_begin(), when setting DBWRITING
      + in udisk_commit(), when updating the database version at the end of
        a write transaction (note this includes the case of relabelling the
        database on the first write transaction after becoming sync site)
      + in udisk_end(), when clearing DBWRITING
      + in recovery.c:InitializeDB(), when setting the initial version of
        an empty database
      + in urecovery_Interact(), when labelling a new database the first
        time quorum is established
      + in urecovery_Interact(), after fetching the latest database from
        another server (whether successful or not; there are two cases)
      + in SDISK_SendFile(), when updating the database version both before
        and after receiving a new database from the sync site.  Note the
        lock must _not_ be held during the transfer.
      + in SDISK_SetVersion()

      The version lock should be acquired by the beacon thread while
      examining these fields, shortly before calling multi_VOTE_Beacon().
      Since it must release the lock before making that call, the database
      version will need to be explicitly copied into a local variable, as
      is already done with the current transaction ID.

      Note that if UBIK_PAUSE is defined, the DBVOTING flag poses an
      additional problem, as it must be modified by the beacon thread.  If
      it is desirable to continue to support this variant, then it becomes
      necessary for all accesses to the database flags to be protected by
      the version lock.  Other UBIK_PAUSE code may not have been reviewed
      in depth and may pose additional problems.

      The function ubeacon_AmSyncSite() is called both from the beacon
      thread and from various other points to determine whether they are
      running on the sync site, and thus whether various operations are
      safe and appropriate.  Calls to this function from outside the beacon
      thread (often, via urecovery_AllBetter()) must be made with the
      database lock held, and for operations modifying the database or
      relying on its contents to remain constant, this lock must not then
      be released until the operation is complete.  The does not guarantee
      that this will remain sync site for the duration of the operation,
      but does insure that any operations done after or as a result of
      losing sync will in fact occur after the operation holding the lock.

      Of course, ubeacon_Interact() itself cannot call ubeacon_AmSyncSite()
      with the database lock held, since it must not block on that lock
      when running on the sync site (otherwise sync could be lost, as
      described above).  So, that portion of ubeacon_AmSyncSite() other
      than the call to urecovery_ResetState() should be abstracted out into
      a new function, which presumably returns separate flags indicating
      both whether it is in fact running on the sync site (which becomes
      the return value of ubeacon_AmSyncSite()) and whether a call to
      urecovery_ResetState() is indicated.  When the latter is set, the
      beacon thread can acquire the database lock (since, if this is not
      the sync site, blocking on this lock is not a problem) and call
      urecovery_ResetState().

      This difference in behavior is acceptable because the required
      invariant is that when 'ubik_amSyncSite' is cleared, it cannot become
      true again until urecovery_state has also cleared.  This insures that
      once a server discovers it is not sync site, urecovery_AllBetter()
      cannot succeed on the basis of being sync site until sync is regained
      and recovery has run, insuring the database is up to date.  Since
      only the beacon thread ever sets 'ubik_amSyncSite' true, that thread
      can meet the invariant by calling urecovery_ResetState() before doing
      anything else, while other threads must continue to hold the beacon
      lock to prevent the beacon thread from setting 'ubik_amSyncSite'.

      ubeacon_Debug() reports the value of syncSiteUntil, without benefit
      of obtaining any locks.  It should also report the value of
      'ubik_amSyncSite', allowing that global to be made private to the
      BEACON package.  Full correctness requires that the beacon lock be
      held while accessing these globals.


   RECOVERY - Consistency, Down Server Recovery, Propagation, Log Replay

      This package is responsible for tracking whether the local database
      is current and usable and for discovering when a server comes back up
      after having been marked down.  On the sync site, it is also
      responsible for tracking the database versions on remote servers and
      for recovering from inconsistencies.

      To carry out these tasks, the RECOVERY package maintains a single
      global, 'urecovery_state'.  This is not currently effectively
      protected by any lock; however, it should be protected by the
      database lock.  This is particularly important because high-level
      routines that manipulate the database depend on a check against this
      field to determine that the database is up-to-date and safe to
      modify, and that it will remain so until the operation is complete.

      The main work of this package is done by urecovery_Interact(), which
      is the body of a long-running thread known as the recovery thread.
      This thread wakes up periodically to discover if any down servers
      have come back up and, on the sync site, to locate the latest
      available database, fetch it if necessary, and make sure it has been
      distributed to all servers.  This work is done in several steps,
      taken on in sequence every few seconds:

      + The first part of urecovery_Interact() is to identify servers which
        are believed to be down, and determine whether any of them have
        come back up.  This is done every 30 seconds, even when this is not
        the sync site.  Currently, the database lock is held for the
        duration of this loop, released only in DoProbe() while actually
        waiting for the DISK_Probe() RPC to complete.  This is mostly
        unnecessary; the server probe loop needs this lock only to examine
        the database 'currentDB' flag and to clear the UBIK_RECFOUNDDB bit,
        and DoProbe() doesn't need it at all.  However, the beacon lock is
        needed to examine and update the database 'up' flag, and DoProbe()
        needs to hold the server address lock when examining server
        addresses and, if the server comes back up, to replace connections.
        See the section on struct ubik_server for more details on this
        lock.

      + The next step is to determine whether this is the sync site, and
        update the UBIK_RECSYNCSITE bit appropriately.  The database lock
        should be held for the duration of this step, as
        ubeacon_AmSyncSite() may end up calling urecovery_ResetState(),
        which requires that lock be held.  If this is not the sync site,
        the cycle ends here.

      + Step three is to determine the location of the latest database.
        This is done by making DISK_GetVersion() calls to all active
        servers, keeping track of the newest version found and the server
        which reported it.  Once this is done, the best version is compared
        to the local version, update the UBIK_RECHAVEDB bit (so that it is
        true if and only if the local version is the latest) set the
        UBIK_RECFOUNDDB bit to indicate the latest database has been
        located, and clear the UBIK_RECSENTDB bit to force any servers with
        out-of-date databases to be updated.

        In this section, it is not necessary to hold any locks while
        collecting versions on remote servers, though as before, it is
        necessary to hold the beacon lock while examining the server 'up'
        flag, and the address lock to obtain a reference on each connection
        as it is used (but this lock should not be held during the RPC!).
        Once all remote versions have been fetched, the database lock is
        needed while examining the local version and updating
        'urecovery_state'.  Note that write transactions may happen while
        versions are collected, resulting in some servers having newer
        versions than were detected.  However, if this happens, the result
        is always consistent and ends with the sync site having the latest,
        canonical version:

        + If the actual latest version X.N before the write was in an epoch
          created by the current sync site, then that version must
          necessarily already be present on the sync site, since writes are
          always committed first on the sync site.  Thus, when the sync
          site updates the version counter, it produces a new version X.N+1
          which no other server can believe it already has; thus, there is
          no possibility of inconsistency.

        + If the actual latest version X.N before the write was in an epoch
          not created by the sync site, then it is possible that after
          quorum was established but before any writes happened, a new
          server came online which had a newer version X.N+1 (that was
          committed to less than a quorum of sites before a crash).  If
          a write transaction begins before recovery detects this fact and
          fetches the version X.N+1 database, then the new write will start
          with the version X.N database, essentially creating a version
          fork.  However, when this happens, the resulting database will be
          version Y.1 (with Y>X), because the first write on a new sync
          site always results in a new database.  Thus, the version X.N+1
          database will no longer be latest (because the local version,
          which is examined last, is Y.1), and will end up being discarded.
          This is no worse than if the server with the version X.N+1
          database had not come up until completely after the transaction
          resulting in version Y.1.

      + Step four is to fetch the latest database, if the sync site does
        not already have it.  Since this step replaces the database
        wholesale, there can be no active transactions while it runs, and
        all other database activity must be locked out for the duration of
        the transfer.  This means the database lock must be held for this
        entire step, starting from the check that the UBIK_RECSYNCSITE
        and/or UBIK_RECFOUNDDB bits are still set.  (Note that the existing
        comment which claims that it is impossible for UBIK_RECFOUNDDB not
        to be set at this point is true only if the database is not
        released since checking or updating that bit in step three).  In
        addition, the address lock must be held while setting up the call
        to be used for DISK_GetFile().

      + Step five, again after verifying that this is still the sync site
        and that the UBIK_RECHAVEDB bit is set, is to distribute the new
        database to any servers which do not already have it.  Before this
        can be done, if the database was newly-initialized (with label
        epoch 1), it is relabeled with version 2.1 before it is distributed
        to other sites.  This allows readany transactions to run on such
        a database (though probably not successfully, since the database is
        empty); it is deferred until this point in recovery to insure that
        if any server in the quorum contains a valid database, that version
        is used rather than the empty one.

        For this step, the database lock should be held starting at the
        point at which the UBIK_RECSYNCSITE and/or UBIK_RECHAVEDB bits are
        tested.  In the event the DBWRITING flag is set, the lock must be
        released while waiting for the active write transaction to end.  In
        this case, once the wait is completed, it is necessary to again
        check urecovery_state to determine that this is still the sync site
        and have an up-to-date database before proceeding.

      It is probably simplest for urecovery_Interact() to acquire the
      database lock at the start of step 2 described above, release it for
      the duration of the version-collection part of step 3 and while
      waiting for write transactions to terminate in step 5, and otherwise
      hold it until the end of the cycle.  If no work is needed, this means
      the lock will be held relatively briefly; if it is necessary to fetch
      and or send full database versions, these operations themselves must
      be done with the lock held, and there is little point in releasing it
      between them.  However, it is acceptable to release the lock between
      any steps described above, provided that each time the database lock
      is released and reacquired, 'urecovery_state' is checked to verify
      that no regression has occurred.  Note that outside of the recovery
      thread, the only changes that may occur to 'urecovery_state' are that
      it may be completely cleared by urecovery_ResetState(), and that the
      UBIK_RECLABELDB bit may be set by udisk_commit(), on the sync site.

      The function urecovery_AllBetter() is called under the database lock
      by routines which need to know whether there is a usable database.
      As noted in the discussion of ubeacon_AmSyncSite() above, callers
      must then continue to hold the database lock until any database
      operation is complete.  Further, callers which intend to write to the
      database currently follow a call to urecovery_AllBetter() with a call
      to ubeacon_AmSyncSite(), to determine whether a write is permissible.
      Unfortunately, this is not safe, because it is possible (though
      admittedly unlikely) for urecovery_AllBetter() to succeed on the
      basis of being a non-sync-site with an up-to-date database, and then
      for a subsequent call to ubeacon_AmSyncSite() to return a true result
      (because the latter may block on acquiring the beacon lock, during
      which time this may become the sync site).  To insure that no write
      operation is performed without an up-to-date, writeable database,
      urecovery_AllBetter() should be modified to indicate whether a write
      operation is permitted, which is the case only when UBIK_RECHAVEDB is
      tested and found to be set.

      The functions urecovery_AbortAll(), urecovery_CheckTid(), and
      urecovery_ResetState() all expect to be called with the database lock
      held.

      The function urecovery_LostServer() need not be called with any
      locks; it is merely a wrapper for ubeacon_ReinitServer().

      The RECOVERY package is also responsible at startup for replaying the
      transaction log and initializing an empty database.  These tasks are
      handled by urecovery_Initialize() and its helpers, ReplayLog() and
      InitializeDB().  Because the helpers use PHYS package functions and
      manipulate database structure fields, urecovery_Initialize() should
      acquire the database lock before calling them.


   REMOTE - remote database I/O (DISK) RPCs

      This package provides the DISK RPC package, which mediates remote
      access to the local database copy, when another server is sync site.
      It also includes DISK_UpdateInterfaceAddr(), which is used during
      initialization to verify consistency of configuration across servers,
      and DISK_Probe(), which is used by the RECOVERY module to determine
      whether the server is up.

      As the functions in this package are RPC implementations, they may be
      called at any time and are entirely responsible for their own
      locking.  There can be at most one active remote write transaction at
      a time; this rule is enforced by this package, which maintains the
      global 'ubik_currentTrans' as a pointer to the active transaction.

      For the most part, the RPCs in this module use a common prologue,
      which includes acquiring the database lock, and do not release that
      lock until just before returning.  However, the common prologue
      examines both 'ubik_currentTrans' and the transaction flags without
      benefit of the database lock.  This can be corrected fairly easily by
      acquiring the database lock sooner; however, this requires referring
      to the database via the ubik_dbase global rather than via the 'dbase'
      pointer in 'ubik_currentTrans'.

      Note that SDISK_GetVersion() and SDISK_SetVersion() contain a sanity
      check to ensure they are not running on the sync site, since these
      are called only from that portion of the recovery loop which runs
      only on the sync site.  These calls to ubeacon_AmSyncSite() must be
      made with the database lock held, for the results to be valid.  The
      same holds true for a commented-out check in SDISK_GetFile() (which
      really should remain that way -- there's no reason not to allow any
      authorized caller to do this at any time).

      Additionally, note that SDISK_Commit() acquires a write lock on the
      application cache, to insure that no transactions are running which
      depend on the application cache, page cache, or underlying database
      to change.  The locking order requires that this lock be acquired
      before taking the database lock; thus, both locks must be acquired
      earlier.

      SDISK_SendFile() is really quite bad in its handling of the database
      lock under error conditions.  Presently, an error condition may cause
      it to release the lock too early nor not at all.

      SDISK_UpdateInterfaceAddr() is called by peer servers as part of
      their startup process, to trade lists of additional addresses and
      insure sufficient agreement on configuration to allow the new server
      to start.  It will need to acquire the lock protecting the addr[]
      array attached to each host.


   UBIK (high-level API)

      This package contains the high-level APIs exported by Ubik to
      applications, including initialization, the APIs for creating and
      using transactions, and various utilities.  It also contains a number
      of internal utilities.

      Initialization

         Initialization of Ubik is handled in ubik_ServerInitCommon(),
         which is called by two API functions, ubik_ServerInit() and
         ubik_ServerInitByInfo(), which provide different interfaces for
         passing configuration information.  This code handles
         initialization of the database, locks, and various globals, calls
         the initialization routines for each subsystem, makes sure Rx has
         been initialized, and creates Ubik's Rx services and threads.

         Most of this is done before any Ubik-specific threads are created
         and before Ubik's RPCs can be called, and so takes place without
         holding any lock.  However, since Rx may have been initialized and
         started before Ubik, it is possible that some Rx worker threads
         may have already been created.  In practice, locks internal to Rx
         should insure that any memory writes made before an Rx service is
         created become visible to any worker running on behalf of that
         service; nonetheless, this should not be depended on, and proper
         locking should be used during all phases of initialization.

         Unfortunately, the current code actually starts Rx services before
         initialization is complete, resulting in a number of possible
         races.  The general THREAD SAFETY section below contains
         recommendations for reordering the startup process to eliminate
         this problem, while still avoiding potential deadlock if multiple
         servers start at once.

      ContactQuorum_*

         Nearly a third of ubik.c consists of various ContactQuorum_*
         functions, which are used to make a particular RPC to every server
         in the quorum.  These used to be a single ContactQuorum()
         function, but have been split into multiple functions to regain
         type-safety.  These retain a common prologue and epilogue, some
         small part of which has been abstracted out, but most of which
         remains duplicated in each of these functions.  Indeed, the only
         part unique to each function is the actual RPC call and, in the
         case of ContactQuorum_DISK_WriteV(), code to fall back to multiple
         calls to DISK_Write() on a per-server basis.

         The code common to all ContactQuorum_* functions will need to
         acquire the beacon lock when checking whether a server is up (no
         calls are made to servers which are not marked up), and again when
         marking a server down.  Note that when a server is marked down,
         the 'up' and 'beaconSinceDown' flags must be cleared at the same
         time, without releasing the beacon lock.  In addition, the helper
         function Quorum_StartIO() must obtain the server address lock
         while it obtains a reference on the Rx connection that will be
         used.

         All of the ContactQuorum_* functions must be called with the
         database lock held; this is necessary to protect their access to
         each server's 'version' and 'currentDB' fields and to the local
         database version.  However, these functions release the lock while
         actually making RPCs.  This behavior is new -- ContactQuorum() in
         OpenAFS 1.4 did not release the database lock during RPCs -- but
         it is safe, because no caller of ContactQuorum_*() depends on
         state read under the database lock before the call to remain valid
         after.

      Transactions

         The main Ubik transaction API consists of ubik_BeginTrans(),
         ubik_BeginTransReadAny(), ubik_BeginTransReadAnyWrite(),
         ubik_AbortTrans(), ubik_EndTrans(), ubik_Read(), ubik_Write(),
         ubik_Flush(), ubik_Seek(), ubik_Tell(), ubik_Truncate(), and
         ubik_SetLock().  The ubik_Begin* functions require a database
         pointer; all others require an active transaction pointer.  These
         functions are called directly by the application without any
         locks, and are responsible for their own locking.  Note that many
         of these examine the transaction type before acquiring any locks;
         this is permissible because the transaction type is immutable once
         the transaction is created (but it requires that the call be made
         from the thread that created the transaction, or that the
         application have used some mechanism to insure an appropriate
         memory barrier exists; the same mechanism used to protect the
         transaction pointer should suffice).

         BeginTrans() contains the common code responsible for starting
         a new transaction; this is made available to applications as
         ubik_BeginTrans(), ubik_BeginTransReadAny(), and
         ubik_BeginTransReadAnyWrite().  If a write transaction is in
         progress, this function waits for it to complete.  However, since
         doing so releases the database lock, it is necessary to call
         urecovery_AllBetter() again once the lock has been reacquired.
         Further, in the UBIK_PAUSE case, the DBVOTING flag may have been
         set, so this also must be retested (see the BEACON package section
         for more on the difficulties of UBIK_PAUSE and DBVOTING).

         Both ubik_Flush() and ubik_Write() examine and manipulate the
         transaction I/O vector without benefit of any lock.  These fields
         ('iovec_info' and 'iovec_data') should be protected by the
         database lock.  In the case of ubik_Write(), this may require
         releasing the lock before calling ubik_Flush(), or the
         introduction of a private ubik_Flush_r() which assumes the lock is
         already held (but note that in either case, the lock is released,
         because ubik_Flush() calls ContactQuorum_* functions).

      Utilities

         The functions ubik_GetVersion() and ubik_WaitVersion() provide the
         application with a way to discover the current database version
         and to wait for it to change.  These interfaces are not currently
         used in OpenAFS.  ubik_GetVersion() needs to acquire the database
         lock while copying the database version.

         The internal function ubikGetPrimaryInterfaceAddr() is used by
         Ubik RPCs to determine a peer server's primary address, given the
         IP address from which a call arrived.  This needs to hold the
         server address lock, as described in the section on struct
         ubik_server.

      Application Cache

         The section on struct ubik_dbase describes the operation of the
         application cache.  The function ubik_CheckCache() is provided to
         allow an application to insure the cache is up to date and obtain
         a read lock on the cache which lasts for the duration of the
         transaction.

         Note that ubik_AbortTrans() currently always invalidates the
         application cache by clearing ubik_dbase->cachedVersion, for which
         it requires a write lock on the cache_lock.  This means that
         aborted transactions block waiting for completion of any
         transactions which hold the cache_lock, which may well include all
         outstanding transactions.  In the case of read transactions, which
         cannot have modified the database, this is unnecessary and may
         significantly reduce concurrency.


THREAD SAFETY

   This section discusses changes needed to insure thread safety.

   The initialization sequence in ubik_ServerInitCommon() should be cleaned
   up, to ensure that every subsystem is fully initialized before starting
   any background threads or accepting RPCs.  The correct initialization
   sequence should look something like this:

      + Initialize error tables
      + Allocate and initialize the database structure
      + Initialize Rx
      + Initialize the various subsystems:
        + udisk_Init() (formerly disk.c:DInit)
        + ulock_Init() (new, pulled out of ulock_getLock)
        + uvote_Init()
        + urecovery_Initialize()
        + ubeacon_InitServerList*
      + Create Rx services
      + Start an Rx server thread
      + updateUbikNetworkAddress() (rename and export from beacon.c)
      + Start the beacon and recovery threads

   Initialization functions may need to be added for some subsystems which
   do not currently have them:

      + src/ubik/disk.c:DInit() should be exported and renamed
        udisk_Init(); it can then be called from ubik_ServerInitCommon()
        instead of from udisk_begin().

      + The lock initialization currently done in ulock_getLock() should
        instead be done in a separate ulock_Init(), which should be called
        from ubik_ServerInitCommon()

   A new lock is needed to protect state belonging to the VOTE package,
   including the globals 'ubik_dbVersion' and 'ubik_dbTid', which represent
   the sync site's database state.  See the VOTE package section for
   details on the use of the vote lock.

   A new lock is needed to protect state belonging to the BEACON package,
   including the globals 'ubik_amSyncSite' and 'syncSiteUntil' and several
   fields in the ubik_server structure.  This lock also protects the
   per-server 'up' flag, which is shared among several modules.  In
   addition, some beacon-specific fields are accessed directly by other
   modules.  Thus, this lock must be visible outside the BEACON module, or
   else accessors must be provided for these items.  See the BEACON package
   section for details on the use of the beacon lock.

   A new lock is needed to provide additional protection for the database
   version, transaction counter, and flags, so that these can safely be
   examined (but not updated) by the beacon thread without the need to
   acquire the database lock.  See the BEACON package section for details
   on the use of the version lock.

   A new lock is needed to protect per-server lists of non-primary
   addresses, connections to the VOTE and DISK services on each server, and
   the client-mode security class objects used to set up these connections.
   See the section on struct ubik_server for details on the use of the
   server address lock.

   The required locking order is described by the following list.  Any of
   these locks may be acquired singly; when acquiring multiple locks, they
   should be acquired in the listed order:

      + application cache lock (dbase->cache_lock)
      + database lock (DBHOLD/DBRELE)
      + beacon lock
      + vote lock
      + version lock
      + server address lock

   Some cleanup is required in various places where existing locking
   conventions are not properly obeyed:

      + SVOTE_Beacon() needs to hold the database lock when calling
        urecovery_CheckTid()

      + Management of the database lock in SDISK_SendFile() needs to be
        cleaned up so that it is always released at the proper time.

      + urecovery_Interact() is not consistent about holding the database
        lock when it examines and updates 'urecovery_state'.  It also
        examines the local database version at least once without benefit
        of this lock.  See the RECOVERY package section for details on
        which locks are needed when by urecovery_Interact().

      + ubik_Flush() and ubik_Write() need to acquire the database lock
        before accessing the 'iovec_info' and 'iovec_data' fields.

      + ubik_GetVersion() needs to acquire the database lock before copying
        the database version.

   The various debugging RPCs in the VOTE packages, and the data collection
   routines they call in other packages, generally operate without benefit
   of any locks.  This is probably desirable, as it avoids any possibility
   of debugging operations blocking on or interfering with the rest of the
   system, and allows debugging data to be collected even in the event of
   some unforeseen deadlock.  It is also "safe", in that it does not risk
   damage to data structures or access to uninitialized memory, provided
   that data structure initialization is complete before these functions
   are called.  However, it does mean that these RPCs may return data that
   is not entirely self-consistent.

      + Correctness requires various locks be held during parts of
        SVOTE_Debug() and SVOTE_DebugOld(); see the VOTE package section.

      + SVOTE_Debug() also examines 'ubik_currentTrans' without benefit of
        the database lock.  This usage is not "safe", as that transaction
        may be freed by another thread while SVOTE_Debug() is examining it.

      + SVOTE_XSDebug() should hold the server address lock while copying
        out server addresses, and other locks while copying per-server
        state.

      + udisk_Debug() should hold the database lock while examining the
        database version and collecting page cache statistics.

      + ulock_Debug should use LOCK_LOCK/LOCK_UNLOCK when examining the
        internals of struct Lock.  Ideally, knowledge of these internals
        should be abstracted to src/lwp/lock.h.

      + ubeacon_Debug() should hold the beacon lock while accessing the
        value of 'syncSiteUntil' and, if the access is moved here from
        SVOTE_Debug/SVOTE_DebugOld, 'ubik_amSyncSite'.

CONCURRENCY THOUGHTS

   This section collections information discussed previously about changes
   that may be required or desirable in order to improve concurrency.  Note
   that none of these changes are needed to insure thread safety, provided
   the changes discussed in the previous section are made.

   + PHYS could be made more concurrent by maintaining a separate lock for
     the fdcache, avoiding the need to DBHOLD while calling these.
     However, higher-layer consistency probably requires that certain calls
     or groups of calls to this package be atomic and isolated.

   + Could DISK be made more concurrent by maintaining a separate lock for
     the page cache and or free truncation record list?

   + Could concurrency be improved by maintaining separate per-transaction
     locks?  If so, what is the locking hierarchy with respect to database
     locks and the DISK and PHYS package locks?

   + Could concurrency be improved by maintaining more reader/writer state
     and/or separate write locks to insure consistency of database file
     access without requiring so much DBHOLD?

   + Alternately, could concurrency be improved by requiring that an active
     transaction be accessed only by the thread that created it?


MULTI-DATABASE SUPPORT

   There isn't any.  Some of the Ubik interfaces and even the organization
   of some internal data structures makes it appear as though multiple
   databases can be supported.  However, this is not the case, for several
   reasons:

   + Only a single set of peer servers is tracked, via a global linked list
     of server structures.  Each server structure includes state such as
     the server's database version and whether it is up-to-date, so it is
     not possible to track multiple databases even for a common set of
     servers.
   + The physical I/O package (phys.c) tracks cached open file descriptors
     without regard to what database a file is part of; thus, it cannot
     handle accessing multiple databases at the same time.
   + There are a variety of globals containing state which needs to be
     tracked on a per-database basis.
   + There are a variety of globals which are effectively protected by
     a per-database lock, or by no lock at all.
   + The various RPCs used to communicate between servers assume a single
     database and do not include a database identifier.  Thus, in the
     presence of multiple databases, it would be impossible to determine,
     for example, for which database a DISK RPC was intended.

   The upshot of this is that considerable work would be needed to get Ubik
   into a condition which would allow a single process to maintain multiple
   independent Ubik databases at the same time.  Note that this is
   orthogonal to the question of whether a single multi-file database is
   possible.


OTHER DATA STRUCTURES

   Some other data structures are mentioned in ubik.p.h or elsewhere, but
   are of relatively little interest from the point of view of threading.
   They are described here primarily as an alternative to deleting
   descriptive text that was already written.

   struct ubik_hdr  - on-disk database file header

      This structure represents the on-disk format of the Ubik database
      file header (label).  It is used only for local variables in
      uphys_getlabel() and uphys_setlabel(), which read and write this
      header, respectively.

   struct ubik_stat - database file status

      This structure is used to convey the size and modification timestamp
      of a database file.  It is used as the type of an output parameter of
      uphys_stat(), and for local variables in callers of that function.

   struct ubik_stats - global statistics

      This structure is used to contain "miscellaneous" global statistics.
      Currently that consists of a single counter which is incremented in
      one place (an exceptional condition in ubik_EndTrans) and is never
      read.  The one increment is performed only when holding the database
      lock; thus, this structure may be treated the same as other globals
      protected by non-global locks.

OTHER NOTES

   There appears to be nothing to insure that calls to disk.c:DRead()
   during a write transaction won't find and use a "clean" cached page when
   there is already a dirty page in the cache resulting from an earlier
   write as part of the same transaction.  This means that if a transaction
   reads a page after writing it, it may read back the original data
   instead of the new, and if a transaction writes a page more than once,
   the later writes may corrupt the database.  It seems likely that we are
   merely lucky in this regard, and this problem should be fixed.

   src/ubik/beacon.c:ubeacon_ReinitServer() assumes 'ubik_CRXSecurityRock'
   is in fact an AFS configuration directory pointer, suitable as an
   argument to afsconf_UpToDate().  This is inappropriate;
   'ubik_CRXSecurityRock' is an opaque argument to
   (*ubik_CRXSecurityProc)(), Ubik must not make assumptions about what it
   is.

   It may be worth examining whether it is appropriate for SVOTE_Beacon()
   to call urecovery_CheckTid() only when voting yes, and not whenever
   a beacon is received from a server claiming to be sync site.

   Recently, code has been introduced which attempts to insure that the
   recovery process does not truncate an old database file before a new one
   has been fully transferred, since this could, depending on the timing of
   a server failure, deprive the new quorum of a valid database.  This
   so-called "new recovery" code violates the abstraction provided by the
   PHYS package, encoding specific  knowledge of the underlying file store
   into the REMOTE and RECOVERY packages.  This means that the backend
   provided by phys.c cannot be replaced with an alternative, since in that
   case recovery would do the wrong thing with the transferred database.

   The DBWRITING loop in urecovery_Interact() is not portable, as it
   depends the contents of the timeout after a call to select(2).  This was
   fine for IOMGR_Select(), which has defined behavior, but not for
   select(2).