[OpenAFS-devel] fileserver profiling

Sun, 13 Mar 2005 14:32:15 -0500

On Tue, 8 Mar 2005 14:45:36 -0500, Kyle Moffett <mrmacman_g4@mac.com> wrote:
> I believe one of the requirements was that the time doesn't go
> backwards?  If you don't use the memory barriers, then either way you
> write the time to shared mem, the following situations could happen
> on a multiproc box:
> 

ok.  I've been operating under the assumption that implementing a
purely userspace gettimeofday would be too costly, but apparently that
assumption is false.  If you're only storing seconds since the epoch,
your value is atomic, so you don't need sanity checking to prevent
time apparently going backwards.  I agree that if you're going to
store usecs too, then you need the read barriers.  When I said we
could coalesce ops and reduce the number of barriers, I was thinking
about a technique similar to RCU that Herlihy discussed back in the
late 80's:

#define APPROX_GENERATIONS 3  /* value depends on frequency of update */

extern volatile struct timeval approx_time[APPROX_GENERATIONS];
extern volatile int approx_time_cur_idx = 0;

then your loop could be simplified a bit:

struct timeval tv = { 0, 0 };
struct timespec sleeptime = { 0, 0 };
register int new_idx;
while(1) {
  new_idx = (approx_time_cur_idx+1)%APPROX_GENERATIONS;
  assert(gettimeofday(&tv, NULL) == 0);
  memcpy(&approx_time[new_idx].tv, &tv, sizeof(struct timeval));
  write_barrier();
  approx_time_cur_idx++;
  write_barrier();
  sleeptime.tv_nsec = 1000000000 - (1000 * tv.tv_usec);
  nanosleep(&sleeptime,NULL);
}

now our read time function is just:

static approx_invocations = 0;
static approx_is_ok = 1;
struct timeval tv;
if (approx_is_ok) {
  do {
    register int idx = approx_time_cur_idx;
    memcpy(&tv, &approx_time[idx%APPROX_GENERATIONS], sizeof(struct timeval));
  } while (idx != approx_time_cur_idx);
  if ((approx_invocations%100)==0) {
    struct timeval check;

    gettimeofday(&check,NULL);

(adding back in all the error handling you do, of course ;)

> > Also, due to the very coarse-grained nature of barriers on most
> > architectures, we wouldn't know, without substantial profiling, how
> > large of an impact this would have on things like instruction-level
> > parallelism and out-of-order execution.
> 
> It's significantly better than a whole syscall with instruction cache
> killing
> on _all_ platforms, except for a very few that do gettimeofday in a
> vDSO.
> 

Oh, I totally agree.  I'm just saying that if we're only interested in
tv_sec, we can make things even faster yet.

> > Aside from the fact that I don't think we need barriers, this should
> > be possible with one barrier.  Herlihy's paper on lock-free and
> > wait-free algorithms has a good discussion of implementations for
> > non-atomic types.  I also don't see the need to store
> > microsecond-level accurate time.  I was simply proposing a solution to
> > the high syscall overhead when we only need very fuzzy time measures.
> > When we need sub-second level accurate time, I don't see any choice
> > but to trap into the kernel.  Doing this in userspace is simply too
> > unreliable, and too high-overhead.
> 
> I think that it does require multiple read barriers in order to ensure
> that
> we read the values in the specified order without terrible caching
> effects.
> Feel free to prove me wrong, though :-D
> 

I think my code snippet above is illustrative.  At most, we should
only need 1 read barrier:

struct timeval tv;
register int idx;
do {
  idx = approx_time_cur_idx;
  memcpy(&tv, &approx_time[idx].tv, sizeof(struct timeval));
  read_barrier();
} while (idx != approx_time_cur_idx); 

> > [snip]
> >> static inline void write_time_nolk(volatile long *time, struct timeval
> >> val) {
> >>         time[0] = val.tv_sec;
> >>         write_memory_barrier();
> >>         time[1] = val.tv_usec;
> >>         write_memory_barrier();
> >>         time[2] = val.tv_sec;
> >>         write_memory_barrier();
> >> }
> >
> > Once again, I always thought that with lock-free or wait-free
> > algorithms you generally wanted to coalesce the updates on global
> > structures into an atomic operation to improve performance, and to
> > simplify async correctness proofs.  I don't see anything wrong with
> > this implementation, I just think it's a tad overkill for what I was
> > proposing.
> 
> My goal for this implementation
> 
> >
> > [snip]
> >> startup:
> >>         /* Open the file */
> >>         fd = open(argv[1], O_RDWR|O_CREAT|O_EXCL, 0666);
> >>         if (fd < 0)
> >>                 usage(argv[0],errno,"Could not open file:
> >> '%s'",argv[1]);
> >>
> >
> > Hmm.  I'd feel better with 0644 perms...
> 
> That's what the umask is for :-D  By default the umask is 022 or 077,
> which
> would convert 0666 into 0644 or 0600, as appropriate :-D.
> 
> > [snip]
> >>                 /* Get a new timestamp */
> >>                 int err = gettimeofday(&newtime,NULL);
> >>                 if (err) usage(argv[0],errno,"Could not get the
> >> time");
> >>
> >>                 /* Bound the microseconds within 10^6 */
> >>                 newtime.tv_usec %= 1000000;
> >
> > Are there conditions where gettimeofday returns a bogus value for
> > tv_usec, and 0 for the return code?
> 
> There are some systems (I don't remember which) where the syscall
> returns a
> number from 0 to ULONG_MAX and until it wraps to ULONG_MAX % 1000000 +
> 1.
> 
> > Since this app isn't running under a realtime scheduler, you're at the
> > mercy of the timer interrupt much of the time.  This won't work
> > reliably on a lot of platforms.  IIRC, Solaris defaults to a 100Hz
> > timer.  I also wouldn't want an app waking up every millisecond to
> > update what should be considered a fuzzy time.
> 
> This program was designed to provide a reasonably frequently updated and
> monotonic gettimeofday replacement that could be used without entering a
> syscall.  If you only want the time() value, you could remove the memory
> barriers and sleep for a second or so.  That's assuming that you don't
> mind being off by about a second in the worst case. (A reasonably good
> assumption for a fuzzy time.
> 
> > [snip]
> >>
> >>         /* First prevent new accesses */
> >>         if (unlink(argv[1]))
> >>                 usage(argv[0],errno,"Could not delete file:
> >> '%s'",argv[1]);
> >>
> >>         /* Now tell listening processes that we're stopped */
> >>         write_time_nolk(oldtime,zerotime);
> >
> > This brings up another important point: we need a way for clients to
> > verify that our daemon is still working.  A simple way to do this
> > would be to keep track of approximate_time() invocations, and
> > occasionally verify against time(), and provide a field for any app to
> > assert a vote of no-confidence.
> 
> That's a completely orthagonal concept, and can be done completely
> within the clients easily as you say.  Instead of voting, why not
> just have the client who notices attempt to obtain an exclusive lock
> on the file in nonblocking mode?  The daemon could be made to maintain
> a lock on the file as long as it's running.  When a process obtains
> an exclusive lock, remove the file and start another daemon, then
> reconnect.  There are a few other considerations, but that extra code
> is the "Just in case" and doesn't need to be optimized.
> 
> > In reality, this mmap'd file concept would work a lot better if it
> > were handled just like the new /dev/poll interface.  The scheduler
> > could simply write the new epoch time into a page every second, and
> > there could just be a character device you could mmap to get the time
> > without trapping into kernel mode.  But, that's a discussion for
> > another list...
> 
> Linux has a couple archs working on vDSOs that implement gettimeofday
> completely in userspace without
> 
> Cheers,
> Kyle Moffett
> 
> -----BEGIN GEEK CODE BLOCK-----
> Version: 3.12
> GCM/CS/IT/U d- s++: a18 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$
> L++++(+++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+
> PGP+++ t+(+++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$ r
> !y?(-)
> ------END GEEK CODE BLOCK------
> 
> 

-- 
Tom Keiser
tkeiser@gmail.com