[OpenAFS-devel] fileserver profiling

Tom Keiser <tkeiser@gmail.com>
Tue, 8 Mar 2005 14:01:25 -0500


comments inline

On Mon, 7 Mar 2005 22:06:42 -0500, Kyle Moffett <mrmacman_g4@mac.com> wrote:
> 
> static void usage(const char *arg0, int err, const char *fmt, ...)
>         __attribute__((__noreturn__));
> 

isn't __attribute__ a gcc-ism?
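
If portability off of gcc matters, the usual trick is to hide the
attribute behind a macro so other compilers just see a no-op.  A rough
sketch (the macro name is mine, not from the patch):

#if defined(__GNUC__)
# define AFS_NORETURN __attribute__((__noreturn__))
#else
# define AFS_NORETURN
#endif

static void usage(const char *arg0, int err, const char *fmt, ...)
        AFS_NORETURN;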


[snip]
> 
> /* These are for PPC only; the read memory barrier does too much anyways */
> #define read_memory_barrier()   __asm__ __volatile__ ("eieio": : :"memory")
> #define write_memory_barrier()  __asm__ __volatile__ ("eieio": : :"memory")
> 

Read and write barrier instructions are useful when you need
consistency at the instruction level.  What we are talking about here
is keeping a fuzzy userspace notion of time that is only accurate to
within a 1-2 second interval, so I don't think we need to introduce
them for these purposes.  Furthermore, all the barriers ensure is that
our process is in perfect lock-step with the timekeeper process's
_fuzzy_ time.  Also, due to the very coarse-grained nature of barriers
on most architectures, we wouldn't know, without substantial
profiling, how large an impact they would have on things like
instruction-level parallelism and out-of-order execution.
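
For the 1-2 second accuracy we actually care about, the shared state
could just be a single aligned long holding the epoch seconds.  On any
platform where aligned long accesses are atomic (the same assumption
the quoted code already makes), a plain load and a plain store are
enough, with no barriers at all.  A minimal sketch (modulo names):

static volatile long fuzzy_seconds;

/* reader side: one plain load is all we need for 1-2 second accuracy */
static inline long approximate_time(void)
{
        return fuzzy_seconds;
}

/* writer side (the timekeeper daemon): one plain store per update */
static inline void fuzzy_time_update(long now)
{
        fuzzy_seconds = now;
}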


[snip]
> /*
>   * The "old_time" is the currently stored value. NOTE: This value is
>   * designed to be read and written locklessly, assuming that reading
>   * and writing a "long" is atomic on your platform:
>   *   To read:
>   *     do {
>   *       sec1 = time[2];
>   *       read_memory_barrier();
>   *       usec = time[1];
>   *       read_memory_barrier();
>   *       sec2 = time[0];
>   *       read_memory_barrier();
>   *     } while(sec1 != sec2);

Aside from the fact that I don't think we need barriers, this should
be possible with one barrier.  Herlihy's paper on lock-free and
wait-free algorithms has a good discussion of implementations for
non-atomic types.  I also don't see the need to store
microsecond-accurate time.  I was simply proposing a solution to the
high syscall overhead in the cases where we only need very fuzzy time
measurements.  When we do need sub-second accuracy, I don't see any
choice but to trap into the kernel; doing that in userspace is simply
too unreliable and too high-overhead.
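
For completeness, if we did keep the sec/usec/sec layout, the read
side only needs a single barrier per pass, roughly along these lines
(just the shape of the idea, not a careful proof; Herlihy's
constructions are the place to check the details):

#include <sys/time.h>

/* read_memory_barrier() is the macro from the quoted code above */
static inline void read_time_nolk(const volatile long *time,
                                  struct timeval *out)
{
        long sec1, sec2, usec;

        do {
                sec2 = time[2];         /* written last by the writer */
                read_memory_barrier();  /* order the loads below after it */
                usec = time[1];
                sec1 = time[0];         /* written first by the writer */
        } while (sec1 != sec2);

        out->tv_sec  = sec1;
        out->tv_usec = usec;
}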


[snip]
> static inline void write_time_nolk(volatile long *time, struct timeval val) {
>         time[0] = val.tv_sec;
>         write_memory_barrier();
>         time[1] = val.tv_usec;
>         write_memory_barrier();
>         time[2] = val.tv_sec;
>         write_memory_barrier();
> }

Once again, my understanding is that with lock-free or wait-free
algorithms you generally want to coalesce updates to global
structures into a single atomic operation, both to improve
performance and to simplify the asynchronous correctness proofs.  I
don't see anything wrong with this implementation; I just think it's
a tad overkill for what I was proposing.
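
To illustrate what I mean by coalescing: on an LP64 platform the whole
update collapses into a single store if sec and usec are packed into
one long (usec fits in 20 bits).  A sketch, assuming 64-bit longs,
though for what I was proposing the usec half could just go away:

#include <sys/time.h>

/* writer: one store publishes a consistent sec/usec pair */
static inline void write_time_packed(volatile long *time, struct timeval val)
{
        *time = (val.tv_sec << 20) | (val.tv_usec & 0xfffff);
}

/* reader: one load, no retry loop and no barriers */
static inline void read_time_packed(const volatile long *time,
                                    struct timeval *out)
{
        long t = *time;
        out->tv_sec  = t >> 20;
        out->tv_usec = t & 0xfffff;
}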


[snip]
> startup:
>         /* Open the file */
>         fd = open(argv[1], O_RDWR|O_CREAT|O_EXCL, 0666);
>         if (fd < 0)
>                 usage(argv[0],errno,"Could not open file: '%s'",argv[1]);
> 

Hmm.  I'd feel better with 0644 perms...


[snip]
>                 /* Get a new timestamp */
>                 int err = gettimeofday(&newtime,NULL);
>                 if (err) usage(argv[0],errno,"Could not get the time");
> 
>                 /* Bound the microseconds within 10^6 */
>                 newtime.tv_usec %= 1000000;
> 

Are there conditions where gettimeofday returns a bogus value for
tv_usec, and 0 for the return code?
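
If we are going to be paranoid about it, carrying any overflow into
tv_sec seems safer than silently throwing it away with a modulus.
Something like:

#include <sys/time.h>

/* sketch: normalize a timeval instead of just taking the modulus */
static inline void timeval_normalize(struct timeval *tv)
{
        if (tv->tv_usec >= 1000000) {
                tv->tv_sec  += tv->tv_usec / 1000000;
                tv->tv_usec %= 1000000;
        }
}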


[snip]
>                 /* Check our signal status, if we've received one, then we *
>                  * need to handle it and restart or clean up and quit.     */
>                 switch(last_signal) {
>                 case 0:
>                         /* Since we've got no signal, sleep and repeat */
>                         usleep(1000);

Since this app isn't running under a realtime scheduler, you're at the
mercy of the timer interrupt much of the time, so this won't work
reliably on a lot of platforms.  IIRC, Solaris defaults to a 100Hz
timer.  I also wouldn't want an app waking up every millisecond just
to update what should be considered a fuzzy time.  In a
proof-of-concept daemon I wrote a while back, I just nanosleep() for
the time remaining in the current second, plus a small constant.
That constant is dynamically adjusted to provide negative feedback if
we start waking up just before the second changes, or if we start
drifting too far into the second due to sampling overhead.
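
The idea, roughly (not the actual code from that daemon, just a
sketch of the approach; the constants are arbitrary):

#include <time.h>
#include <sys/time.h>

/* start a couple of ms past the second boundary; nudged up or down
 * below depending on where we actually wake up */
static long fudge_ns = 2 * 1000 * 1000;

static void sleep_to_next_second(void)
{
        struct timeval now, after;
        struct timespec ts;
        long remain_ns;

        gettimeofday(&now, NULL);
        remain_ns = (1000000 - now.tv_usec) * 1000L + fudge_ns;

        ts.tv_sec  = remain_ns / 1000000000L;
        ts.tv_nsec = remain_ns % 1000000000L;
        nanosleep(&ts, NULL);

        /* negative feedback: waking up before the second rolled over
         * pushes the fudge factor out; drifting too deep into the new
         * second pulls it back in */
        gettimeofday(&after, NULL);
        if (after.tv_sec == now.tv_sec)
                fudge_ns += 500 * 1000;
        else if (after.tv_usec > 10 * 1000 && fudge_ns > 500 * 1000)
                fudge_ns -= 500 * 1000;
}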


[snip]
> 
>         /* First prevent new accesses */
>         if (unlink(argv[1]))
>                 usage(argv[0],errno,"Could not delete file: '%s'",argv[1]);
> 
>         /* Now tell listening processes that we're stopped */
>         write_time_nolk(oldtime,zerotime);
> 

This brings up another important point: we need a way for clients to
verify that our daemon is still working.  A simple way to do this
would be to keep track of approximate_time() invocations, occasionally
verify against time(), and provide a field that lets any app assert a
vote of no-confidence.
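
Roughly what I have in mind (field and function names are just for
illustration):

#include <time.h>

/* extra fields the shared page could carry so clients can sanity
 * check the daemon */
struct fuzzy_clock {
        volatile long seconds;        /* the fuzzy time itself */
        volatile long no_confidence;  /* any client may set this to 1 */
};

/* per-process invocation count; doesn't need to be exact */
static long fuzzy_calls;

/* something approximate_time() could do on the client side: count
 * invocations and, every Nth call, verify against the real clock;
 * if the daemon has wedged or drifted, cast a vote of no-confidence */
static void fuzzy_clock_check(struct fuzzy_clock *clk)
{
        long real, fuzzy;

        if ((++fuzzy_calls & 0x3ff) != 0)   /* only check every 1024th call */
                return;

        real  = (long) time(NULL);
        fuzzy = clk->seconds;
        if (real - fuzzy > 2 || fuzzy - real > 2)
                clk->no_confidence = 1;
}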

In reality, this mmap'd file concept would work a lot better if it
were handled just like the new /dev/poll interface.  The scheduler
could simply write the new epoch time into a page every second, and
there could just be a character device you could mmap to get the time
without trapping into kernel mode.  But that's a discussion for
another list...

-- 
Tom Keiser
tkeiser@gmail.com