[OpenAFS-devel] fileserver profiling

Kyle Moffett mrmacman_g4@mac.com
Tue, 8 Mar 2005 14:45:36 -0500


On Mar 08, 2005, at 14:01, Tom Keiser wrote:
> On Mon, 7 Mar 2005 22:06:42 -0500, Kyle Moffett <mrmacman_g4@mac.com> 
> wrote:
>> static void usage(const char *arg0, int err, const char *fmt, ...)
>>         __attribute__((__noreturn__));
> isn't __attribute__ a gcc-ism?

Ah, true, it is.  I completely forgot.

> [snip]
>> /* These are for PPC only; the read memory barrier does too much anyways */
>> #define read_memory_barrier()   __asm__ __volatile__ ("eieio": : :"memory")
>> #define write_memory_barrier()  __asm__ __volatile__ ("eieio": : :"memory")
>
> Read and write barrier instructions are useful when you need
> consistency at the instruction-level.  We are talking about using a
> context to keep a fuzzy userspace notion of time around 1-2 second
> interval.  I don't think we need to introduce them for these purposes.

I believe one of the requirements was that the time never goes
backwards?  If you don't use the memory barriers, then whichever order
you write the two words to shared mem, the following situations can
happen on a multiproc box:

Assume a race between the writer updating the time from:
	10 seconds and 900,000 microseconds
to:
	11 seconds and 100,000 microseconds

If you update secs then usecs:
CPU1          CPU2
               read(sec,usec) => 10, 900000
write(sec)
               read(sec,usec) => 11, 900000
write(usec)
               read(sec,usec) => 11, 100000

If you update usecs then secs:
CPU1          CPU2
               read(sec,usec) => 10, 900000
write(usec)
               read(sec,usec) => 10, 100000
write(sec)
               read(sec,usec) => 11, 100000

Oops, it went backwards!  (In the first ordering the reader briefly sees
11.9 before 11.1, so it goes backwards there too.)  And all of that assumes
the read operation is atomic, which it isn't.
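
To make that concrete, the unprotected writer is basically just two plain
stores (a sketch; shared[] stands in for the two words in the mmap'd page):

#include <sys/time.h>

/* Sketch of the unprotected update.  A reader on another CPU can run
 * between the two stores (and the compiler/CPU may reorder them), so it
 * can observe new seconds with stale microseconds or vice versa, exactly
 * as in the interleavings above. */
static void write_time_racy(volatile long *shared)
{
        struct timeval now;
        gettimeofday(&now, NULL);
        shared[0] = now.tv_sec;    /* a read here sees 11, 900000 */
        shared[1] = now.tv_usec;   /* swap the stores and it sees 10, 100000 */
}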

>  Furthermore, all this ensures is that our process is in perfect
> lock-step with the timekeeper processes' _fuzzy_ time.

This implementation provides basically the same guarantees as the
gettimeofday syscall does, except without the syscall and the complete
instruction-queue flush.  If you want a value that increments every time
you call the function, then just add one to the stored copy of the
microsecond value when it doesn't change.
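
Something like this, for instance (a sketch; 'now' is whatever was just
read out of the shared page, and the static keeps it from ever repeating):

#include <sys/time.h>

/* Sketch: remember the last value handed out and bump the microseconds
 * whenever the shared time hasn't moved, so each call returns a strictly
 * larger timestamp. */
static struct timeval make_strictly_increasing(struct timeval now)
{
        static struct timeval last;

        if (now.tv_sec < last.tv_sec ||
            (now.tv_sec == last.tv_sec && now.tv_usec <= last.tv_usec)) {
                now = last;
                if (++now.tv_usec >= 1000000) {
                        now.tv_usec = 0;
                        now.tv_sec++;
                }
        }
        last = now;
        return now;
}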

> Also, due to the very coarse-grained nature of barriers on most
> architectures, we wouldn't know, without substantial profiling, how
> large of an impact this would have on things like instruction-level
> parallelism and out-of-order execution.

It's significantly better than a whole syscall, with its instruction-cache
killing, on _all_ platforms except for a very few that do gettimeofday in
a vDSO.

> Aside from the fact that I don't think we need barriers, this should
> be possible with one barrier.  Herlihy's paper on lock-free and
> wait-free algorithms has a good discussion of implementations for
> non-atomic types.  I also don't see the need to store
> microsecond-level accurate time.  I was simply proposing a solution to
> the high syscall overhead when we only need very fuzzy time measures.
> When we need sub-second level accurate time, I don't see any choice
> but to trap into the kernel.  Doing this in userspace is simply too
> unreliable, and too high-overhead.

I think that it does require multiple read barriers in order to ensure
that we read the values in the specified order without terrible caching
effects.  Feel free to prove me wrong, though :-D
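
For reference, the read side I have in mind is roughly this (a sketch: it
assumes the read_memory_barrier() macro quoted earlier, and it reads the
three words in the opposite order from the write_time_nolk() quoted just
below, which is what makes the seconds/seconds comparison meaningful):

static inline struct timeval read_time_nolk(volatile long *time)
{
        struct timeval val;
        long sec2;

        do {
                sec2        = time[2];
                read_memory_barrier();
                val.tv_usec = time[1];
                read_memory_barrier();
                val.tv_sec  = time[0];
                read_memory_barrier();
        } while (val.tv_sec != sec2);   /* writer was mid-update; retry */

        return val;
}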

> [snip]
>> static inline void write_time_nolk(volatile long *time, struct timeval val) {
>>         time[0] = val.tv_sec;
>>         write_memory_barrier();
>>         time[1] = val.tv_usec;
>>         write_memory_barrier();
>>         time[2] = val.tv_sec;
>>         write_memory_barrier();
>> }
>
> Once again, I always thought that with lock-free or wait-free
> algorithms you generally wanted to coalesce the updates on global
> structures into an atomic operation to improve performance, and to
> simplify async correctness proofs.  I don't see anything wrong with
> this implementation, I just think it's a tad overkill for what I was
> proposing.

My goal for this implementation was a monotonic, reasonably fine-grained
gettimeofday() replacement rather than just a fuzzy seconds counter, which
is why it's a bit heavier than what you were proposing.

>
> [snip]
>> startup:
>>         /* Open the file */
>>         fd = open(argv[1], O_RDWR|O_CREAT|O_EXCL, 0666);
>>         if (fd < 0)
>>                 usage(argv[0],errno,"Could not open file: '%s'",argv[1]);
>>
>
> Hmm.  I'd feel better with 0644 perms...

That's what the umask is for :-D  By default the umask is 022 or 077,
which would convert 0666 into 0644 or 0600, as appropriate :-D.
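
For what it's worth, the effective create mode is just (0666 & ~umask),
which you can check with a standalone sketch like this:

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
        /* umask() sets the mask and returns the old one, so set it to
         * zero and immediately restore it just to read the value. */
        mode_t mask = umask(0);
        umask(mask);
        printf("open(..., 0666) creates mode 0%03o\n",
               (unsigned)(0666 & ~mask));
        return 0;
}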

> [snip]
>>                 /* Get a new timestamp */
>>                 int err = gettimeofday(&newtime,NULL);
>>                 if (err) usage(argv[0],errno,"Could not get the time");
>>
>>                 /* Bound the microseconds within 10^6 */
>>                 newtime.tv_usec %= 1000000;
>
> Are there conditions where gettimeofday returns a bogus value for
> tv_usec, and 0 for the return code?

There are some systems (I don't remember which) where the syscall returns
a tv_usec that counts from 0 all the way up to ULONG_MAX and wraps, rather
than staying below 1,000,000, so the modulo is there to bound it.

> Since this app isn't running under a realtime scheduler, you're at the
> mercy of the timer interrupt much of the time.  This won't work
> reliably on a lot of platforms.  IIRC, Solaris defaults to a 100Hz
> timer.  I also wouldn't want an app waking up every millisecond to
> update what should be considered a fuzzy time.

This program was designed to provide a reasonably frequently updated and
monotonic gettimeofday replacement that could be used without entering a
syscall.  If you only want the time() value, you could remove the memory
barriers and sleep for a second or so.  That's assuming that you don't
mind being off by about a second in the worst case.  (A reasonably good
assumption for a fuzzy time.)
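
In that case the daemon's main loop collapses into something like this (a
sketch; shared_sec is assumed to point at one naturally aligned long in
the mmap'd file):

#include <time.h>
#include <unistd.h>

/* Sketch of the stripped-down variant: one word, no barriers, one wakeup
 * per second, at most about a second of staleness for readers. */
static void fuzzy_time_loop(volatile long *shared_sec)
{
        for (;;) {
                *shared_sec = (long)time(NULL);   /* single aligned store */
                sleep(1);
        }
}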

> [snip]
>>
>>         /* First prevent new accesses */
>>         if (unlink(argv[1]))
>>                 usage(argv[0],errno,"Could not delete file: '%s'",argv[1]);
>>
>>         /* Now tell listening processes that we're stopped */
>>         write_time_nolk(oldtime,zerotime);
>
> This brings up another important point: we need a way for clients to
> verify that our daemon is still working.  A simple way to do this
> would be to keep track of approximate_time() invocations, and
> occasionally verify against time(), and provide a field for any app to
> assert a vote of no-confidence.

That's a completely orthogonal concept, and as you say it can easily be
done entirely within the clients.  Instead of voting, why not just have
the client that notices attempt to obtain an exclusive lock on the file
in nonblocking mode?  The daemon could be made to hold a lock on the file
for as long as it's running.  When a process does obtain an exclusive
lock, it removes the file, starts another daemon, and reconnects.  There
are a few other considerations, but that extra code is the "just in case"
path and doesn't need to be optimized.
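
A sketch of what the client-side check could look like with flock()
(assuming the daemon grabs and holds LOCK_EX on the file for its whole
lifetime):

#include <sys/file.h>
#include <errno.h>

/* Sketch: returns 1 if the timekeeper daemon still holds its exclusive
 * lock on fd, 0 if we managed to grab the lock ourselves (daemon gone,
 * time to clean up and restart it). */
static int daemon_alive(int fd)
{
        if (flock(fd, LOCK_EX | LOCK_NB) == 0) {
                flock(fd, LOCK_UN);     /* nobody was holding it */
                return 0;
        }
        return errno == EWOULDBLOCK;
}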

> In reality, this mmap'd file concept would work a lot better if it
> were handled just like the new /dev/poll interface.  The scheduler
> could simply write the new epoch time into a page every second, and
> there could just be a character device you could mmap to get the time
> without trapping into kernel mode.  But, that's a discussion for
> another list...

Linux has a couple of archs working on vDSOs that implement gettimeofday
completely in userspace, without trapping into the kernel at all.

Cheers,
Kyle Moffett

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCM/CS/IT/U d- s++: a18 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$
L++++(+++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+
PGP+++ t+(+++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$ r  
!y?(-)
------END GEEK CODE BLOCK------