
Re: benchmark for atomic.


On Mon, 2013-12-16 at 22:30 +0100, Ondřej Bílka wrote:
> On Mon, Dec 16, 2013 at 10:23:34PM +0100, Ondřej Bílka wrote:
> > > 
> > > > Second problem is that fastbins are per-arena, not per-thread, which
> > > > forces us to use atomic operations. These are expensive (typically more than 50 cycles).
> > > 
> > > Especially on x86, atomic operations that *hit in the cache* have become
> > > very fast compared to their costs in the past.  I don't have current
> > > numbers, but I believe the 50 cycle number is too high; I vaguely
> > > remember 10-20.
> > 
> > A simple benchmark could check the real cost. A problem is that while for Core 2
> > and i7 the cost of a CAS is around 20 cycles, for Bulldozer it is still 50 cycles.
> > 
> > Even with these, a malloc+free pair contains 5 atomic instructions on the
> > fastbin path, which gives a 100-cycle penalty (malloc: lock, get item from bin,
> > unlock; free: set bit that fastbins are in use, put item into bin).
> > 
> A benchmark for the performance of CAS is here.

That is not a meaningful benchmark for the performance of CAS in
real-world synchronization code.

First, you measure the *whole* loop, not just the CAS.  At the very
least, you need to compare the CAS against a nonatomic CAS, a
nonatomic RMW op (e.g., an increment), or a plain store: we need the
CAS because we need to change state, so unless it's just a lock, you
typically can't get rid of the state change itself.  See the attached
modification of your sample program.

On my machine (an i5) with GCC 4.4 and -O2, this gives me the
following (without additional WORK; the first number is the cycles per
loop iteration):
nonatomic store 1 (0)
nonatomic CAS 2 (0)
nonatomic inc 5 (0)
CAS 15 (0)

So, there's just a 10-cycle "overhead" for the atomic CAS.
(Note that I changed x to be volatile in the benchmark because, unlike
in this benchmark loop, the compiler typically would not be able to
infer which value the location held before the modification.)

Second, you are testing "throughput" of CAS, essentially; all the loop
does is apply the CAS to the very same memory location again and again.
This is *not* how typical synchronization code looks.
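
As an illustration: in typical synchronization code, the CAS sits in a
load/compute/retry loop rather than being applied back-to-back to the
same line.  Here is a minimal sketch of mine of a lock-free stack push
(using the same __sync builtins as the attached program):

struct node { struct node *next; };
struct stack { struct node *top; };

/* Lock-free push: read the current top, link the new node to it, and
   retry the CAS until no other thread has changed top in between.  */
void
push (struct stack *s, struct node *n)
{
  struct node *old;
  do
    {
      old = s->top;
      n->next = old;
    }
  while (!__sync_bool_compare_and_swap (&s->top, old, n));
}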

For example, if there's additional work in the loop (see the WORK macro;
ignore the second number), then I get:
nonatomic store 8 (16716264609715748670)
nonatomic CAS 9 (16716264609715748670)
nonatomic inc 8 (16716264609715748670)
CAS 15 (16716264609715748670)

So, suddenly the CAS "overhead" is just 6 cycles.  Note that I'm not
claiming that typical synchronization code does a lot of multiplications
or such -- I'm just illustrating that there is instruction-level
parallelism and that the performance effects are nontrivial.  Real
(synchronization) code would likely have more branches and loads, and
that might interact with the CAS in plenty of other ways, depending on
the HW implementation.

Thus, you have to compare full algorithms against each other.

Regarding the allocator, I believe it's more worthwhile to focus on
reducing cache and TLB misses first.  *If* allocations are frequent,
then potential sources are false sharing, contention on atomic ops
(including lack of back-off), contention on locks (including spinning
in a nonoptimal way), etc.
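
For example, a CAS retry loop with simple exponential back-off could
look like the following sketch (the pause hint and the bound of 1024
spins are illustrative choices of mine, not tuned values):

#include <stdint.h>

/* Atomically increment *p, backing off for increasingly long spin
   delays when the CAS fails due to contention.  */
void
inc_with_backoff (volatile uint64_t *p)
{
  unsigned delay = 1, j;
  uint64_t old = *p;
  while (!__sync_bool_compare_and_swap (p, old, old + 1))
    {
      for (j = 0; j < delay; j++)
        __asm__ __volatile__ ("pause");
      if (delay < 1024)
        delay *= 2;
      old = *p;
    }
}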
A second source is suboptimal placement of allocations, leading to
larger cache working set sizes, more pages being used, etc.  That
depends a lot on the workload and access patterns of the application,
of course.  But it can be worthwhile to spend a bit more time on
allocation when that helps avoid cache misses once the allocated data
is accessed.
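
To illustrate the placement point with a sketch: carving list nodes
out of one contiguous block keeps a traversal on consecutive cache
lines, whereas allocating each node separately may scatter them across
the heap and enlarge the working set:

#include <stdlib.h>

struct node { struct node *next; int v; };

/* Build an n-node list from a single contiguous allocation.  */
struct node *
make_list (size_t n)
{
  size_t i;
  struct node *nodes;
  if (n == 0 || (nodes = malloc (n * sizeof *nodes)) == NULL)
    return NULL;
  for (i = 0; i + 1 < n; i++)
    nodes[i].next = &nodes[i + 1];
  nodes[n - 1].next = NULL;
  return nodes;
}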

I really appreciate your eagerness to improve the allocator, and the
time you spend on it.  Let's just be careful and work as thoroughly as
required by a topic that's as complex as allocation.
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>
#include <pthread.h>

/* Read the CPU timestamp counter.  Not a serializing instruction, but
   good enough when averaging over a billion iterations.  */
static uint64_t __attribute__((noinline)) rdtsc(void)
{
  uint32_t lo, hi;
  __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
  return (uint64_t)hi << 32 | lo;
}

/* x is volatile so the compiler cannot infer the stored value across
   iterations; y is scratch space for the optional worker thread.  */
volatile uint64_t x;
volatile char y[100];

/* Busy worker for the optional second thread (-DTHREAD): keeps
   writing so it competes for CPU resources.  pthread start routines
   must take and return void *.  */
void *fn(void *arg)
{
  int i = 0;
  (void) arg;
  while (1)
    {
      y[i % 100] = 42;
      i++;
    }
  return NULL;
}

int main (void)
{
  x = 0;
  int i;
  int times = 1000000000;
  uint64_t dummy;
  uint64_t dummy2;
  pthread_t thread;

#ifdef THREAD
  pthread_create (&thread, NULL, fn, NULL);
#endif
/* Use the empty definition of WORK to measure the bare loop; the
   nontrivial one adds independent work to expose instruction-level
   parallelism.  */
//#define WORK
#define WORK dummy = (dummy + 5) * 13 * (dummy2 % 5 == 0 ? 17 : i); dummy2 = (dummy2 + 3) * 7383097231 * i;

  /* Baseline: plain (nonatomic) store to the volatile location.  */
  dummy = dummy2 = 0;
  uint64_t start = rdtsc();
  for (i = 0; i < times; i++)
   {
     WORK
     x = i + 1;
   }
  uint64_t end = rdtsc();
  printf ("nonatomic store %" PRIu64 " (%" PRIu64 ")\n", (end - start) / times, dummy + dummy2);

  /* Nonatomic compare-and-swap: load, compare, conditional store.  */
  dummy = dummy2 = 0;
  start = rdtsc();
  for (i = 0; i < times; i++)
   {
     WORK
     if (x == i)
       x = i + 1;
   }
  end = rdtsc();
  printf ("nonatomic CAS %" PRIu64 " (%" PRIu64 ")\n", (end - start) / times, dummy + dummy2);

  /* Nonatomic read-modify-write (increment).  */
  dummy = dummy2 = 0;
  start = rdtsc();
  for (i = 0; i < times; i++)
   {
     WORK
     x = x + 1;
   }
  end = rdtsc();
  printf ("nonatomic inc %" PRIu64 " (%" PRIu64 ")\n", (end - start) / times, dummy + dummy2);

  /* Atomic CAS (lock cmpxchg on x86).  */
  dummy = dummy2 = 0;
  start = rdtsc();
  for (i = 0; i < times; i++)
   {
     WORK
     __sync_bool_compare_and_swap (&x, i, i + 1);
   }
  end = rdtsc();
  printf ("CAS %" PRIu64 " (%" PRIu64 ")\n", (end - start) / times, dummy + dummy2);

  /* Atomic increment (lock xadd on x86).  */
  dummy = dummy2 = 0;
  start = rdtsc();
  for (i = 0; i < times; i++)
   {
     WORK
     __sync_fetch_and_add (&x, 1);
   }
  end = rdtsc();
  printf ("INC %" PRIu64 " (%" PRIu64 ")\n", (end - start) / times, dummy + dummy2);
  return 0;
}
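
(For reference: one way to build the attached program, assuming GCC,
is gcc -O2 -pthread bench.c; the file name is arbitrary.  Add -DTHREAD
to also start the busy worker thread.)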
