Re: The direction of malloc?


On Tue, 2013-12-17 at 20:08 +0100, Ondřej Bílka wrote:
> On Tue, Dec 17, 2013 at 01:59:57PM +0100, Torvald Riegel wrote:
> > On Mon, 2013-12-16 at 22:23 +0100, Ondřej Bílka wrote:
> > > Please explain how spinning could improve performance in single-threaded
> > > applications.
> > 
> > You spoke about lockless code, so obviously concurrent code.  My comment
> > was thus referring to concurrent code.  If you have a single-threaded
> > program, then you can avoid synchronization, obviously (ignoring
> > synchronization just for reentrancy...).
> >
> And for malloc we could use a switch variable to avoid the lock path, setting
> it when pthread_create is called? For reentrancy an ordinary variable suffices.

Depending on the algorithm, even for reentrancy you might need atomic
operations (e.g., to keep under control what the compiler does with the
code, or using a CAS to avoid pending stores).
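
For illustration, a minimal sketch of such a switch (the names here are
made up, not glibc's, and glibc's real single-thread detection works
differently).  The relaxed atomic load costs no more than a plain load
on x86, but keeps the compiler from caching the flag across calls:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag; it only ever transitions false -> true.  */
static atomic_bool multithreaded;

/* A pthread_create wrapper would call this before the new thread
   can run.  */
void
note_thread_created (void)
{
  atomic_store_explicit (&multithreaded, true, memory_order_relaxed);
}

void *lock_free_alloc (size_t size);  /* hypothetical: no locks/atomics */
void *locked_alloc (size_t size);     /* hypothetical: the locked path */

void *
my_malloc (size_t size)
{
  if (!atomic_load_explicit (&multithreaded, memory_order_relaxed))
    return lock_free_alloc (size);    /* single-threaded fast path */
  return locked_alloc (size);
}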

> > > > 
> > > > > Second problem is that fastbins are per-arena not per-thread, which
> > > > > forces us to use atomic operations. These are expensive (typically more than 50 cycles).
> > > > 
> > > > Especially on x86, atomic operations that *hit in the cache* have become
> > > > very fast compared to their costs in the past.  I don't have current
> > > > numbers, but I believe the 50 cycle number is too high; I vaguely
> > > > remember 10-20.
> > > 
> > > A simple benchmark could check the real cost. A problem is that while for Core 2
> > > and i7 the cost of a CAS is around 20 cycles, for Bulldozer it is still 50 cycles.
> > 
> > Sure, this differs per architecture.  But recent Intel CPUs *are*
> > common, so it's not quite correct to say that the latency (or
> > throughput, depending on the algorithm) of an atomic RMW/CAS op is
> > really the primary problem.
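
For what it's worth, a quick way to put a number on this for a given
machine is to time a cache-hot, uncontended CAS loop; here is a rough
sketch (note that RDTSC counts reference cycles, which can differ from
core cycles under frequency scaling):

#include <stdio.h>
#include <x86intrin.h>  /* __rdtsc; GCC/Clang on x86 */

int
main (void)
{
  unsigned long v = 0;
  const unsigned long iters = 100000000;

  unsigned long long start = __rdtsc ();
  for (unsigned long i = 0; i < iters; i++)
    {
      unsigned long expected = v;
      /* The cache line stays in L1 and is never contended, so this
         measures the best case, not the cost under sharing.  */
      __atomic_compare_exchange_n (&v, &expected, expected + 1, 0,
                                   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }
  unsigned long long cycles = __rdtsc () - start;

  printf ("%.1f cycles per uncontended CAS\n", (double) cycles / iters);
  return 0;
}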
> > 
> > > Even with these, a malloc+free pair contains 5 atomic instructions on the
> > > fastbin path, which gives a 100 cycle penalty (malloc: lock, get item from bin, unlock;
> > > free: set bit that fastbins are in use, put item into bin).
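
To make that accounting concrete: on a lock-free fastbin-style path the
atomics come from CAS loops on a singly linked LIFO list, roughly as in
the sketch below.  This is not glibc's actual code, and a naive pop
like this one is exposed to the ABA problem, which real code has to
handle:

#include <stdatomic.h>
#include <stddef.h>

struct chunk
{
  struct chunk *next;
};

static _Atomic (struct chunk *) bin;

/* The atomic on the free side.  */
void
bin_push (struct chunk *c)
{
  struct chunk *head = atomic_load_explicit (&bin, memory_order_relaxed);
  do
    c->next = head;
  while (!atomic_compare_exchange_weak (&bin, &head, c));
}

/* The atomic(s) on the malloc side.  */
struct chunk *
bin_pop (void)
{
  struct chunk *head = atomic_load_explicit (&bin, memory_order_acquire);
  while (head != NULL
         && !atomic_compare_exchange_weak (&bin, &head, head->next))
    ;  /* A failed CAS refreshes head; retry.  */
  return head;
}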
> > 
> > Maybe, but you still need to be careful when drawing conclusions from
> > that because there are more performance effects in allocation than just
> > the "cycle penalty" you mention (e.g., locality and other memory-system
> > effects caused by differences in where data is placed).
> >
> It's not just that: pretty much every modern allocator uses some
> per-thread cache, so it looks like a good idea, and oprofile shows the lock
> cmpxchg instruction as one of the likely culprits. When I try the per-thread
> cache that I posted earlier on a test that does just malloc and free, it
> nearly doubles performance.
> 
> There are several factors that come into play: the first is the lack of
> locking, the second is getting the branch expectations correct, and the third
> is saving a call to _int_malloc. Then there are several cycles saved by
> omitting the tests for hooks and malloc_perturb.
> 
> A real implementation will be a bit faster, as dynamic TLS slows this down a
> bit.
> 
> Memory-system effects are not a factor here, as the allocation pattern is
> identical (stack-like in both cases).
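
For reference, the per-thread cache idea in miniature (size_to_bin and
slow_path_malloc are hypothetical helpers; a real version needs size
classes, bounds checks, and a flush path back to the arena):

#include <stddef.h>

#define CACHE_BINS 8
#define BIN_LIMIT 16  /* assumed cap per bin */

struct tc_entry
{
  struct tc_entry *next;
};

/* One cache per thread: plain loads and stores, no locks or atomics.
   Initial-exec TLS would avoid the dynamic-TLS cost mentioned above.  */
static __thread struct tc_entry *tcache[CACHE_BINS];
static __thread unsigned tcache_count[CACHE_BINS];

size_t size_to_bin (size_t size);      /* hypothetical size-class map */
void *slow_path_malloc (size_t size);  /* hypothetical arena fallback */

void *
fast_malloc (size_t size)
{
  size_t bin = size_to_bin (size);
  struct tc_entry *e = tcache[bin];
  if (e != NULL)
    {
      tcache[bin] = e->next;
      tcache_count[bin]--;
      return e;
    }
  return slow_path_malloc (size);
}

/* Returns 0 when the caller must take the arena path instead.  */
int
fast_free (void *p, size_t size)
{
  size_t bin = size_to_bin (size);
  if (tcache_count[bin] >= BIN_LIMIT)
    return 0;
  struct tc_entry *e = p;
  e->next = tcache[bin];
  tcache[bin] = e;
  tcache_count[bin]++;
  return 1;
}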

I was referring to memory-system effects when the data is actually
accessed by the application.  It does matter which page allocated data
ends up on (and where on a page relative to other allocations).  The
speed of allocation is just one factor.  For example, just to
illustrate: if realloc copies to a different arena, this can slow down
programs due to NUMA effects if those programs expect the allocation to
likely remain on the same page after a realloc; such effects can be
much more costly than a somewhat slower allocation.
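
The pattern I have in mind looks roughly like this hypothetical sketch:
a program that grows a buffer incrementally and keeps scanning it is
fine while realloc extends in place, but pays on every later access if
the block migrates to memory homed on a remote NUMA node:

#include <stdlib.h>
#include <string.h>

void
grow_and_scan (size_t passes)
{
  size_t len = 1 << 20;
  char *buf = malloc (len);
  if (buf == NULL)
    return;

  for (size_t p = 0; p < passes; p++)
    {
      /* If this copy lands in a different arena on another node, the
         memset below pays remote-access costs on every pass, which
         can dwarf the cost of the allocation itself.  */
      char *nbuf = realloc (buf, len + 4096);
      if (nbuf == NULL)
        {
          free (buf);
          return;
        }
      buf = nbuf;
      len += 4096;
      memset (buf, (int) p, len);
    }
  free (buf);
}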

