This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] faster memcpy on x64.


Intel and AMD developers, do you have any feedback on the performance of this patch? Please provide it by Monday the 13th; otherwise, we have waited long enough on this one, and I think it can go in now.

Ondrej, consider this approved (with the small change below) and commit on the 14th unless somebody vetoes.


On 04/28/2013 12:16 AM, Ondřej Bílka wrote:
Hi,

I have spent the last few weeks analyzing memcpy and memset, and I
have implementations that are better than the current ones. This patch covers memcpy.

Benchmark results are at
http://kam.mff.cuni.cz/~ondra/memcpy_profile.html
or archived at
http://kam.mff.cuni.cz/~ondra/memcpy_profile_result27_04_13.tar.bz2

I tried to modify this for memmove and found that the additional
cost is close to zero when the buffers do not overlap,
so this implementation can be aliased to memmove.
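
To see why, here is a minimal C sketch of the idea (the helper name and
the 8-16 byte size class are illustrative, not code from the patch):
when every load happens before any store, the copy is overlap-safe by
construction.

#include <stdint.h>
#include <string.h>

/* Copy 8 <= n <= 16 bytes: load head and tail first, store after.
   Because both loads complete before either store, the copy is
   correct even when dst and src overlap (memmove semantics).  */
static void
copy_8_to_16 (char *dst, const char *src, size_t n)
{
  uint64_t head, tail;
  memcpy (&head, src, 8);             /* first 8 bytes */
  memcpy (&tail, src + n - 8, 8);     /* last 8 bytes, may overlap head */
  memcpy (dst, &head, 8);
  memcpy (dst + n - 8, &tail, 8);
}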

An important part there is the test of memcpy in a hooked gcc, which
shows a small but real speedup. memcpy_new 1) is faster on newer
processors, while memcpy_new_small is faster on slower ones.

Could we test 2) on a wider range of use cases and report the results?
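
For workloads that cannot be rebuilt the way the hooked gcc was, one
possible way to gather such data is an LD_PRELOAD interposer that
buckets memcpy call sizes. The shim below is only a sketch of that
idea, not part of the patch, and interposing memcpy via dlsym has
known re-entrancy pitfalls:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stddef.h>

typedef void *(*memcpy_fn) (void *, const void *, size_t);

static memcpy_fn real_memcpy;
static unsigned long hist[64];

/* Hypothetical LD_PRELOAD shim: histogram memcpy sizes by log2 bucket.
   Caveat: dlsym itself may call memcpy, so a robust interposer needs a
   bootstrap fallback; omitted here for brevity.  */
void *
memcpy (void *dst, const void *src, size_t n)
{
  if (!real_memcpy)
    real_memcpy = (memcpy_fn) dlsym (RTLD_NEXT, "memcpy");
  unsigned bucket = 0;
  for (size_t m = n; m > 1; m >>= 1)
    bucket++;
  __atomic_fetch_add (&hist[bucket], 1, __ATOMIC_RELAXED);
  return real_memcpy (dst, src, n);
}

__attribute__ ((destructor)) static void
report (void)
{
  for (unsigned i = 0; i < 64; i++)
    if (hist[i])
      fprintf (stderr, "size bucket 2^%u: %lu calls\n", i, hist[i]);
}

Something like "gcc -shared -fPIC shim.c -o shim.so -ldl", then
"LD_PRELOAD=./shim.so ./program", would print a rough size histogram
on stderr at exit.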

Here we run into the fact that, in practice, strings are small, so the
dominant cost is the latency of fetching the data.

The microbenchmark results look much better.

The main speedup is obtained by avoiding computed loops and simplifying
the control flow for better speculative execution. 1)

This gives a 20% speedup for 32-1000 byte strings.
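
To illustrate what avoiding a computed loop means (a sketch under my
own naming, not the patch's code): instead of computing a trip count
and looping, branch once on the size class and issue a fixed pattern
of possibly overlapping wide moves, so the control flow is short and
easy to predict:

#include <string.h>

/* Sketch: size-class dispatch with overlapping fixed-width copies
   instead of a loop whose trip count depends on n.  Assumes n <= 64.  */
static void
copy_up_to_64 (char *dst, const char *src, size_t n)
{
  if (n >= 32)
    {                        /* 32..64: two overlapping 32-byte moves */
      memcpy (dst, src, 32);
      memcpy (dst + n - 32, src + n - 32, 32);
    }
  else if (n >= 16)
    {                        /* 16..31 */
      memcpy (dst, src, 16);
      memcpy (dst + n - 16, src + n - 16, 16);
    }
  else if (n >= 8)
    {                        /* 8..15 */
      memcpy (dst, src, 8);
      memcpy (dst + n - 8, src + n - 8, 8);
    }
  else
    while (n--)              /* 0..7: short byte tail */
      dst[n] = src[n];
}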

Second, the loop that I use is on most architectures asymptotically
faster than the gcc one for data in the L1, L2, or L3 cache.
When the data is in main memory, memory becomes the bottleneck and the
choice of implementation can make at most about a 1% difference.
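
The in-cache main loop has roughly this shape (a simplified C sketch;
the real loop is the assembly in memcpy-sse2-unaligned.S): 64 bytes
move per iteration through four unaligned 16-byte SSE2 accesses,
amortizing the loop overhead:

#include <emmintrin.h>   /* SSE2 */

/* Sketch: main loop moving 64 bytes per iteration with unaligned
   16-byte SSE2 loads/stores; remainder handling omitted.  */
static void
copy_loop_64 (char *dst, const char *src, size_t n)
{
  while (n >= 64)
    {
      __m128i a = _mm_loadu_si128 ((const __m128i *) (src + 0));
      __m128i b = _mm_loadu_si128 ((const __m128i *) (src + 16));
      __m128i c = _mm_loadu_si128 ((const __m128i *) (src + 32));
      __m128i d = _mm_loadu_si128 ((const __m128i *) (src + 48));
      _mm_storeu_si128 ((__m128i *) (dst + 0), a);
      _mm_storeu_si128 ((__m128i *) (dst + 16), b);
      _mm_storeu_si128 ((__m128i *) (dst + 32), c);
      _mm_storeu_si128 ((__m128i *) (dst + 48), d);
      src += 64; dst += 64; n -= 64;
    }
}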

I tested an AVX version, which is slower on current processors due to
the fact that it is faster to load the high and low halves separately.
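
The comparison is between the two forms sketched below (intrinsics
shown only for illustration; the patch itself is assembly). On the AVX
hardware of that time, a 32-byte unaligned access could reportedly be
split internally and stall on cache-line crossings, so two 16-byte
halves often won:

#include <immintrin.h>   /* AVX */

/* One 32-byte copy step, two ways.  */
static void
copy32_avx (char *dst, const char *src)
{
  __m256i v = _mm256_loadu_si256 ((const __m256i *) src);
  _mm256_storeu_si256 ((__m256i *) dst, v);
}

static void
copy32_two_halves (char *dst, const char *src)
{
  __m128i lo = _mm_loadu_si128 ((const __m128i *) src);
  __m128i hi = _mm_loadu_si128 ((const __m128i *) (src + 16));
  _mm_storeu_si128 ((__m128i *) dst, lo);
  _mm_storeu_si128 ((__m128i *) (dst + 16), hi);
}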

1) Except on Core 2 and Athlon, where I need even simpler control flow
(memcpy_new_small) to get that speedup.

I attached the file from which I generated this patch. There are a few
mistakes made by gcc; I could post a diff against the vanilla version.

I have not tried to optimize for Atom yet, so I keep the ifunc for it.

Passes the testsuite. OK for 2.18?

Ondra

1) File variant/memcpy_new_small.s in 2)
2) http://kam.mff.cuni.cz/~ondra/memcpy_profile27_04_13.tar.bz2


	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: New file.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Add
	__memcpy_sse2_unaligned ifunc selection.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines):
	Add memcpy-sse2-unaligned.S.
	sysdeps/x86_64/multiarch/ifunc-impl-list.c: __memcpy_sse2_unaligned.

The last entry should be:
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add: __memcpy_sse2_unaligned.
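
For readers unfamiliar with the mechanism these entries refer to: the
multiarch memcpy binds to one implementation at load time through an
IFUNC resolver, and ifunc-impl-list.c registers each candidate so the
testsuite can exercise all of them. A rough C-level sketch of the idea
follows; glibc's real selector is assembly and uses its own CPU-feature
flags rather than __builtin_cpu_supports, and the SSSE3 preference
shown is purely illustrative:

#include <stddef.h>

/* Candidate implementations (names from the patch; prototypes assumed).  */
extern void *__memcpy_sse2_unaligned (void *, const void *, size_t);
extern void *__memcpy_ssse3 (void *, const void *, size_t);

/* Load-time resolver: returns the implementation memcpy will bind to.  */
static void *(*resolve_memcpy (void)) (void *, const void *, size_t)
{
  __builtin_cpu_init ();   /* resolvers run before normal constructors */
  if (__builtin_cpu_supports ("ssse3"))
    return __memcpy_ssse3;
  return __memcpy_sse2_unaligned;
}

void *memcpy (void *, const void *, size_t)
  __attribute__ ((ifunc ("resolve_memcpy")));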

thanks,
Andreas
--
 Andreas Jaeger aj@{suse.com,opensuse.org} Twitter/Identica: jaegerandi
  SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
   GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 16746 (AG Nürnberg)
    GPG fingerprint = 93A3 365E CE47 B889 DF7F  FED1 389A 563C C272 A126

