This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] faster memcpy on x64.
- From: Dmitrieva Liubov <liubov dot dmitrieva at gmail dot com>
- To: Andreas Jaeger <aj at suse dot com>
- Cc: libc-alpha at sourceware dot org, "H.J. Lu" <hjl dot tools at gmail dot com>
- Date: Tue, 14 May 2013 20:47:21 +0400
- Subject: Re: [PATCH] faster memcpy on x64.
- References: <20130427221620 dot GA16537 at domone dot kolej dot mff dot cuni dot cz> <518BB251 dot 7040602 at suse dot com> <CAHjhQ93bxAexzbymP6GN-08wLiu9mdxf2MoCXgqA1v-ONYJdFw at mail dot gmail dot com>
This patch looks good to me.
--
Liubov Dmitrieva
Intel Corporation
>
>
> On Thu, May 9, 2013 at 6:27 PM, Andreas Jaeger <aj@suse.com> wrote:
>>
>> Intel and AMD developers, do you have any feedback on the performance of this patch? Please provide it by Monday the 13th - otherwise we've waited long enough on this one and I think this can go in now.
>>
>> Ondrej, consider this approved (with the small change below) and commit on the 14th unless somebody vetoes.
>>
>>
>>
>> On 04/28/2013 12:16 AM, Ondřej Bílka wrote:
>>>
>>> Hi,
>>>
>>> I was occupied for the last few weeks analyzing memcpy and memset, and I
>>> have better implementations than the current ones. This patch covers memcpy.
>>>
>>> Benchmark results are at
>>> http://kam.mff.cuni.cz/~ondra/memcpy_profile.html
>>> or archived at
>>> http://kam.mff.cuni.cz/~ondra/memcpy_profile_result27_04_13.tar.bz2
>>>
>>> I tried to modify this for memmove and found that the additional
>>> cost is close to zero when the buffers do not overlap.
>>> So this implementation can be aliased to memmove.
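[Editorial note: a minimal C sketch of the overlap test that makes this aliasing safe. This is an illustration only, not glibc's actual code; `my_memmove` is a hypothetical name, and the real patch implements the copies in x86-64 assembly.]

```c
#include <stddef.h>
#include <stdint.h>

/* A forward-copying memcpy loop is also a correct memmove whenever
   the destination does not start inside the source range.  A single
   unsigned subtraction covers both the "dst below src" and the
   "no overlap" cases at once, so the common path pays almost nothing. */
static void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((uintptr_t)d - (uintptr_t)s >= n) {
        /* dst is below src, or the ranges do not overlap:
           a plain ascending copy is correct. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        /* dst starts inside src: copy backwards to avoid
           clobbering bytes before they are read. */
        for (size_t i = n; i-- > 0; )
            d[i] = s[i];
    }
    return dst;
}
```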
>>>
>>> The important part there is a test of memcpy in a hooked gcc, which shows
>>> a small but real speedup. memcpy_new 1) is faster on newer processors,
>>> while memcpy_new_small is faster on slower ones.
>>>
>>> Could we test 2) on a wider range of use cases and report results?
>>>
>>> Here we run into the fact that strings in practice are small, so we
>>> mostly pay the latency of getting the data.
>>>
>>> The microbenchmarks tests look much better.
>>>
>>> The main speedup is obtained by avoiding computed loops and simplifying
>>> control flow for better speculative execution. 1)
>>>
>>> This gives 20% speedup for 32-1000 byte strings.
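[Editorial note: a hedged C illustration of the "no computed loop" idea for small sizes. The function name and size class are hypothetical; the actual patch does this in assembly with SSE registers and more size classes.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* For 8 <= n <= 16, two possibly-overlapping 8-byte unaligned
   loads and stores cover the whole range with straight-line code:
   no per-byte loop, no loop counter computed from n.  Both loads
   happen before either store, so the overlap in the middle is
   harmless for memcpy's non-overlapping buffers. */
static void copy_8_to_16(void *dst, const void *src, size_t n)
{
    uint64_t head, tail;
    memcpy(&head, src, 8);                        /* first 8 bytes */
    memcpy(&tail, (const char *)src + n - 8, 8);  /* last 8 bytes  */
    memcpy(dst, &head, 8);
    memcpy((char *)dst + n - 8, &tail, 8);
}
```

The same pattern scales to other size classes (16-32 bytes with two 16-byte SSE loads, and so on), turning the copy into a short branch tree the predictor handles well.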
>>>
>>> Second, the loop that I use is on most architectures asymptotically
>>> faster than the gcc one for data in the L1, L2, and L3 caches.
>>> When the data is in main memory, memory is the bottleneck and the choice
>>> of implementation can gain at most 1%.
>>>
>>> I tested an avx version, which is slower on current processors because
>>> it is faster to load the high and low halves separately.
>>>
>>> 1) Except core2 and athlon, where I need an even simpler control flow
>>> (memcpy_new_small) to get that speedup.
>>>
>>> I attached the file from which I generated this patch. There are a few
>>> mistakes made by gcc; I could post a diff against the vanilla version.
>>>
>>> I have not tried to optimize for atom yet, so I keep the ifunc for it.
>>>
>>> Passes testsuite. OK for 2.18?
>>>
>>> Ondra
>>>
>>> 1) File variant/memcpy_new_small.s in 2)
>>> 2) http://kam.mff.cuni.cz/~ondra/memcpy_profile27_04_13.tar.bz2
>>>
>>>
>>> * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: New file.
>>> * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Add
>>> __memcpy_sse2_unaligned ifunc selection.
>>> * sysdeps/x86_64/multiarch/Makefile (sysdep_routines):
>>> Add memcpy-sse2-unaligned.S.
>>> sysdeps/x86_64/multiarch/ifunc-impl-list.c: __memcpy_sse2_unaligned.
>>
>>
>> The last entry should be:
>> * sysdeps/x86_64/multiarch/ifunc-impl-list.c
>> (__libc_ifunc_impl_list): Add: __memcpy_sse2_unaligned.
>>
>> thanks,
>> Andreas
>> --
>> Andreas Jaeger aj@{suse.com,opensuse.org} Twitter/Identica: jaegerandi
>> SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
>> GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB16746 (AG Nürnberg)
>> GPG fingerprint = 93A3 365E CE47 B889 DF7F FED1 389A 563C C272 A126
>
>