This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] faster memcpy on x64.


This patch looks good to me.

--
Liubov Dmitrieva
Intel Corporation

>
>
> On Thu, May 9, 2013 at 6:27 PM, Andreas Jaeger <aj@suse.com> wrote:
>>
>> Intel and AMD developers, do you have any feedback on the performance of this patch? Please provide it by Monday the 13th; otherwise, we've waited long enough on this one and I think this can go in now.
>>
>> Ondrej, consider this approved (with the small change below) and commit on the 14th unless somebody vetoes.
>>
>>
>>
>> On 04/28/2013 12:16 AM, Ondřej Bílka wrote:
>>>
>>> Hi,
>>>
>>> I have spent the last few weeks analyzing memcpy and memset, and I
>>> have better implementations than the current ones. This patch covers memcpy.
>>>
>>> Benchmark results are at
>>> http://kam.mff.cuni.cz/~ondra/memcpy_profile.html
>>> or archived at
>>> http://kam.mff.cuni.cz/~ondra/memcpy_profile_result27_04_13.tar.bz2
>>>
>>> I tried adapting this for memmove and found that the additional cost
>>> of the overlap check is close to zero when the regions do not overlap,
>>> so this implementation can also be aliased to memmove.
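>>>
>>> To illustrate why (a minimal C sketch of the dispatch; forward_copy
>>> and backward_copy are hypothetical byte-wise helpers standing in for
>>> the real copy loops, not code from the patch):
>>>
>>>   #include <stddef.h>
>>>   #include <stdint.h>
>>>
>>>   static void *forward_copy (void *dst, const void *src, size_t n)
>>>   {
>>>     char *d = dst;
>>>     const char *s = src;
>>>     while (n--)
>>>       *d++ = *s++;
>>>     return dst;
>>>   }
>>>
>>>   static void *backward_copy (void *dst, const void *src, size_t n)
>>>   {
>>>     char *d = (char *) dst + n;
>>>     const char *s = (const char *) src + n;
>>>     while (n--)
>>>       *--d = *--s;
>>>     return dst;
>>>   }
>>>
>>>   void *my_memmove (void *dst, const void *src, size_t n)
>>>   {
>>>     /* Copying forward is safe unless dst lands inside [src, src+n);
>>>        unsigned wraparound turns the test into a single compare,
>>>        so the non-overlapping fast path pays almost nothing.  */
>>>     if ((uintptr_t) dst - (uintptr_t) src >= n)
>>>       return forward_copy (dst, src, n);
>>>     return backward_copy (dst, src, n);
>>>   }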
>>>
>>> An important part is the test of memcpy in a hooked gcc, which shows
>>> a small but real speedup. memcpy_new 1) is faster on newer
>>> processors, while memcpy_new_small is faster on slower ones.
>>>
>>> Could we test 2) on a wider range of use cases and report results?
>>>
>>> Here we run into the fact that in practice strings are small, so the
>>> dominant cost is the latency of fetching the data.
>>>
>>> The microbenchmark results look much better.
>>>
>>> The main speedup is obtained by avoiding computed loops and
>>> simplifying the control flow for better speculative execution. 1)
>>>
>>> This gives a 20% speedup for 32-1000 byte strings.
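>>>
>>> Concretely, avoiding a computed loop means branching on a size class
>>> and then doing a fixed number of possibly-overlapping unaligned
>>> moves. A rough C sketch of that pattern (my illustration, not code
>>> from the patch):
>>>
>>>   #include <stdint.h>
>>>   #include <string.h>
>>>
>>>   /* Copy 8 <= n <= 16 bytes with exactly two 8-byte moves: the
>>>      first and last 8 bytes of the range, which may overlap.  No
>>>      loop whose trip count depends on n is ever executed, so there
>>>      is only one easily predicted branch per size class.  */
>>>   static void copy_8_to_16 (char *dst, const char *src, size_t n)
>>>   {
>>>     uint64_t head, tail;
>>>     memcpy (&head, src, 8);           /* unaligned 8-byte load  */
>>>     memcpy (&tail, src + n - 8, 8);
>>>     memcpy (dst, &head, 8);           /* unaligned 8-byte store */
>>>     memcpy (dst + n - 8, &tail, 8);
>>>   }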
>>>
>>> Second, the loop I use is on most architectures asymptotically
>>> faster than the gcc one for data in the L1, L2, or L3 cache.
>>> When the data resides in main memory, memory bandwidth is the
>>> bottleneck and the choice of implementation can gain at most 1%.
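>>>
>>> The kind of loop meant here is roughly the following (an
>>> illustrative C/SSE2 sketch; the patch's actual unrolling and
>>> alignment handling differ):
>>>
>>>   #include <stddef.h>
>>>   #include <emmintrin.h>
>>>
>>>   /* Copy 64 bytes per iteration with unaligned 16-byte SSE2
>>>      loads and stores; assumes n is a multiple of 64 (the real
>>>      code handles the remainder separately).  */
>>>   static void copy_loop (char *dst, const char *src, size_t n)
>>>   {
>>>     for (size_t i = 0; i < n; i += 64)
>>>       {
>>>         __m128i a = _mm_loadu_si128 ((const __m128i *) (src + i));
>>>         __m128i b = _mm_loadu_si128 ((const __m128i *) (src + i + 16));
>>>         __m128i c = _mm_loadu_si128 ((const __m128i *) (src + i + 32));
>>>         __m128i d = _mm_loadu_si128 ((const __m128i *) (src + i + 48));
>>>         _mm_storeu_si128 ((__m128i *) (dst + i), a);
>>>         _mm_storeu_si128 ((__m128i *) (dst + i + 16), b);
>>>         _mm_storeu_si128 ((__m128i *) (dst + i + 32), c);
>>>         _mm_storeu_si128 ((__m128i *) (dst + i + 48), d);
>>>       }
>>>   }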
>>>
>>> I tested an AVX version, which is slower on current processors
>>> because it is faster to load the high and low halves separately.
>>>
>>> 1) Except on core2 and athlon, where I need even simpler control
>>> flow (memcpy_new_small) to get that speedup.
>>>
>>> I attached the file from which I generated this patch. There are a
>>> few mistakes made by gcc; I could post a diff against the vanilla
>>> version.
>>>
>>> I have not tried optimizing for atom yet, so I keep the ifunc
>>> selection for it.
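>>>
>>> (For reference, ifunc dispatch resolves the implementation once at
>>> relocation time. A minimal sketch using the GNU ifunc attribute,
>>> with hypothetical names; the real selector in memcpy.S is
>>> hand-written assembly, and glibc uses its own CPU-feature checks
>>> rather than __builtin_cpu_supports:)
>>>
>>>   #include <stddef.h>
>>>
>>>   /* Candidate implementations, defined elsewhere.  */
>>>   void *memcpy_sse2_unaligned (void *, const void *, size_t);
>>>   void *memcpy_generic (void *, const void *, size_t);
>>>
>>>   /* Resolver: runs once and returns the chosen implementation.  */
>>>   static void *(*resolve_memcpy (void)) (void *, const void *, size_t)
>>>   {
>>>     if (__builtin_cpu_supports ("sse2"))
>>>       return memcpy_sse2_unaligned;
>>>     return memcpy_generic;
>>>   }
>>>
>>>   void *my_memcpy (void *, const void *, size_t)
>>>     __attribute__ ((ifunc ("resolve_memcpy")));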
>>>
>>> Passes testsuite. OK for 2.18?
>>>
>>> Ondra
>>>
>>> 1) File variant/memcpy_new_small.s in 2)
>>> 2) http://kam.mff.cuni.cz/~ondra/memcpy_profile27_04_13.tar.bz2
>>>
>>>
>>>         * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: New file.
>>>         * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Add
>>>         __memcpy_sse2_unaligned ifunc selection.
>>>         * sysdeps/x86_64/multiarch/Makefile (sysdep_routines):
>>>         Add memcpy-sse2-unaligned.S.
>>>         sysdeps/x86_64/multiarch/ifunc-impl-list.c: __memcpy_sse2_unaligned.
>>
>>
>> The last entry should be:
>>         * sysdeps/x86_64/multiarch/ifunc-impl-list.c
>>         (__libc_ifunc_impl_list): Add: __memcpy_sse2_unaligned.
>>
>> thanks,
>> Andreas
>> --
>>  Andreas Jaeger aj@{suse.com,opensuse.org} Twitter/Identica: jaegerandi
>>   SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
>>    GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 16746 (AG Nürnberg)
>>     GPG fingerprint = 93A3 365E CE47 B889 DF7F  FED1 389A 563C C272 A126
>
>

