This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC] Imporve 64bit memcpy performance for Haswell CPU with AVX instruction
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Ling Ma <ling dot ma dot program at gmail dot com>
- Cc: Ondřej Bílka <neleai at seznam dot cz>, GNU C Library <libc-alpha at sourceware dot org>, Liubov Dmitrieva <liubov dot dmitrieva at gmail dot com>, yumkam at gmail dot com, Ling Ma <ling dot ml at alibaba-inc dot com>
- Date: Thu, 5 Jun 2014 10:33:23 -0700
- Subject: Re: [PATCH RFC] Imporve 64bit memcpy performance for Haswell CPU with AVX instruction
- Authentication-results: sourceware.org; auth=none
- References: <1398055946-4493-1-git-send-email-ling dot ma at alipay dot com> <CAOGi=dOQEbbkkzQGz-ZtQ0-WEHj2=hjmbstZXvZyLqycVy18Kg at mail dot gmail dot com> <20140515202213 dot GA20667 at domone dot podge> <CAOGi=dNbyxj+7gjwcpAVBxYB-MH9E7s=xi2nKwYXkDViasOZrA at mail dot gmail dot com>
On Tue, May 20, 2014 at 8:17 AM, Ling Ma <ling.ma.program@gmail.com> wrote:
> 2014-05-16 4:22 GMT+08:00, Ondřej Bílka <neleai@seznam.cz>:
>> On Fri, May 09, 2014 at 08:40:46PM +0800, Ling Ma wrote:
>>> If there are still some issues on the latest memcpy and memset, please
>>> let us know.
>>>
>>> Thanks
>>> Ling
>>>
>>> 2014-04-21 12:52 GMT+08:00, ling.ma.program@gmail.com
>>> <ling.ma.program@gmail.com>:
>>> > From: Ling Ma <ling.ml@alibaba-inc.com>
>>> >
>>> > In this patch we take advantage of HSW memory bandwidth, reduce
>>> > branch mispredictions by avoiding branch instructions, and force
>>> > the destination to be aligned for the AVX instructions.
>>> >
>>> > The CPU2006 403.gcc benchmark indicates this patch improves performance
>>> > by 6% to 14%.
>>> >
>>> > This version only jumps backward for the memmove overlap case.
>>> > Thanks to Ondra for his comments and to Yuriy for the C code hint.
>>
>> As it stands, gcc compilation time with it is around 0.12% slower
>> than with the pending sse2 version and indistinguishable from the
>> current version.
>>
>> I used a benchmark that measures the total running time of gcc for five
>> hours and reports relative time and variance; you can get it here
>>
>> http://kam.mff.cuni.cz/~ondra/memcpy_consistency_benchmark.tar.bz2
>>
>> The results I got on Haswell are
>>
>> memcpy-avx.so      100.25% +- 0.04%
>> memcpy-sse2.so     100.25% +- 0.04%
>> memcpy-sse2_v2.so  100.13% +- 0.07%
>> memcpy_fuse.so     100.00% +- 0.04%
>> memcpy_rep8.so     100.34% +- 0.13%
>> nul.so             100.95% +- 0.07%
>>
>> where I also tried the fusion and rep strategies, as in memset, which helps.
>>
>> I also tried measuring it with my benchmark on a different function; it
>> claims that the pending sse2 version is best on the gcc+gnuplot load.
>> Looking at the graph, it seems to lose on too much branching until it
>> gets to small sizes; see
>>
>> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_avx.html
>> with profiler here
>> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_avx150514.tar.bz2
>>
> Ling: we moved less_16bytes to the code entry, so there is no degradation
> for small sizes; attached are the code and your
> profiler: http://www.yunos.org/tmp/memcpy_profile_avx0520.tar.gz
> Meanwhile we also tested the pending memcpy; it is much better than the
> original one, but avx still gives us the best result for large inputs
> (it can be downloaded and run):
> www.yunos.org/tmp/test.memcpy.memset.zip
>
Any updates on this? Where is the latest AVX2 memcpy patch?
I didn't see it at
https://sourceware.org/ml/libc-alpha/
--
H.J.
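
For readers following the thread, the overlap handling mentioned in the
patch description (jumping backward only for the memmove overlap case) can
be sketched in plain C roughly as below. This is a minimal illustration
under stated assumptions, not code from the patch: the name sketch_memmove
is hypothetical, the byte loops stand in for the AVX loads and stores, and
the single unsigned comparison is one common way to express the overlap
test rather than necessarily the one the patch uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the patch: copy n bytes and walk
   backward only when a forward copy would overwrite source bytes that
   have not been read yet. */
static void *sketch_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Unsigned wraparound folds "dst >= src && dst < src + n" into one
       comparison: the difference is below n only when dst starts inside
       [src, src + n). */
    if ((uintptr_t)d - (uintptr_t)s < n) {
        /* dst overlaps the source from above: copy from the end. */
        while (n--)
            d[n] = s[n];
    } else {
        /* Disjoint buffers, or dst below src: a forward copy is safe. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dst;
}
```

With this shape the common non-overlapping case takes one well-predicted
comparison and falls straight into the forward copy; only a destination
that starts inside the source range pays for the backward path.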