This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH RFC] Improve 64bit memcpy performance for Haswell CPU with AVX instruction


On Tue, May 20, 2014 at 8:17 AM, Ling Ma <ling.ma.program@gmail.com> wrote:
> 2014-05-16 4:22 GMT+08:00, Ondřej Bílka <neleai@seznam.cz>:
>> On Fri, May 09, 2014 at 08:40:46PM +0800, Ling Ma wrote:
>>> If there are still some issues on the latest memcpy and memset, please
>>> let us know.
>>>
>>> Thanks
>>> Ling
>>>
>>> 2014-04-21 12:52 GMT+08:00, ling.ma.program@gmail.com
>>> <ling.ma.program@gmail.com>:
>>> > From: Ling Ma <ling.ml@alibaba-inc.com>
>>> >
>>> > In this patch we take advantage of HSW memory bandwidth: we manage to
>>> > reduce branch mispredictions by avoiding branch instructions and force
>>> > the destination to be aligned using AVX instructions.
>>> >
>>> > The CPU2006 403.gcc benchmark indicates this patch improves performance
>>> > by 6% to 14%.
>>> >
>>> > This version only jumps backward for the memmove overlap case.
>>> > Thanks to Ondra for his comments, and to Yuriy, who gave me a C code
>>> > hint on it.
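As an illustration of the technique described in the quoted patch notes above (forcing destination alignment and avoiding per-byte branches), here is a minimal C sketch using AVX intrinsics. It assumes len >= 64 and non-overlapping buffers; the function name and the exact structure are assumptions for illustration, not taken from the posted assembly.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Illustration only: copy len bytes (len >= 64 assumed, buffers must not
   overlap) with an unconditional unaligned 32-byte head store, then round
   the destination up to a 32-byte boundary so the main loop can use aligned
   stores.  The overlapping head and tail stores replace the usual
   byte-by-byte alignment loops and their branches.  */
static void *memcpy_avx_sketch (void *dst, const void *src, size_t len)
{
  char *d = dst;
  const char *s = src;

  /* Copy the first 32 bytes unconditionally (possibly unaligned).  */
  _mm256_storeu_si256 ((__m256i *) d,
                       _mm256_loadu_si256 ((const __m256i *) s));

  /* Advance so that d becomes 32-byte aligned; the bytes between the head
     store and the aligned position are copied twice, which is harmless.  */
  size_t skew = 32 - ((uintptr_t) d & 31);
  d += skew;
  s += skew;
  len -= skew;

  /* Main loop: aligned stores to the destination, unaligned loads.  */
  while (len >= 32)
    {
      _mm256_store_si256 ((__m256i *) d,
                          _mm256_loadu_si256 ((const __m256i *) s));
      d += 32;
      s += 32;
      len -= 32;
    }

  /* Tail: one unaligned 32-byte store ending exactly at the last byte,
     again overlapping already-copied bytes instead of branching.  */
  _mm256_storeu_si256 ((__m256i *) (d + len - 32),
                       _mm256_loadu_si256 ((const __m256i *) (s + len - 32)));
  return dst;
}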
>>
>> As of now it is slower: gcc compilation time becomes around 0.12% slower
>> than with the pending sse2 version and is indistinguishable from the
>> current version.
>>
>> I used a benchmark that measures the total running time of gcc over five
>> hours and reports relative time and variance; you can get it here:
>>
>> http://kam.mff.cuni.cz/~ondra/memcpy_consistency_benchmark.tar.bz2
>>
>> The results I got on Haswell are:
>>
>>   memcpy-avx.so       100.25% +- 0.04%
>>   memcpy-sse2.so      100.25% +- 0.04%
>>   memcpy-sse2_v2.so   100.13% +- 0.07%
>>   memcpy_fuse.so      100.00% +- 0.04%
>>   memcpy_rep8.so      100.34% +- 0.13%
>>   nul.so              100.95% +- 0.07%
>>
>> where I tried the fusion and rep strategies, like in memset, which helps.
>>
>> I also tried to measure it with my benchmark on a different function; it
>> claims that the pending sse2 version is best on the gcc+gnuplot load. When
>> I looked at the graph, it looks like it loses on heavy branching until it
>> gets down to small sizes; see
>>
>> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_avx.html
>> with the profiler here:
>> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_avx150514.tar.bz2
>>
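For context on the "rep" strategy mentioned above (the memcpy_rep8.so variant): on Haswell, large copies can be handed off to the hardware string-move instructions, whose enhanced microcode handles them efficiently. The C sketch below only illustrates that dispatch idea with rep movsb; the threshold value and function name are assumptions, not taken from the benchmarked variants.

#include <stddef.h>

/* Illustration only: dispatch large copies to "rep movsb".  The 2048-byte
   threshold is an assumed tuning value, not the one used in the variants
   benchmarked above.  x86-64 only.  */
#define REP_THRESHOLD 2048

static void *memcpy_rep_sketch (void *dst, const void *src, size_t len)
{
  if (len >= REP_THRESHOLD)
    {
      void *d = dst;
      const void *s = src;
      size_t n = len;
      /* rep movsb copies RCX bytes from [RSI] to [RDI] in one instruction;
         the microcoded implementation picks its own block sizes.  */
      __asm__ volatile ("rep movsb"
                        : "+D" (d), "+S" (s), "+c" (n)
                        :
                        : "memory");
      return dst;
    }
  /* Smaller sizes would fall through to a vector path such as the SSE2 or
     AVX loops discussed in this thread.  */
  return __builtin_memcpy (dst, src, len);
}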
> Ling: we moved less_16bytes to the code entry, so there is no degradation
> for small sizes; the code is attached, along with results from your
> profiler: http://www.yunos.org/tmp/memcpy_profile_avx0520.tar.gz
> Meanwhile we also tested the pending memcpy; it is much better than the
> original one, but avx still gives us the best result for large input (you
> can download and run the test):
> www.yunos.org/tmp/test.memcpy.memset.zip
>
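Regarding the "less_16bytes moved to the code entry" change described in the quoted reply: the general idea is to test for small sizes first, so short copies never pay for alignment and loop setup. The C sketch below illustrates only that entry-first pattern with a pair of possibly overlapping 8-byte moves; the names and structure are assumptions, not a transcription of the attached assembly.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustration only: handle len <= 16 right at the entry point with two
   possibly overlapping 8-byte moves, so small copies skip the vector setup
   entirely.  */
static void *memcpy_small_first_sketch (void *dst, const void *src, size_t len)
{
  if (len <= 16)
    {
      if (len >= 8)
        {
          uint64_t head, tail;
          memcpy (&head, src, 8);                          /* first 8 bytes */
          memcpy (&tail, (const char *) src + len - 8, 8); /* last 8 bytes  */
          memcpy (dst, &head, 8);
          memcpy ((char *) dst + len - 8, &tail, 8);       /* may overlap   */
        }
      else
        {
          /* 0..7 bytes: a byte loop is enough for illustration.  */
          for (size_t i = 0; i < len; i++)
            ((char *) dst)[i] = ((const char *) src)[i];
        }
      return dst;
    }
  /* Larger sizes continue to the vectorized path (see the AVX sketch
     earlier in the thread).  */
  return memcpy (dst, src, len);
}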

Any updates on this?  Where is the latest AVX2 memcpy patch?
I didn't see it at

https://sourceware.org/ml/libc-alpha/


-- 
H.J.

