This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH RFC] Improve 64-bit memcpy performance for Haswell CPUs with AVX instructions


The 0002-memcpy-avx.patch in http://www.yunos.org/tmp/test.memcpy.memset.zip
is our updated version; I will gzip it and send it as an attachment.

Thanks
Ling


2014-06-06 1:33 GMT+08:00, H.J. Lu <hjl.tools@gmail.com>:
> On Tue, May 20, 2014 at 8:17 AM, Ling Ma <ling.ma.program@gmail.com> wrote:
>> 2014-05-16 4:22 GMT+08:00, Ondřej Bílka <neleai@seznam.cz>:
>>> On Fri, May 09, 2014 at 08:40:46PM +0800, Ling Ma wrote:
>>>> If there are still some issues on the latest memcpy and memset, please
>>>> let us know.
>>>>
>>>> Thanks
>>>> Ling
>>>>
>>>> 2014-04-21 12:52 GMT+08:00, ling.ma.program@gmail.com
>>>> <ling.ma.program@gmail.com>:
>>>> > From: Ling Ma <ling.ml@alibaba-inc.com>
>>>> >
>>>> > In this patch we take advantage of HSW memory bandwidth, reduce
>>>> > branch mispredictions by avoiding branch instructions, and
>>>> > force the destination to be aligned using AVX instructions.
>>>> >
>>>> > The CPU2006 403.gcc benchmark indicates this patch improves
>>>> > performance by 6% to 14%.
>>>> >
>>>> > This version only jumps backward for the memmove overlap case.
>>>> > Thanks to Ondra for his comments, and to Yuriy, who gave me a C code hint on it.
>>>
>>> As of now it is slower: gcc compilation time becomes around
>>> 0.12% slower than with the pending sse2 version and indistinguishable from the
>>> current version.
>>>
>>> I used a benchmark that measures the total running time of gcc over five
>>> hours and reports relative time and variance; you can get it here:
>>>
>>> http://kam.mff.cuni.cz/~ondra/memcpy_consistency_benchmark.tar.bz2
>>>
>>> The results I got on Haswell are:
>>>
>>>     memcpy-avx.so       memcpy-sse2.so      memcpy-sse2_v2.so   memcpy_fuse.so      memcpy_rep8.so      nul.so
>>>     100.25% +- 0.04%    100.25% +- 0.04%    100.13% +- 0.07%    100.00% +- 0.04%    100.34% +- 0.13%    100.95% +- 0.07%
>>>
>>> where I tried the fusion and rep strategies, as in memset, which helps.
>>>
>>> I also tried to measure it with my benchmark on different functions; it
>>> claims that the pending sse2 version is best on the gcc+gnuplot load. When I
>>> looked at the graph, it appears to lose on heavy branching until it gets
>>> down to small sizes; see
>>>
>>> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_avx.html
>>> with profiler here
>>> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_avx150514.tar.bz2
>>>
>> Ling: we moved less_16bytes to the code entry, so there is no degradation
>> for small sizes; see the attached code and your profiler:
>> http://www.yunos.org/tmp/memcpy_profile_avx0520.tar.gz
>> Meanwhile we also tested the pending memcpy; it is much better than the
>> original one, but AVX still gives us the best result for large inputs
>> (you can download and run it):
>> www.yunos.org/tmp/test.memcpy.memset.zip
>>
>
> Any updates on this?  Where is the latest AVX2 memcpy patch?
> I didn't see it at
>
> https://sourceware.org/ml/libc-alpha/
>
>
> --
> H.J.
>
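[Editor's note: the overlap rule mentioned in the quoted patch description ("only jumps backward for the memmove overlap case") can be sketched in portable C. This is a hand-written illustration of the copy-direction logic only, not the actual glibc assembly; the function name `copy_like_memmove` is made up for the example, and plain byte copies stand in for the 32-byte AVX loads/stores the patch uses.]

```c
#include <stddef.h>

/* Sketch of memmove's direction choice: copy forward by default, and
 * fall back to a backward copy only when the destination overlaps the
 * tail of the source, so each byte is read before it is overwritten.
 * The real patch does the same with AVX registers and an aligned
 * destination; bytes are used here to keep the control flow visible. */
static void *copy_like_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (d <= s || d >= s + n) {
        /* No harmful overlap: forward copy (the common, fast path). */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        /* dst starts inside [src, src+n): copy backward from the end. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

A forward copy into an overlapping higher address would clobber source bytes before reading them, which is why the backward branch exists at all; keeping it as the only backward jump is what the patch description refers to.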

