This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Ling Ma <ling dot ma dot program at gmail dot com>
- Cc: Andreas Jaeger <aj at suse dot com>, Nix <nix at esperi dot org dot uk>, libc-alpha at sourceware dot org, hongjiu dot lu at intel dot com, ling dot ml at alibaba-inc dot com
- Date: Fri, 14 Jun 2013 14:06:34 +0200
- Subject: Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction
- References: <CAOGi=dPs9geCtrWhU1L_0DEfOWOknpzFSLmYs4gbYzGX8Zn5Hg at mail dot gmail dot com> <20130607104613 dot GA6343 at domone dot kolej dot mff dot cuni dot cz> <8761xqru5w dot fsf at spindle dot srvr dot nix> <CAOGi=dMV5jaS2597cksd0mW84UDd06SovsBkL5=WPez-jZWg4g at mail dot gmail dot com> <20130607160749 dot GA28961 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dP2s4k2rg8TdKwj6V9-VzbOORGzeBmh-G=Fr1eM_OyDoA at mail dot gmail dot com> <20130607184550 dot GA9683 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dN2TMG8wO97Wd1qFBUOZ7LrjTO21qP-fCPty6Mp3aOcHw at mail dot gmail dot com> <51B576CC dot 5000001 at suse dot com> <CAOGi=dMxXLsYO1=m29BkMJTFjFZ-UuyJq6s-kgWhVWJD4waFng at mail dot gmail dot com>
On Mon, Jun 10, 2013 at 09:28:30PM +0800, Ling Ma wrote:
> CPU2006 benchmark is very hard to improve so that the above 5%
> improvement for single core may become the goal of next generation
> CPU, and the improvement number is much less for benchmark specjbb. We
> hardly accept above 1% improvement of those industry benchmarks only
> for optimized memcpy_avx2 even though it is the fastest.
>
memcpy_avx2 is not the fastest; it is 33% slower, and we already know
that. I wrote it here:
> On Thu, Jun 06, 2013 at 02:55:12PM +0200, Ondřej Bílka wrote:
> > These results show that your patch is 35% slower for gcc see following
> > line.
> >
> > Time ratio to fastest:
> > memcpy_glibc: 134.517062% memcpy_new_small: 100.000000% memcpy_new:
> > 101.120206% __memcpy_avx2: 136.926079%
Also, from the 403.gcc page:
"
Benchmark Description
403.gcc is based on gcc Version 3.2. It generates code for an AMD
Opteron processor. The benchmark runs as a compiler with many of its
optimization flags enabled.
"
It is not clear how this measures memcpy. I do not know what
percentage of the SPEC test's time is spent in memcpy, but based on my
experience it is at most a few percent. The noise from the rest of the
code is much larger than that.
You can test this by running the benchmark 100 times and then applying
https://en.wikipedia.org/wiki/Student%27s_t-test
to see whether you get statistically significant results, which I doubt.
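
As a minimal sketch of such a test, the following Python computes a
pooled-variance Student's t statistic for the six 403.gcc scores quoted
further down in this thread. The function name student_t and the
hardcoded critical value 2.776 (two-sided, 5% level, 4 degrees of
freedom, taken from a standard t table) are assumptions made for
illustration; with only 3 runs per variant this shows the method and
is not a real evaluation.

```python
import math
import statistics


def student_t(a, b):
    """Return (t, df) for a two-sample, pooled-variance Student's t-test.

    Hypothetical helper for illustration; assumes both samples come from
    populations with equal variance.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    # statistics.variance() is the sample variance (divides by n - 1).
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (statistics.mean(b) - statistics.mean(a)) / std_err
    return t, df


# 403.gcc scores as posted in this thread (bigger is better).
memcpy_new = [67.63718, 66.899156, 66.982456]
memcpy_avx2 = [66.805236, 67.29362, 67.63718]

# Two-sided 5% critical value for df = 4 (assumption: standard t table).
T_CRIT_DF4 = 2.776

t, df = student_t(memcpy_new, memcpy_avx2)
significant = abs(t) > T_CRIT_DF4
print(f"t = {t:.3f}, df = {df}, significant at 5%: {significant}")
```

For these six numbers |t| stays well below the critical value, so the
difference between the two means cannot be told apart from noise. With
100 runs per variant the same function applies, but the critical value
has to be looked up for df = 198.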
> We presented the results for two reasons:
> 1) The Haswell CPU is fully capable of handling the indirect jump
> instruction in memcpy_avx2 in real-world scenarios.
> 2) If we continue to run the benchmark more times, we will find
> which is better. For example, we can test memcpy_avx2 and memcpy_new
> more than 3 times each; if one of them gives better results more
> often, then even though the difference is very small, the stable
> results can give us the right answer.
>
> Thanks
> Ling
>
>
>
>
> 2013/6/10, Andreas Jaeger <aj@suse.com>:
> > On 06/10/2013 08:17 AM, Ling Ma wrote:
> >> Last week, we separated 403.gcc from the cpu2006 benchmark and compiled
> >> it with the additional option -mstringop-strategy=libcall to avoid
> >> rep_4byte, rep_8byte, and rep_byte, which use rep movs instructions.
> >> 403.gcc has plenty of branch instructions and is very sensitive to the
> >> branch misprediction rate. Currently we are concerned about whether
> >> memcpy_avx2 causes more branch mispredictions than the benefit it
> >> brings in real-world scenarios, so 403.gcc will help us verify that.
> >>
> >> We tested 403.gcc linked with memcpy_new, 403.gcc linked with
> >> memcpy_avx2 for 3 times respectively:
> >>
> >> 403.gcc for memcpy_new results are below: (bigger is better)
> >> 1) 67.63718
> >> 2) 66.899156
> >> 3) 66.982456
> >>
> >> 403.gcc for memcpy_avx2 results are below:
> >>
> >> 1) 66.805236
> >> 2) 67.29362
> >> 3) 67.63718
> >>
> >> Above comparison results indicate memcpy_avx2 seem to be better,
> >> and we would like to do more experiments.
> >
> >
> > If I take the arithmetic mean of these I get:
> > 67.17293066666666666666 vs 67.24534866666666666666
> >
> > That's far less than 1 percent - so not conclusive at all,
> >
> > Andreas
> > --
> > Andreas Jaeger aj@{suse.com,opensuse.org} Twitter/Identica: jaegerandi
> > SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
> > GF: Jeff Hawn,Jennifer Guild,Felix Imendörffer,HRB16746 (AG Nürnberg)
> > GPG fingerprint = 93A3 365E CE47 B889 DF7F FED1 389A 563C C272 A126
> >