This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction
- From: Ling Ma <ling dot ma dot program at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Nix <nix at esperi dot org dot uk>, libc-alpha at sourceware dot org, hongjiu dot lu at intel dot com
- Date: Mon, 10 Jun 2013 14:17:17 +0800
- Subject: Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction
- References: <CAOGi=dMiD=_Qf1EJ=F3hfyQDtQubDEC5pjpXKDCHrUQwhr=vzg at mail dot gmail dot com> <20130605161954 dot GA26401 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dPWPaX5prcL-uAaqS6=_ehzKeBmAFMdwV6aU34jZ0eHtQ at mail dot gmail dot com> <20130606125511 dot GA28565 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dPs9geCtrWhU1L_0DEfOWOknpzFSLmYs4gbYzGX8Zn5Hg at mail dot gmail dot com> <20130607104613 dot GA6343 at domone dot kolej dot mff dot cuni dot cz> <8761xqru5w dot fsf at spindle dot srvr dot nix> <CAOGi=dMV5jaS2597cksd0mW84UDd06SovsBkL5=WPez-jZWg4g at mail dot gmail dot com> <20130607160749 dot GA28961 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dP2s4k2rg8TdKwj6V9-VzbOORGzeBmh-G=Fr1eM_OyDoA at mail dot gmail dot com> <20130607184550 dot GA9683 at domone dot kolej dot mff dot cuni dot cz>
Last week we separated 403.gcc from the CPU2006 benchmark suite and
compiled it with the additional option -mstringop-strategy=libcall to
avoid the rep_4byte, rep_8byte and rep_byte strategies, which use rep
movs instructions. 403.gcc has plenty of branch instructions and is
very sensitive to the branch misprediction rate. Our current concern is
whether memcpy_avx2 costs more in branch mispredictions than it gains
in a real-world scenario, so 403.gcc helps us verify that.
We tested 403.gcc linked with memcpy_new and 403.gcc linked with
memcpy_avx2, three times each.
403.gcc results for memcpy_new (bigger is better):
1) 67.63718
2) 66.899156
3) 66.982456
403.gcc results for memcpy_avx2:
1) 66.805236
2) 67.29362
3) 67.63718
The results above indicate that memcpy_avx2 seems to be better,
and we would like to do more experiments.
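
As a rough cross-check of the six scores listed above, the following
sketch averages each set and computes the relative difference. The
names mean_score and relative_gain_percent are made up for this
illustration and are not part of the benchmark or of glibc.

```c
#include <assert.h>
#include <stddef.h>

/* Scores copied from the three runs of each build listed above. */
const double memcpy_new_runs[3]  = { 67.63718, 66.899156, 66.982456 };
const double memcpy_avx2_runs[3] = { 66.805236, 67.29362, 67.63718 };

/* Arithmetic mean of n benchmark scores (hypothetical helper). */
double mean_score(const double *scores, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(scores != NULL && n > 0);
    for (i = 0; i < n; i++)
        sum += scores[i];
    return sum / (double)n;
}

/* Relative difference of `other` over `base`, in percent
   (hypothetical helper). */
double relative_gain_percent(double base, double other)
{
    assert(base != 0.0);
    return (other - base) / base * 100.0;
}
```

Calling mean_score on the two arrays gives about 67.17 for memcpy_new
and about 67.25 for memcpy_avx2, a difference of roughly 0.1%. The
spread inside each set of three runs (about 0.7 to 0.8) is larger than
that difference, which is consistent with wanting more experiments.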
Thanks
Ling
2013/6/8, Ondřej Bílka <neleai@seznam.cz>:
> On Sat, Jun 08, 2013 at 12:12:56AM +0800, Ling Ma wrote:
>> > First, it does not randomize the size in any way. This causes branches
>> > to be predicted well, and as branch prediction can account for 20% of
>> > the time, the results you get will be 20% off.
>> Ling: Because "A widely held rule of thumb is that a program spends
>> 90% of its execution time in only 10% of the code", hardware
>> implements branch prediction mechanisms. A stable pattern history
>> gives benchmarks (SPEC 2000) about 95% correct prediction on average;
>> fully random code would make it useless.
>>
> And are you sure that it is relevant for memcpy? Compile and run the
> simple program below.
>
> gcc -fPIC -shared memcpy.c -o memcpy.so
> LD_PRELOAD=./memcpy.so bash 2> memcpy_input
>
> It will record alignments and sizes of each memcpy call you do in that
> shell. You can see how random they are.
>
>> > For example, as you ran
>> > ./memcpy-test-avx2-bench
>> > the CPU frequency could be 800MHz,
>> > then in
>> > ./memcpy-test-new-bench
>> > a governor can decide to switch to 2.5GHz, making the results above
>> > three times worse than they are.
>> Ling: I can confirm it is not an issue in my compare.html, but I would
>> like to send out a double-checked result.
>>
>> Ondra, if we can test a real benchmark, that will better approximate
>> our real-world usage. If anyone knows good memcpy benchmarks that
>> represent real-world applications, could you please tell us?
>>
> The one I posted.
>> Thanks & Best Regards
>> Ling
>
>
>
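> #include <stdio.h>
> #include <stdint.h>
> #undef memcpy
> /* Byte-wise memcpy replacement that logs, for every call, the
>    alignment (address mod 64) of both pointers and the size.
>    Build without optimization so the loop is not turned back into
>    a memcpy call.  */
> void *memcpy(void *_x, const void *_y, size_t n){
>   char *x = _x;
>   const char *y = _y;
>   size_t i;
>
>   for(i = 0; i < n; i++){
>     x[i] = y[i];
>   }
>   fprintf(stderr, "memcpy:%i dest %i src %zu size\n",
>           (int)((uintptr_t)_x % 64), (int)((uintptr_t)_y % 64), n);
>   return _x;
> }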
>
>
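
The log written by the interposer above can be summarized with a small
parser. The sketch below assumes each record is one line of the form
"memcpy:<dest> dest <src> src <size> size", as produced by the fprintf
call in the quoted program. The names memcpy_record, parse_memcpy_line
and count_small_copies are hypothetical and only serve this
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One record from the log; field names are illustrative. */
struct memcpy_record {
    int dest_align;   /* destination address modulo 64 */
    int src_align;    /* source address modulo 64 */
    size_t size;      /* number of bytes copied */
};

/* Parse one log line. Returns 1 on success, 0 if the line does not
   match the expected record format. */
int parse_memcpy_line(const char *line, struct memcpy_record *rec)
{
    assert(line != NULL && rec != NULL);
    return sscanf(line, "memcpy:%d dest %d src %zu size",
                  &rec->dest_align, &rec->src_align, &rec->size) == 3;
}

/* Count the records in `in` whose size is <= limit. The number of
   records parsed in total is stored in *total. */
size_t count_small_copies(FILE *in, size_t limit, size_t *total)
{
    char line[256];
    struct memcpy_record rec;
    size_t small = 0;

    assert(in != NULL && total != NULL);
    *total = 0;
    while (fgets(line, sizeof line, in) != NULL) {
        if (!parse_memcpy_line(line, &rec))
            continue;   /* skip other stderr output from the shell */
        (*total)++;
        if (rec.size <= limit)
            small++;
    }
    return small;
}
```

Opening memcpy_input with fopen and passing it to count_small_copies
with a limit such as 64 gives a first impression of how many calls are
short copies. Lines that are ordinary shell error messages are skipped,
since the file captures everything bash wrote to stderr.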