This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction


Last week, we separated 403.gcc from the CPU2006 benchmark suite and
compiled it with the additional option -mstringop-strategy=libcall, so
that block copies become calls to the library memcpy instead of the
rep_4byte, rep_8byte and rep_byte strategies, which use rep movs
instructions. 403.gcc contains plenty of branch instructions and is
very sensitive to the branch misprediction rate. We are currently
concerned that memcpy_avx2 might cause more branch mispredictions than
its benefit is worth in real-world scenarios, so 403.gcc will help us
verify this.

We ran 403.gcc three times linked with memcpy_new and three times
linked with memcpy_avx2:

403.gcc results for memcpy_new (bigger is better):
1) 67.63718
2) 66.899156
3) 66.982456

403.gcc results for memcpy_avx2:
1) 66.805236
2) 67.29362
3) 67.63718

The comparison above indicates that memcpy_avx2 seems to be slightly
better (mean 67.25 vs. 67.17), and we would like to do more experiments.
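To put that difference in context, here is a minimal sketch that computes each configuration's mean score and its spread (largest minus smallest run) from the six numbers reported above. The helpers mean() and range() are illustrative names, not part of any benchmark tool.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n scores. */
static double mean(const double *v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Spread of n scores: largest minus smallest. */
static double range(const double *v, size_t n) {
    assert(n > 0);
    double lo = v[0], hi = v[0];
    for (size_t i = 1; i < n; i++) {
        if (v[i] < lo) lo = v[i];
        if (v[i] > hi) hi = v[i];
    }
    return hi - lo;
}

/* 403.gcc scores from the three runs above (bigger is better). */
static const double score_new[]  = {67.63718, 66.899156, 66.982456};
static const double score_avx2[] = {66.805236, 67.29362, 67.63718};
```

The difference of means, mean(score_avx2, 3) - mean(score_new, 3), is about 0.07, while range(score_new, 3) is about 0.74, so with three runs per side the gap is well inside the run-to-run spread.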

Thanks
Ling

2013/6/8, Ondřej Bílka <neleai@seznam.cz>:
> On Sat, Jun 08, 2013 at 12:12:56AM +0800, Ling Ma wrote:
>> > First, it does not randomize the size in any way. This causes the
>> > branches to be predicted, and as branch prediction can account for
>> > 20% of the time, the results you get will be 20% off.
>> Ling: A widely held rule of thumb is that "a program spends 90% of
>> its execution time in only 10% of the code", which is why hardware
>> implements branch prediction: a stable pattern history gives
>> benchmarks (SPEC 2000) an average of 95% correct predictions, and
>> fully random code would make it useless.
>>
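Ondřej's objection is that calling memcpy with one fixed size lets the branch predictor learn the size-dependent branches inside it. A minimal sketch of a timing loop that draws sizes from a seeded PRNG instead, assuming POSIX clock_gettime is available; xorshift64(), fill_sizes() and time_copies() are hypothetical helpers, not part of glibc's benchtests.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>              /* memcpy, memmove: candidates to time */
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* xorshift64: tiny deterministic PRNG so runs are repeatable. */
static uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Fill sizes[] with either one fixed size or sizes drawn from
   [1, max_size]; the random mix keeps the size-dependent branches
   inside memcpy from being perfectly predicted. */
static void fill_sizes(size_t *sizes, size_t count, size_t max_size,
                       int randomize, uint64_t seed) {
    assert(max_size > 0);
    uint64_t state = seed ? seed : 1;   /* xorshift state must be nonzero */
    for (size_t i = 0; i < count; i++)
        sizes[i] = randomize ? 1 + (size_t)(xorshift64(&state) % max_size)
                             : max_size;
}

/* Copy once per entry of sizes[] and return elapsed nanoseconds. */
static long long time_copies(copy_fn copy, char *dst, const char *src,
                             const size_t *sizes, size_t count) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < count; i++)
        copy(dst, src, sizes[i]);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000000000LL
           + (t1.tv_nsec - t0.tv_nsec);
}
```

For instance, fill one array with randomize = 0 and another with randomize = 1, then call time_copies(memcpy, dst, src, sizes, count) on each; dst and src must each hold at least max_size bytes. Alignments could be randomized the same way by offsetting dst and src.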
> And are you sure that this is relevant for memcpy? Compile and run the
> simple program below.
>
> gcc -fPIC -shared memcpy.c -o memcpy.so
> LD_PRELOAD=./memcpy.so bash 2> memcpy_input
>
> It records the alignment and size of every memcpy call made in that
> shell, so you can see how random they are.
>
>> > For example, when you ran
>> > ./memcpy-test-avx2-bench
>> > the CPU frequency could have been 800MHz, and then during
>> > ./memcpy-test-new-bench
>> > a governor could decide to switch to 2.5GHz, making the results
>> > above three times worse than they are.
>> Ling: I can confirm it is not an issue in my compare.html, but I would
>> like to send out double-checked results.
>>
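One way to reduce the frequency-scaling bias Ondřej describes, short of pinning the governor (for example with cpupower frequency-set -g performance), is to warm up and then alternate the two implementations inside one process. A minimal sketch under that assumption; now_ns(), time_one() and compare_interleaved() are hypothetical helpers, and keeping the fastest round is one common choice, not the method used in this thread.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <assert.h>
#include <stddef.h>
#include <string.h>              /* memcpy, memmove: candidates to compare */
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

static long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Time `reps` copies of `size` bytes with one implementation. */
static long long time_one(copy_fn copy, char *dst, const char *src,
                          size_t size, int reps) {
    long long t0 = now_ns();
    for (int i = 0; i < reps; i++)
        copy(dst, src, size);
    return now_ns() - t0;
}

/* Warm up, then alternate A and B for `rounds` rounds, keeping the
   fastest round of each.  Because A and B run back to back, a governor
   change during the measurement affects both implementations instead
   of only whichever binary happened to run second. */
static void compare_interleaved(copy_fn a, copy_fn b, char *dst,
                                const char *src, size_t size, int reps,
                                int rounds, long long *best_a,
                                long long *best_b) {
    assert(rounds > 0);
    time_one(a, dst, src, size, reps);   /* warm-up: let the CPU ramp up */
    time_one(b, dst, src, size, reps);
    *best_a = *best_b = -1;
    for (int r = 0; r < rounds; r++) {
        long long ta = time_one(a, dst, src, size, reps);
        long long tb = time_one(b, dst, src, size, reps);
        if (*best_a < 0 || ta < *best_a) *best_a = ta;
        if (*best_b < 0 || tb < *best_b) *best_b = tb;
    }
}
```

For example, compare_interleaved(memcpy, memmove, dst, src, 4096, 1000, 10, &ta, &tb) compares the two glibc entry points on 4096-byte copies; two separately built memcpy variants would be passed the same way.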
>> Ondra, if we could test with a real benchmark, it would better
>> approximate real-world usage. If anyone knows good memcpy benchmarks
>> that represent real-world applications, could you please tell us?
>>
> The one I posted.
>> Thanks & Best Regards
>> Ling
>
>
>
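> #include <stddef.h>
> #include <stdint.h>
> #include <stdio.h>
> #undef memcpy
> /* Build without optimization (as in the command above), or add
>    -fno-tree-loop-distribute-patterns, so GCC does not turn the loop
>    below back into a call to memcpy, which would recurse.  */
> void *memcpy(void *_x, const void *_y, size_t n){
>   char *x = _x;
>   const char *y = _y;
>   size_t i;
>
>   /* Plain byte copy, so this hook does not depend on another memcpy.  */
>   for (i = 0; i < n; i++){
>     x[i] = y[i];
>   }
>   /* Log dest alignment, src alignment (both mod 64) and size.  */
>   fprintf(stderr, "memcpy:%d dest %d src %zu size\n",
>           (int)((uintptr_t)_x % 64), (int)((uintptr_t)_y % 64), n);
>   return _x;
> }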
>
>
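To see how random the recorded calls actually are, here is a sketch that reads the memcpy_input log and counts calls per power-of-two size bucket and per alignment. It assumes the "memcpy:<dest> dest <src> src <size> size" line format printed by the hook above. struct memcpy_stats, size_bucket() and read_stats() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NBUCKETS 32

/* Summary of a memcpy_input log written by the LD_PRELOAD hook. */
struct memcpy_stats {
    unsigned long calls;
    unsigned long size_bucket[NBUCKETS]; /* 0: size 0; k: [2^(k-1), 2^k) */
    unsigned long dest_align[64];        /* dest address mod 64 */
    unsigned long src_align[64];         /* src address mod 64 */
};

/* Bucket index: 0 for size 0, otherwise floor(log2(n)) + 1; the last
   bucket also collects anything larger. */
static int size_bucket(unsigned long n) {
    int k = 0;
    while (n && k < NBUCKETS - 1) {
        n >>= 1;
        k++;
    }
    return k;
}

/* Read the log line by line; lines that are not hook output (any other
   text the shell writes to stderr) are skipped. */
static void read_stats(FILE *in, struct memcpy_stats *st) {
    char line[256];
    int dest, src;
    unsigned long size;
    assert(in != NULL && st != NULL);
    *st = (struct memcpy_stats){0};
    while (fgets(line, sizeof line, in)) {
        if (sscanf(line, "memcpy:%d dest %d src %lu size",
                   &dest, &src, &size) != 3)
            continue;
        if (dest < 0 || dest >= 64 || src < 0 || src >= 64)
            continue;
        st->calls++;
        st->size_bucket[size_bucket(size)]++;
        st->dest_align[dest]++;
        st->src_align[src]++;
    }
}
```

A caller would open the log with fopen("memcpy_input", "r"), pass it to read_stats(), and print the nonzero buckets; a log dominated by a few buckets means the branches are predictable in practice, a flat one means they are not.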

