This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- From: Ling Ma <ling dot ma dot program at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org, aj at suse dot com, liubov dot dmitrieva at gmail dot com, Ma Ling <ling dot ml at alibaba-inc dot com>
- Date: Tue, 30 Jul 2013 15:02:48 +0800
- Subject: Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- References: <1375090922-8418-1-git-send-email-ling dot ma dot program at gmail dot com> <20130729171927 dot GA12218 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dNY9KP_OdGNW79iLiCHu4L=8fCNFg=ZZpMiRFN0CHJZ1g at mail dot gmail dot com> <20130730044925 dot GA6890 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dO_=rku9NG_8V1OFy7ZR2UZNT0h1B_oT4KP2t-asCdX8g at mail dot gmail dot com> <20130730062814 dot GA8185 at domone dot kolej dot mff dot cuni dot cz>
Thanks for your code. I ran it on Haswell, with the results below:
[root@localhost test_broadcast]# ./test
0.19 0.19
0.19 0.19
0.19 0.19
0.19 0.19
0.19 0.19
0.19 0.19
0.19 0.19
0.19 0.19
0.19 0.19
0.19 0.19
2013/7/30, Ondřej Bílka <neleai@seznam.cz>:
> On Tue, Jul 30, 2013 at 01:35:49PM +0800, Ling Ma wrote:
>> >> >> +L(less_128bytes):
>> >> >> + xor %esi, %esi
>> >> >> + mov %ecx, %esi
>> >> > And this? A C equivalent of this is
>> >> > x = 0;
>> >> > x = y;
>> >> Ling: we used mov %sil, %cl in the code above; now %esi becomes the
>> >> destination register (mov %ecx, %esi), so there is a false dependence
>> >> hazard. We use xor r1, r1 to ask the decode stage to break the dependence,
>> >> and inside the pipeline xor r1, r1 will be removed before entering the
>> >> execution stage.
>> >>
>> > That is pointless as mov breaks false dependencies.
>> >
>> > Anyway, the code you use is redundant. You already have that computed, so
>> > a simple mov %xmm0, %rcx will do the job.
>>
>> Ling: Usually the rename stage can help us resolve most WAR and WAW hazards,
>> but we use %sil instead of %esi, which involves partial
>> register access.
> It does not matter. Also, you have plenty of free registers available.
>> I remember that mov xmm0, r32/64 causes a cross-domain operation, which is
>> not good on Nehalem; I may test whether it exists on Haswell.
> Wrong again, even on Nehalem. I tested your code and the xmm code, and you
> are 50% slower (and I am not counting that computing
> pshufb is free in our case).
>
> Time in seconds to compute the broadcast 1000000000 times:
> your sse
> 0.37 0.23
> 0.36 0.23
> 0.36 0.23
> 0.36 0.24
> 0.36 0.23
> 0.36 0.23
> 0.37 0.24
> 0.36 0.23
> 0.36 0.23
> 0.36 0.23
>
> The benchmark is attached.
>
>
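For readers without the attachment, the two broadcast strategies discussed above can be sketched in C. This is only an illustration, not the attached benchmark and not the code from the patch: the function names broadcast_gpr and broadcast_via_xmm are hypothetical, the multiply is a stand-in for the general-purpose-register sequence, and the SSE2 intrinsics stand in for the "mov %xmm0, %rcx" suggestion (the thread itself uses pshufb for the vector broadcast). On targets without x86-64 SSE2 the second function falls back to the first.

```c
#include <stdint.h>
#if defined(__SSE2__) && defined(__x86_64__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: replicate the low byte of c into all 8 bytes
   using only a general-purpose register (multiply by 0x0101...01). */
static uint64_t broadcast_gpr(int c)
{
    return (uint64_t)(uint8_t)c * 0x0101010101010101ULL;
}

/* Hypothetical helper: broadcast the byte in a vector register first,
   then read the low 64 bits back into a general-purpose register.
   This mirrors the idea of reusing an already-computed %xmm0. */
static uint64_t broadcast_via_xmm(int c)
{
#if defined(__SSE2__) && defined(__x86_64__)
    __m128i v = _mm_set1_epi8((char)c);     /* all 16 bytes = c */
    return (uint64_t)_mm_cvtsi128_si64(v);  /* low 8 bytes to a GPR */
#else
    return broadcast_gpr(c);                /* portable fallback */
#endif
}
```

Both helpers return the same 64-bit pattern for a given byte, so timing each in a loop of 1000000000 iterations would give numbers comparable in spirit to the two columns above, although compiler optimizations can change what is actually measured.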