This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction

From: Ondřej Bílka <neleai at seznam dot cz>
To: Ling Ma <ling dot ma dot program at gmail dot com>
Cc: libc-alpha at sourceware dot org, aj at suse dot com, liubov dot dmitrieva at gmail dot com, Ma Ling <ling dot ml at alibaba-inc dot com>
Date: Tue, 30 Jul 2013 08:28:14 +0200
Subject: Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
References: <1375090922-8418-1-git-send-email-ling dot ma dot program at gmail dot com> <20130729171927 dot GA12218 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dNY9KP_OdGNW79iLiCHu4L=8fCNFg=ZZpMiRFN0CHJZ1g at mail dot gmail dot com> <20130730044925 dot GA6890 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dO_=rku9NG_8V1OFy7ZR2UZNT0h1B_oT4KP2t-asCdX8g at mail dot gmail dot com>

On Tue, Jul 30, 2013 at 01:35:49PM +0800, Ling Ma wrote:
> >> >> +L(less_128bytes):
> >> >> +	xor	%esi, %esi
> >> >> +	mov	%ecx, %esi
> >> > And this? A C equivalent of this is
> >> > x = 0;
> >> > x = y;
> >> Ling: we used mov %sil, %cl in the above code; now %esi becomes the
> >> destination register (mov %ecx, %esi), so there is a false dependence
> >> hazard. We use xor r1, r1 to ask the decode stage to break the dependence,
> >> and inside the pipeline xor r1, r1 will be removed before entering the
> >> execution stage.
> >>
> > That is pointless as mov breaks false dependencies.
> >
> > Anyway, the code you use is redundant. You already have that computed, so
> > a simple mov %xmm0, %rcx will do the job.
> 
> Ling: Usually the rename stage can help us resolve most WAR and WAW hazards,
> but we use %sil instead of %esi, which involves a partial
> register access.
It does not matter. Also, you have plenty of free registers available.
> I remember that mov xmm0, r32/64 causes a cross-domain operation, which is
> not good on Nehalem; I may test whether it exists on Haswell.
Wrong again, even on Nehalem. I tested your code and the xmm code, and yours
is 50% slower (and I am not counting that computing
pshufb is free in our case).

The time in seconds to compute the broadcast 1000000000 times is:
yours	sse
0.37	0.23
0.36	0.23
0.36	0.23
0.36	0.24
0.36	0.23
0.36	0.23
0.37	0.24
0.36	0.23
0.36	0.23
0.36	0.23

This was measured with the attached benchmark.
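
For readers without the attachment, here is a minimal C sketch of the two ways of getting the broadcast byte into a 64-bit general-purpose register that are being compared. It is not the attached benchmark. The function names are hypothetical, the scalar version uses a multiply as a stand-in for the patch's register sequence, and _mm_set1_epi8 leaves the choice of broadcast instruction (pshufb or otherwise) to the compiler.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__) && defined(__x86_64__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#define HAVE_XMM_BROADCAST 1
#endif

/* Scalar path (hypothetical name): replicate the low byte of c into all
   eight bytes of a 64-bit value using general-purpose registers only. */
static uint64_t broadcast_gpr(int c)
{
    return (uint64_t)(unsigned char)c * UINT64_C(0x0101010101010101);
}

/* Vector path (hypothetical name): the byte is broadcast in an xmm
   register, which memset needs anyway, and the low 64 bits are then read
   back. The read-back corresponds to the "mov %xmm0, %rcx" step. */
static uint64_t broadcast_xmm(int c)
{
#ifdef HAVE_XMM_BROADCAST
    __m128i v = _mm_set1_epi8((char)c);
    return (uint64_t)_mm_cvtsi128_si64(v);
#else
    return broadcast_gpr(c);  /* fallback so the sketch builds elsewhere */
#endif
}

/* Check that both paths produce the same value for every byte. */
static void check_broadcasts(void)
{
    for (int c = 0; c < 256; c++)
        assert(broadcast_gpr(c) == broadcast_xmm(c));
}
```

Calling check_broadcasts() only confirms that the two paths agree; timing them requires a loop in which the compiler cannot hoist the call, which this sketch does not attempt.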

Attachment: test_broadcast.tar.bz2
Description: Binary data
