This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7

From: Steven Munroe <munroesj at linux dot vnet dot ibm dot com>
To: Richard Henderson <rth at twiddle dot net>
Cc: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>, libc-alpha at sourceware dot org
Date: Wed, 18 Sep 2013 11:52:32 -0500
Subject: Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7
Authentication-results: sourceware.org; auth=none
References: <523715EE dot 9070408 at linux dot vnet dot ibm dot com> <20130917061516 dot GA30130 at bubble dot grove dot modra dot org> <5239B9FE dot 4090506 at linux dot vnet dot ibm dot com> <5239CC7B dot 5010804 at twiddle dot net>
Reply-to: munroesj at us dot ibm dot com

On Wed, 2013-09-18 at 08:53 -0700, Richard Henderson wrote:
> On 09/18/2013 07:34 AM, Adhemerval Zanella wrote:
> > +	extrdi.	rTMP, rALT, 8, 0
> > +	stbu	rTMP, 8(rRTN)
> > +	beqlr
> > +	extrdi.	rTMP, rALT, 8, 8
> > +	stbu	rTMP, 1(rRTN)
> > +	beqlr
> > +	extrdi.	rTMP, rALT, 8, 16
> > +	stbu	rTMP, 1(rRTN)
> > +	beqlr
> > +	extrdi.	rTMP, rALT, 8, 24
> > +	stbu	rTMP, 1(rRTN)
> > +	beqlr
> > +	extrdi.	rTMP, rALT, 8, 32
> > +	stbu	rTMP, 1(rRTN)
> > +	beqlr
> > +	extrdi.	rTMP, rALT, 8, 40
> > +	stbu	rTMP, 1(rRTN)
> > +	beqlr
> > +	extrdi.	rTMP, rALT, 8, 48
> > +	stbu	rTMP, 1(rRTN)
> > +	beqlr
> > +	stbu	rALT, 1(rRTN)
> 
> I, like Ondrej, have trouble believing that 4 arithmetic insns + 1 unaligned
> load + 1 unaligned store is slower than this compare-branch ladder.
> 
> However good Power7's branch predictor is, I bet its out-of-order insn
> scheduler is better.  Issue the 6 insns, return from subroutine, surely.
> 
> You've got the location of the zero in rMASK from cmpb:
> 
>   cntlzd   rMASK, rMASK      // extract bit offset of nul byte
>   srdi     rMASK, rMASK, 3   // convert bit offset to byte offset
>   addi     rALT, rMASK, -7   // include the previous 7 bytes plus the nul
>   ldx      rTMP, rSRC, rALT  // perform one last unaligned copy
>   stdx     rTMP, rRTN, rALT
>   add      rRTN, rRTN, rMASK // adjust the return value
>   blr
> 
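
For readers following the quoted sequence, its arithmetic can be modelled in C. This is only a sketch under stated assumptions: a big-endian view of the doubleword (as on PPC64 at the time), the helper names zero_byte_mask, clz64 and copy_tail are hypothetical, and the 7 bytes before the current doubleword are assumed to be readable in the source and already written in the destination. It says nothing about how the sequence performs on POWER7.

```c
#include <stdint.h>
#include <string.h>

/* Model of cmpb against zero: 0xFF in every byte position where w holds 0. */
static uint64_t zero_byte_mask(uint64_t w)
{
    uint64_t m = 0;
    for (int i = 0; i < 8; i++)
        if (((w >> (8 * i)) & 0xFF) == 0)
            m |= (uint64_t)0xFF << (8 * i);
    return m;
}

/* Model of cntlzd: number of leading zero bits, 64 when x is 0. */
static unsigned clz64(uint64_t x)
{
    unsigned n = 0;
    while (n < 64 && !(x & ((uint64_t)1 << 63))) {
        x <<= 1;
        n++;
    }
    return n;
}

/* src and dst point at the doubleword that contains the nul byte.
   Assumption: src[-7..7] is readable and dst[-7..7] is writable. */
static char *copy_tail(char *dst, const char *src)
{
    uint64_t w = 0;
    for (int i = 0; i < 8; i++)             /* big-endian view of the doubleword */
        w = (w << 8) | (unsigned char)src[i];

    uint64_t mask = zero_byte_mask(w);
    long off = (long)(clz64(mask) >> 3);    /* byte offset of the first nul */
    long alt = off - 7;                     /* start of the 8 bytes ending at the nul */

    memcpy(dst + alt, src + alt, 8);        /* one last unaligned 8-byte copy */
    return dst + off;                       /* stpcpy returns a pointer to the nul */
}
```

The final copy overlaps bytes that were already stored, which is what lets a single 8-byte load and store finish the string regardless of where the nul falls in the doubleword.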

With unaligned loads and stores we have to worry about crossing page
boundaries and perhaps a segfault. I don't see how your proposal
addresses that issue.
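
To make the page-boundary concern concrete, here is a minimal sketch of the test in C. The 4096-byte page size is an assumption for illustration only (POWER systems are often configured with 64K pages, and real code would query the size, for example with sysconf(_SC_PAGESIZE)); the helper name crosses_page is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for illustration; not taken from the patch. */
#define ASSUMED_PAGE_SIZE 4096u

/* Returns true when an access of len bytes (len >= 1) starting at addr
   touches more than one page. */
static bool crosses_page(uintptr_t addr, size_t len)
{
    uintptr_t first = addr / ASSUMED_PAGE_SIZE;             /* page of first byte */
    uintptr_t last = (addr + len - 1) / ASSUMED_PAGE_SIZE;  /* page of last byte */
    return first != last;
}
```

An 8-byte access that is 8-byte aligned can never satisfy this test, which is why aligned doubleword loads are safe here; an unaligned one can touch a neighbouring page that may not be mapped.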

Also, I would like to see the scrollpipe trace that backs up your
assertion.

The aggressive out-of-order nature of POWER, and POWER7 specifically,
tends to give different results than our mental model based on our
experience with other processors.

Follow-Ups:
- Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7
  - From: Richard Henderson

References:
- [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7
  - From: Adhemerval Zanella
- Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7
  - From: Alan Modra
- Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7
  - From: Adhemerval Zanella
- Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7
  - From: Richard Henderson
