This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Split mantissa calculation loop and add branchprediction to mp multiplication


On Thu, 2013-01-03 at 17:19 +0100, Andreas Jaeger wrote:
> On 01/03/2013 05:18 PM, Steven Munroe wrote:
> > On Thu, 2013-01-03 at 09:08 +0530, Siddhesh Poyarekar wrote:
> >> On Wed, Jan 02, 2013 at 02:20:13PM -0600, Steven Munroe wrote:
> >>> I do not understand what you are doing here. If the intent is to replace
> >>> the X[], Y[], Z[] doubles with ints, you will get overflows in Z[]. If
> >>> you are changing X[], Y[], Z[] to uint64_t then you avoid the
> >>> overflows, but (Z[k] + CUTTER) - CUTTER has no effect and you have not
> >>> saved any space. Also, u is still a double, so you are adding some
> >>> expensive int->float->int conversions to the inner loop.
> >>
> >> I don't convert mantissa to int and leave everything as is.  I had
> >> posted the patch to do that earlier, which has not been commented upon
> >> yet and that's the one you should be looking at; this patch has a
> >> different purpose:
> >>
> >> http://sourceware.org/ml/libc-alpha/2012-12/msg00354.html
> >>
> >> None of the problems you're claiming will exist because:
> >>
> >> (1) The product is computed and stored in 64-bit
> >>
> >> (2) u does not exist since it is replaced by a much simpler operation,
> >>      which results in that snippet looking like this:
> >>
> >>      int64_t tmp = Z[k];
> >>      for (i=i1,j=i2-1; i<i2; i++,j--)
> >>        tmp += (int64_t) X[i]*Y[j];
> >>
> >>      Z[k]  = (int) (tmp % (1 << 24));
> >>      Z[--k] = (int) (tmp / (1 << 24));
> >>
> > This is very bad for POWER. PowerPC has (multiple) independent fixed
> > point and floating point pipelines. This allows super-scalar out-of-order
> > execution, UNTIL you force a transfer (through memory) between the
> > FPRs/GPRs. PowerPC has lots of registers (32+32+32), we expect the
> > compiler to keep lots of data in the registers, and so we don't optimize
> > the hardware for dependent load after store, we optimize for memory
> > bandwidth.
> >
> > Your proposed code forces an (unnecessary) double->long conversion and
> > FPR to GPR transfer into the inner loop, disabling any super-scalar
> > parallel execution. It also prevents loop unrolling and does not allow
> > GCC to make good use of all those registers we provide in the
> > architecture.
> >
> > So your code is optimized for (register poor, in-order-execution) X86 at
> > the expense of PowerPC.
> 
> 
> Steve, could you run the test program that Siddhesh has mentioned and show 
> the numbers with and without the patch, please? I'd like to see the 
> actual numbers.
> 

Actually I think it is up to Siddhesh to prove that his code does not
negatively impact other platforms.
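
For reference, the carry step in the quoted snippet can be sketched as a
standalone C function. This is only an illustration under stated
assumptions, not the code from the patch: mul_column, RADIX_BITS and RADIX
are hypothetical names, the digits are assumed to be non-negative ints
below 2^24, and the carry is returned as int64_t rather than being cast
to int as in the snippet.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed digit width: each mantissa digit is taken to be in radix 2^24. */
#define RADIX_BITS 24
#define RADIX ((int64_t) 1 << RADIX_BITS)

/* Hypothetical helper, not glibc code.  Adds the partial products
   x[i] * y[j] for i = i1 .. i2-1 (with j running down from i2-1) to the
   incoming value zk, then splits the 64-bit sum into a low digit and a
   carry for the next higher digit.  */
static void
mul_column (const int *x, const int *y, int i1, int i2,
            int64_t zk, int *digit, int64_t *carry)
{
  int64_t tmp = zk;
  int i, j;

  /* Each product is widened to 64 bits before it is accumulated.  */
  for (i = i1, j = i2 - 1; i < i2; i++, j--)
    tmp += (int64_t) x[i] * y[j];

  /* The split below assumes a non-negative sum.  */
  assert (tmp >= 0);

  *digit = (int) (tmp % RADIX);   /* low 24 bits stay in this digit */
  *carry = tmp / RADIX;           /* the rest moves to the next digit */
}
```

With two digits of 0xFFFFFF in each operand, the two cross products sum
to 2^49 - 2^26 + 2, so the function leaves a digit of 2 and a carry of
33554428. Whether this integer form is faster than the existing
double-based code on a given machine is exactly the question being
argued in this thread, and the sketch says nothing about that.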


