This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 17 Sep 2013 08:05:50 +0200
- Subject: Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7
- Authentication-results: sourceware.org; auth=none
- References: <523715EE dot 9070408 at linux dot vnet dot ibm dot com> <20130916183347 dot GA32742 at domone> <52375A73 dot 4080906 at linux dot vnet dot ibm dot com> <20130916214530 dot GA7907 at domone> <5237AF1B dot 6040103 at linux dot vnet dot ibm dot com>
On Mon, Sep 16, 2013 at 10:23:39PM -0300, Adhemerval Zanella wrote:
> On 16-09-2013 18:45, Ondřej Bílka wrote:
> > On Mon, Sep 16, 2013 at 04:22:27PM -0300, Adhemerval Zanella wrote:
> >> On 16-09-2013 15:33, Ondřej Bílka wrote:
> >> I'm not sure I understood your question. By handling '8 bytes separately'
> >> you mean what exactly? The patch added the doubleword case for aligned strings,
> >> where both the default algorithm and the POWER7 implementation will exit in the
> >> first check after extrdi.
> >>
> > I mean implementations like this one; I wrote this mostly to show the
> > unaligned store. There are several additional optimizations possible.
> >
> > char *stpcpy (char *x, char *y)
> > {
> >   uint64_t ones = 0x0101010101010101;
> >   uint64_t mask = 0x80 * ones;
> >
> >   /* handle strings less than 8 bytes and align x */
> >
> >   while (1)
> >     {
> > /* Zero-byte test; assumes little-endian byte order for the ffsll step.  */
> > #define has_zero(p) ((*(uint64_t *) (p) - ones) & ~*(uint64_t *) (p) & mask)
> >       if (has_zero (x))
> >         {
> >           int first = (ffsll (has_zero (x)) - 1) / 8;
> >           /* Overlapping store of the 8 bytes ending at the terminator.  */
> >           *(uint64_t *) (x + first - 7) = *(uint64_t *) (y + first - 7);
> >           return x + first;
> >         }
> >       *(uint64_t *) x = *(uint64_t *) y;
> >       x += 8;
> >       y += 8;
> >     }
> > }
> >
> >
> > uint64_t extract = 0x0102040810204080;
>
> Well, besides the fact that the algorithm seems wrong (I tried it and it failed on test-stpcpy, but
> I understand you did it as a sketch),
How could you expect it to work when it is clearly only part of the implementation? You would need
to actually implement the code that I omitted for simplicity and described by:
/* handle strings less than 8 bytes and align x */
> the idea of my patch is not optimize *unaligned* access,
They are assumed to be aligned, as the comment above says.
> but rather the aligned ones. I plan to investigate the aligned case later, since on POWER
> unaligned accesses are costly.
>
You cannot dismiss an alternative because one part of it is costly; you need to
analyze the costs of each alternative.
What I do avoids writing one part of the string byte by byte, plus a branch
misprediction at the end, which could be even more expensive than one
unaligned access.
>
> >
> >> For the general implementation, the algorithm used to find '0' bytes is explained in
> >> sysdeps/powerpc/powerpc64/strlen.S. I decided to use the first option in the inner
> >> loops as the implementation already uses it for words. For POWER7 the algorithm
> >> is simpler and the only thing we need to do is shift and check for NULLs after
> >> the loop.
> >>
> > There is a separate issue: the strlen code is a little inefficient. In
> > that loop one branch is redundant. A better implementation has the following
> > loop.
> > I optimized this for asymptotic complexity; extracting the length could
> > be done more efficiently in particular cases.
> >
> > size_t strlen (const char *x)
> > {
> >   uint64_t ones = 0x0101010101010101;
> >   uint64_t mask = 0x80 * ones;
> >   uint64_t extract = 0x0102040810204080;
> >
> >   /* handle first 16 bytes */
> >
> >   const char *x0 = x;
> >
> >   /* align x */
> >
> >   while (1)
> >     {
> >       uint64_t v0 = *(uint64_t *) x;
> >       uint64_t v8 = *(uint64_t *) (x + 8);
> >       uint64_t r0 = (v0 - ones) & ~v0 & mask;
> >       uint64_t r8 = (v8 - ones) & ~v8 & mask;
> >       if (r0 | r8)
> >         {
> >           int bitmask0 = ((r0 >> 7) * extract) >> 56;
> >           int bitmask8 = ((r8 >> 7) * extract) >> 56;
> >           int first = ffs (bitmask0 | (bitmask8 << 8));
> >           return x - x0 + first - 1;
> >         }
> >       x += 16;
> >     }
> > }
>
> Again, this assumes unaligned cases are not costly,
See the comment /* align x */; there are no unaligned accesses there.
>which is not the case for POWER. I'll
> investigate it later, but if you check the strlen optimization for POWER7 there is a *lot*
> of care taken to *not* do unaligned memory accesses. This is the case for all string optimizations
> for POWER7.
>
>
> >
> >>> Second is how slow are overlapping stores versus branch misprediction.
> >>> You need a benchmark that will vary sizes to check this, I could supply
> >>> one.
> >>>
> >> You can also add the benchmark to the existing GLIBC infrastructure. For POWER7, the
> >> branch predictor is quite smart, and from my experience trying to use branch
> >> prediction instructions is usually not a good idea.
> > I am not interested in branch prediction instructions but in the cost of
> > mispredictions, in particular the difference in time of the following benchmark
> > when we switch loops with -DRAND
> >
> For the current implementation, none; for my patch I see an improvement of 60% for the case
> without -DRAND (the -DRAND time does not change).