This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Mon, 16 Sep 2013 23:45:30 +0200
- Subject: Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7
- Authentication-results: sourceware.org; auth=none
- References: <523715EE dot 9070408 at linux dot vnet dot ibm dot com> <20130916183347 dot GA32742 at domone> <52375A73 dot 4080906 at linux dot vnet dot ibm dot com>
On Mon, Sep 16, 2013 at 04:22:27PM -0300, Adhemerval Zanella wrote:
> > On 16-09-2013 15:33, Ondřej Bílka wrote:
> > On Mon, Sep 16, 2013 at 11:30:06AM -0300, Adhemerval Zanella wrote:
> >> Hi all,
> >>
> >> Following Alan Modra suggestion, it is a stpcpy optimization patch for PPC64.
> >> This patch optimizes the default PPC64 by adding doubleword stores/loads
> >> increasing aligned throughput for large sizes.
> >>
> > An obvious question here is why it needs to be kept separate from the
> > strcpy implementation. The sysdeps/powerpc/powerpc64/st[pr]cpy.S files are
> > quite similar, and I do not see a sysdeps/powerpc/powerpc64/power7/strcpy.S
> > file. Would the same optimization apply to strcpy?
>
> Thanks for the review Ondřej. I didn't check strcpy yet, but I plan to. Regarding
> the implementation, indeed stpcpy and strcpy are quite similar, but I think assembly
> implementations are already more difficult to maintain, and throwing in a lot of
> ifdefs won't make it any easier.
>
They are simpler to keep in sync than having several implementations
that duplicate the same logic.
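For reference, other glibc targets keep the two entry points in one source by
wrapping the strcpy body; a sketch of that convention (the file layout here is
illustrative, not taken from this patch):

```c
/* Hypothetical stpcpy.S: reuse the strcpy.S body, which would check
   USE_AS_STPCPY only where the return value and final store differ.  */
#define USE_AS_STPCPY
#define STRCPY __stpcpy
#include "strcpy.S"
```

A single conditional at the function's exit is then the only place where the
two variants diverge, rather than two whole files to keep in sync.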
> >
> > Also I noted in the implementation that, provided we handle the case of
> > writing less than 8 bytes separately, the best way to finish on x64 would
> > be to compute the end and do an overlapping store of the last 8 bytes.
> >
> > There are two things I do not know. The first is computing the end; on Wikipedia
> > I found that this could be handled by cntlz on the mask and dividing the result by 8.
>
> I'm not sure I understood your question. By handling '8 bytes separately',
> what exactly do you mean? The patch added the doubleword case for aligned strings,
> where both the default algorithm and the POWER7 implementation will exit in the
> first check after extrdi.
>
I mean implementations like this one. I wrote it mostly to show the
unaligned store; there are several additional optimizations possible.
char *stpcpy (char *x, char *y)
{
  uint64_t ones = 0x0101010101010101ULL;
  uint64_t mask = 0x80 * ones;
  /* handle strings less than 8 bytes and align y here */
  while (1)
    {
      /* Nonzero exactly when the word at P contains a zero byte; the
         result has 0x80 in each zero byte's position.  */
#define has_zero(p) (((((*(uint64_t *)(p)) | mask) - ones) & mask) ^ mask)
      if (has_zero (y))
        {
          /* Index one past the terminating null byte.  */
          int first = ffsll (has_zero (y)) / 8;
          /* Overlapping store of the last 8 bytes, ending at the null.  */
          *(uint64_t *)(x + first - 8) = *(uint64_t *)(y + first - 8);
          return x + first - 1;
        }
      *(uint64_t *)(x) = *(uint64_t *)(y);
      x += 8;
      y += 8;
    }
}
> For the general implementation, the algorithm used to find '0' bytes is explained
> in sysdeps/powerpc/powerpc64/strlen.S. I decided to use the first option in the
> inner loops as the implementation already uses it for words. For POWER7 the
> algorithm is simpler, and the only thing we need to do is shift and check for
> null bytes after the loop.
>
There is a separate issue: the strlen code is a little inefficient. In
that loop one branch is redundant. A better implementation has the following
loop. I optimized this for asymptotic throughput; extracting the length could
be done more efficiently in particular cases.
size_t strlen (char *x)
{
  uint64_t ones = 0x0101010101010101ULL;
  uint64_t mask = 0x80 * ones;
  uint64_t extract = 0x0102040810204080ULL;
  /* handle first 16 bytes */
  char *x0 = x;
  /* align x here */
  while (1)
    {
      uint64_t m0 = ((*(uint64_t *)(x)) | mask) - ones;
      uint64_t m8 = ((*(uint64_t *)(x + 8)) | mask) - ones;
      /* One branch for both doublewords: taken only when either
         contains a zero byte.  */
      if ((m0 & m8 & mask) ^ mask)
        {
          uint64_t r0 = (m0 & mask) ^ mask;
          uint64_t r8 = (m8 & mask) ^ mask;
          /* Compress the per-byte 0x80 flags into one bit per byte.  */
          int bitmask0 = ((r0 >> 7) * extract) >> 56;
          int bitmask8 = ((r8 >> 7) * extract) >> 56;
          int first = ffs (bitmask0 | (bitmask8 << 8));
          return x - x0 + first - 1;
        }
      x += 16;
    }
}
> > Second is how slow overlapping stores are versus branch misprediction.
> > You need a benchmark that varies the sizes to check this; I could supply
> > one.
> >
> You can also add the benchmark to the existing GLIBC infrastructure. For POWER7,
> the branch predictor is quite smart, and from my experience using branch
> prediction instructions is usually not a good idea.
I am not interested in branch prediction instructions but in the cost of
mispredictions, in particular what the difference in time of the following
benchmark is when we switch the loop order with -DRAND.
#include <stdlib.h>
#include <string.h>

int main (void)
{
  int i, j;
  char src[33];
  char dest[1000];
  int sizes[1000];
  int sizes2[1000];

  memset (src, 1, 32);
  src[32] = 0;
  for (i = 0; i < 1000; i++)
    {
      sizes[i] = rand () % 32;
      sizes2[i] = rand () % 500;
    }
#ifdef RAND
  for (i = 0; i < 1000; i++)
    for (j = 0; j < 10000; j++)
#else
  for (j = 0; j < 10000; j++)
    for (i = 0; i < 1000; i++)
#endif
      strcpy (dest + sizes2[i], src + sizes[i]);
  return 0;
}