This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7


On 18-09-2013 17:40, Richard Henderson wrote:
> Hmm.  That's register clobbering there.  Gcc 4.7.2 generated
>
>     10000654:   7d 08 00 74     cntlzd  r8,r8
>     10000658:   79 08 e8 c2     rldicl  r8,r8,61,3
>     1000065c:   38 e8 ff f9     addi    r7,r8,-7
>     10000660:   7c ca 38 2a     ldx     r6,r10,r7
>     10000664:   7c de 39 2a     stdx    r6,r30,r7
>
> Ah, wrong constraints on my asm, that just so happened to work here.  Change
> all "=r" to "=&r" so that ralt et al does not overlap rsrc.
>
>
>
> r~

Thanks for the review. I have checked your suggestion with the modification
below on top of my patch. We still need the load/compare/store sequence to
avoid an unaligned access to the first doubleword.

diff --git a/sysdeps/powerpc/powerpc64/power7/stpcpy.S b/sysdeps/powerpc/powerpc64/power7/stpcpy.S
index 65ff6a0..116e8ee 100644
--- a/sysdeps/powerpc/powerpc64/power7/stpcpy.S
+++ b/sysdeps/powerpc/powerpc64/power7/stpcpy.S
@@ -41,14 +41,18 @@ EALIGN (__stpcpy, 4, 0)
        li      rMASK, 0
        addi    rRTN, rRTN, -8
        ld      rWORD, 0(rSRC)
-       b       L(g2)
+       cmpb    rTMP, rWORD, rMASK
+       cmpdi   rTMP, 0
+       beq     L(g0)
+       mr      rALT, rWORD
+       b       L(g1)
 
        .align 4
 L(g0): ldu     rALT, 8(rSRC)
        stdu    rWORD, 8(rRTN)
        cmpb    rTMP, rALT, rMASK
        cmpdi   rTMP, 0
-       bne     L(g1)
+       bne     L(test)
        ldu     rWORD, 8(rSRC)
        stdu    rALT, 8(rRTN)
 L(g2): cmpb    rTMP, rWORD, rMASK
@@ -56,6 +60,16 @@ L(g2):       cmpb    rTMP, rWORD, rMASK
        beq     L(g0)
 
        mr      rALT, rWORD
+L(test):
+       addi    rRTN, rRTN, 8
+       cntlzd  rMASK, rTMP       /* Extract bit offset of null byte.  */
+       srdi    rMASK, rMASK, 3   /* Convert bit offset to byte offset.  */
+       addi    rALT, rMASK, -7   /* Include the previous 7 bytes + nul.  */
+       ldx     rTMP, rSRC, rALT  /* Perform one last unaligned copy.  */
+       stdx    rTMP, rRTN, rALT
+       add     rRTN, rRTN, rMASK /* Adjust the return value.  */
+       blr
+

The results are in the attached file (I used the stpcpy benchtest). As you can
see, my initial patch still shows slightly better latency.


Attachment: bench-stpcpy-patch.out
Description: Text document

