This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
On Fri, Nov 01, 2013 at 01:44:17PM -0700, Paul Eggert wrote:
> On 11/01/2013 10:58 AM, Ondřej Bílka wrote:
>
> > I got similar slowdown on core2, nehalem and fx10 machines.
>
> Conversely, I saw a 2x speedup on my platform, an AMD
> Deneb (Phenom II X4 910e):
>
> $ gcc -O2 assembly.c && time ./a.out
>
> real 0m2.096s
> user 0m2.095s
> sys 0m0.001s
> $ gcc -O2 branchfree.c && time ./a.out
>
> real 0m1.057s
> user 0m1.054s
> sys 0m0.002s
>
Weird, as I cannot reproduce these numbers on an Athlon X2 or a Phenom X6.
As one iteration takes 2.096 * 2600 / 340 = 16 cycles, the slowdown is
8 cycles, which is hard to explain. I attached the binaries that were used
for testing (gcc version 4.4.5 (Debian 4.4.5-8)).

Of the possible causes, all seem unlikely. For frequency switching, the
binary runs too long. A branch is slower when it jumps close to the end of
a 16-byte boundary, so I tried moving the loop byte by byte, but
performance stayed roughly the same. What is left is a branch cache
conflict.

> > As code size is concerned my assembly has 8 extra bytes
> > (jump 2, xor 3, neg 3). When I use sbb trick from article
> > I could decrease that to 5.
>
> 5 bytes more than what we're doing now,
> or 5 bytes more than the branchfree version?
> I'm worried about code bloat compared to
> what we're doing now.

No, the overhead is versus the version with no checking.
Attachment: assembly
Description: Binary data

Attachment: branchfree
Description: Binary data