This is the mail archive of the
libc-alpha@sources.redhat.com
mailing list for the glibc project.
Re: More backref performance - no strncmp
> I disagree. If it is so, then i386 memcmp should start with a length check
> and below certain threshold use different method with smaller startup
> overhead.
i386 memcmp is a cmpsb. Wonderful on an i586 (8+4*n cycles) or i686 (12+9*n
uops) processor :-/ No surprise that my_memcmp, especially if compiled with
a recent gcc that knows what's happening inside the processor, is faster.
generic memcmp actually does a lot of attempts to optimize. But on
little-endian machines the magical... ehm, highly tuned algorithm that it
sets up boils down to a manually scheduled but not-even-very-unrolled loop
:-(
#ifdef WORDS_BIGENDIAN
# define CMP_LT_OR_GT(a, b) ((a) > (b) ? 1 : -1)
#else
# define CMP_LT_OR_GT(a, b) memcmp_bytes ((a), (b))
#endif
...
a0 = ((op_t *) srcp1)[0];
b0 = ((op_t *) srcp2)[0];
...
if (a0 != b0)
return CMP_LT_OR_GT (a1, b1);
I don't have a compiler at hand, but I suspect that the code that this
produces is awful to say the least. You are fetching a word, storing it on
the stack, and comparing it byte by byte with a loop (memcmp_bytes) that is
not even unrolled!!!
I hold high respect for Torbjorn, but IMO this code is crap :-( memcmp
should first of all be rewritten to work on op_t* values instead of op_t
(dropping the local variables: in little-endian code they are stupid, in
big-endian code they would be recreated as compiler-generated temporaries).
> If the overhead you're seeing is because of the call, then IMHO it should
> be enough to check one byte inline, with __builtin_expect.
Not because of the call (a non-inlined my_memcmp does not show significantly
different timings), but because of overheads in memcmp.
Assume a "wonderful" memcmp without the problems in i386 and generic
memcmp's that I highlighted above. Then even for large lengths, the two
areas we are comparing are inside a string, and the pointers move a byte at
a time, so we only get benefits rarely from alignment (1/16 to have both
strings aligned, 1/4 to have one of them).
I see why you're skeptical, but trust the numbers. my_memcmp is 5% faster
than strncmp (wrong); memcmp is 15% slower.
Paolo