This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: More backref performance - no strncmp


> I disagree. If it is so, then i386 memcmp should start with a length check
> and below certain threshold use different method with smaller startup
> overhead.

i386 memcmp is a cmpsb.  Wonderful on an i586 (8+4*n cycles) or i686 (12+9*n
uops) processor :-/  No surprise that my_memcmp, especially if compiled with
a recent gcc that knows what's happening inside the processor, is faster.

generic memcmp actually does a lot of attempts to optimize.  But on
little-endian machines the magical... ehm, highly tuned algorithm that it
sets up boils down to a manually scheduled but not-even-very-unrolled loop
:-(

#ifdef WORDS_BIGENDIAN
# define CMP_LT_OR_GT(a, b) ((a) > (b) ? 1 : -1)
#else
# define CMP_LT_OR_GT(a, b) memcmp_bytes ((a), (b))
#endif

...

      a0 = ((op_t *) srcp1)[0];
      b0 = ((op_t *) srcp2)[0];
      ...
      if (a0 != b0)
        return CMP_LT_OR_GT (a1, b1);

I don't have a compiler at hand, but I suspect that the code that this
produces is awful to say the least.  You are fetching a word, storing it on
the stack, and comparing it byte by byte with a loop (memcmp_bytes) that is
not even unrolled!!!

I hold high respect for Torbjorn, but IMO this code is crap :-(  memcmp
should first of all be rewritten to work on op_t* values instead of op_t
(dropping the local variables: in little-endian code they are stupid, in
big-endian code they would be recreated as compiler-generated temporaries).

> If the overhead you're seeing is because of the call, then IMHO it should
> be enough to check one byte inline, with __builtin_expect.

Not because of the call (a non-inlined my_memcmp does not show significantly
different timings), but because of overheads in memcmp.

Assume a "wonderful" memcmp without the problems in i386 and generic
memcmp's that I highlighted above.  Then even for large lengths, the two
areas we are comparing are inside a string, and the pointers move a byte at
a time, so we only get benefits rarely from alignment (1/16 to have both
strings aligned, 1/4 to have one of them).

I see why you're skeptical, but trust the numbers.  my_memcmp is 5% faster
than strncmp (wrong); memcmp is 15% slower.

Paolo



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]