This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: More backref performance - no strncmp

From: "Paolo Bonzini" <paolo dot bonzini at polimi dot it>
To: <libc-alpha at sources dot redhat dot com>
Date: Thu, 24 Oct 2002 12:02:05 +0200
Subject: Re: More backref performance - no strncmp
References: <006a01c27a79$dd47fff0$b71a1897@bonz> <20021023152659.GA10019@nevyn.them.org> <00c401c27b36$d9ee03a0$4706af83@rett.polimi.it> <20021024110308.H3451@sunsite.ms.mff.cuni.cz> <00d801c27b3e$911ed160$4706af83@rett.polimi.it> <20021024053302.C5659@devserv.devel.redhat.com>
Reply-to: "Paolo Bonzini" <bonzini at gnu dot org>

> I disagree. If it is so, then i386 memcmp should start with a length check
> and below certain threshold use different method with smaller startup
> overhead.

i386 memcmp is a cmpsb.  Wonderful on an i586 (8+4*n cycles) or i686 (12+9*n
uops) processor :-/  No surprise that my_memcmp, especially if compiled with
a recent gcc that knows what's happening inside the processor, is faster.

generic memcmp actually does a lot of attempts to optimize.  But on
little-endian machines the magical... ehm, highly tuned algorithm that it
sets up boils down to a manually scheduled but not-even-very-unrolled loop
:-(

#ifdef WORDS_BIGENDIAN
# define CMP_LT_OR_GT(a, b) ((a) > (b) ? 1 : -1)
#else
# define CMP_LT_OR_GT(a, b) memcmp_bytes ((a), (b))
#endif

...

      a0 = ((op_t *) srcp1)[0];
      b0 = ((op_t *) srcp2)[0];
      ...
      if (a0 != b0)
        return CMP_LT_OR_GT (a1, b1);

I don't have a compiler at hand, but I suspect that the code that this
produces is awful to say the least.  You are fetching a word, storing it on
the stack, and comparing it byte by byte with a loop (memcmp_bytes) that is
not even unrolled!!!

I hold high respect for Torbjorn, but IMO this code is crap :-(  memcmp
should first of all be rewritten to work on op_t* values instead of op_t
(dropping the local variables: in little-endian code they are stupid, in
big-endian code they would be recreated as compiler-generated temporaries).

> If the overhead you're seeing is because of the call, then IMHO it should
> be enough to check one byte inline, with __builtin_expect.

Not because of the call (a non-inlined my_memcmp does not show significantly
different timings), but because of overheads in memcmp.

Assume a "wonderful" memcmp without the problems in i386 and generic
memcmp's that I highlighted above.  Then even for large lengths, the two
areas we are comparing are inside a string, and the pointers move a byte at
a time, so we only get benefits rarely from alignment (1/16 to have both
strings aligned, 1/4 to have one of them).

I see why you're skeptical, but trust the numbers.  my_memcmp is 5% faster
than strncmp (wrong); memcmp is 15% slower.

Paolo

References:
- More backref performance - no strncmp
  - From: Bonzini
- Re: More backref performance - no strncmp
  - From: Daniel Jacobowitz
- Re: More backref performance - no strncmp
  - From: Paolo Bonzini
- Re: More backref performance - no strncmp
  - From: Jakub Jelinek

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]