coreutils-6.11-1 in release-2 area

Eric Blake
Fri May 23 13:06:00 GMT 2008

According to Christopher Faylor on 5/22/2008 8:44 PM:
| I guess we'll see if the effects of a subroutine call and probable
| non-cache locality are going to outweigh any performance improvement.

The way I see it, there are two orthogonal patches.  One is my patch, which
fixes a (performance) bug for anyone who happens to invoke the library
version of strchr (guaranteed if you use a function pointer, but also
possible if you #undef strchr or use -fno-builtin, or when
__builtin_strchr can't optimize at compile time).  This is strictly an
improvement - you already had the penalty of calling the library, but now
the library is faster.  The size of the improvement may vary across chips
(AMD vs. Intel, and processor generation), but since the improvement is
algorithmic in nature (executing fewer branches and fewer overall assembly
instructions) rather than a tweak of the assembly to favor the pipelining
of one particular chip, all of the chips should see some benefit.
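
As a small illustration of the function-pointer case mentioned above, here
is a sketch in C.  The names strchr_fn, offset_via_pointer and
offset_via_library are hypothetical and exist only for this example.
Because the call goes through a pointer, the compiler generally cannot
substitute __builtin_strchr (unless it manages to see through the pointer)
and so ends up calling the out-of-line library routine:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical typedef for illustration: a pointer to a strchr-like
   function. */
typedef char *(*strchr_fn)(const char *, int);

/* Calls through the pointer, so the compiler generally has to emit a real
   call to whatever function `find` points at.  Returns the offset of the
   first match, or (size_t)-1 if c does not occur in s. */
size_t offset_via_pointer(strchr_fn find, const char *s, int c)
{
    const char *hit = find(s, c);
    return hit ? (size_t)(hit - s) : (size_t)-1;
}

/* Taking the address of strchr forces the library's out-of-line version to
   be the one that is used. */
size_t offset_via_library(const char *s, int c)
{
    return offset_via_pointer(strchr, s, c);
}
```

Building ordinary direct calls with -fno-builtin should have a similar
effect, since the compiler then stops expanding strchr inline.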

The other patch is yours, which quits using an inlined assembly version and
instead calls the library when compiling cygwin1.dll.  You could have made
your patch prior to mine to measure the effects of function call overhead and
non-cache locality.  And you can still reinstate your patch to override
the library version.

What makes the argument more interesting is realizing that your inline
assembly operated a byte at a time, and my timing numbers show that if the
library function is implemented as one byte at a time, it is slower than
if it is implemented a word at a time.  Additionally, your inline version
does two tests per byte on strchr(p,0), whereas the library now only does
one.  My timing numbers did not measure the penalty of a function call vs.
an inline function.  What I suspect is that for short strings, an inline
byte-at-a-time implementation will always beat a word-at-a-time library
function.  The question for deciding what speedup or penalty we will see
is thus what percentage of strings passed through strchr fall below the
short string threshold.  I don't have any intuitive feel for where the
'short string' threshold falls; it would take actual timing tests.  It
probably also differs between chips.  I also don't have any feel for what
percentage of uses would fall under the threshold, to know whether we
would see the greater overall speedup for using the inline version even
though it makes the longer strings slower.
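
To make the byte-at-a-time vs. word-at-a-time comparison concrete, here is
a rough C sketch.  It is not the newlib assembly or the cygwin inline
version under discussion; the function names are hypothetical, a 32-bit
word is assumed, and the word loop relies on the common assumption that an
aligned word load which runs a few bytes past the terminator is harmless
(it stays within one page).  time_search is a hypothetical harness for
hunting the short-string threshold:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Byte at a time: two tests per byte (match, then terminator).  For
   c == 0 both tests check the same thing. */
char *strchr_bytewise(const char *s, int c)
{
    const char ch = (char)c;
    for (;; s++) {
        if (*s == ch)
            return (char *)s;
        if (*s == '\0')
            return NULL;
    }
}

/* Nonzero if and only if some byte of w is zero. */
static int has_zero_byte(uint32_t w)
{
    return ((w - UINT32_C(0x01010101)) & ~w & UINT32_C(0x80808080)) != 0;
}

/* Word at a time: skip four bytes per iteration until a word contains
   either the terminator or the wanted byte, then finish bytewise. */
char *strchr_wordwise(const char *s, int c)
{
    const unsigned char ch = (unsigned char)c;
    const uint32_t pattern = UINT32_C(0x01010101) * ch; /* ch in every byte */
    uint32_t w;

    /* Advance bytewise until s is word aligned. */
    while ((uintptr_t)s % sizeof w != 0) {
        if ((unsigned char)*s == ch)
            return (char *)s;
        if (*s == '\0')
            return NULL;
        s++;
    }
    /* ASSUMPTION: reading a whole aligned word may touch bytes past the
       terminator; this is safe on typical hardware but not strictly
       portable C. */
    for (;;) {
        memcpy(&w, s, sizeof w);
        if (has_zero_byte(w) || has_zero_byte(w ^ pattern))
            break;
        s += sizeof w;
    }
    /* The interesting byte is in this word; locate it bytewise. */
    for (;; s++) {
        if ((unsigned char)*s == ch)
            return (char *)s;
        if (*s == '\0')
            return NULL;
    }
}

/* Hypothetical harness: CPU seconds spent on `reps` searches of s for c.
   The call goes through a pointer, so call overhead is included. */
double time_search(char *(*find)(const char *, int),
                   const char *s, int c, long reps)
{
    char *volatile sink = NULL;   /* keeps the calls from being dropped */
    clock_t start = clock();
    for (long i = 0; i < reps; i++)
        sink = find(s, c);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Running time_search over a range of string lengths (say 1 to 64 bytes) for
each implementation is one way to estimate where the crossover sits on a
given chip.  Measuring the inline case would need a separate loop with a
direct, inlinable call.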

| I'm still not thrilled about tweaking these functions given the "It's
| 2008 and someone has already done this so why reinvent it again?"
| concept.

I'm looking at it not as writing a new implementation, so much as fixing
the (performance) bug in the existing implementation.  Again, my changes
don't affect whether or not code calls the library strchr, only that if
the library version is called it will run faster.  And there are probably
still patches that can speed up the library assembly, such as swapping to
a no-frame-pointer style of call.  Additionally, there are probably some
tweaks that would benefit i586 and newer (and certainly tweaks if you can
assume MMX or other vector operations), but newlib is only currently tuned
to the i386 common denominator (you would have to create libc/machine/i586
and get cygwin to compile against a new set of machine descriptions to use
instructions added in Pentium).

--
Don't work too hard, make some time for fun as well!

Eric Blake   
