This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [RFC] Improving memcpy and memset.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Liubov Dmitrieva <liubov dot dmitrieva at gmail dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Mon, 12 Aug 2013 22:18:48 +0200
- Subject: Re: [RFC] Improving memcpy and memset.
- References: <20130809210231 dot GA10077 at domone dot kolej dot mff dot cuni dot cz> <CAHjhQ93-6Yp6KmUdWPpSN2h_Ed3EjE-qYiywwj1fJgQj0h7DoA at mail dot gmail dot com>
Hi,
I tinkered with rep stosq/movsq, which could improve performance on older
processors. It took me a while to realize that the big constant factor for
__builtin_memcpy/memset was caused by poor code generation rather than
instruction startup cost.
With an effective header I could get around a 5% speedup on older machines.
As i7* handles unaligned loads well, my implementation is fastest there.
A rep implementation is best up to a certain size, where the vector loop
takes over. I switch at 512 bytes for now; a bit more can be squeezed out by
finding architecture-specific optimums.
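To make the size switch concrete, here is a minimal sketch in C. It is not
the code from the profiler tarballs. The names sketch_memcpy, copy_rep,
copy_blocks and REP_THRESHOLD are made up for illustration, the 16-byte
block loop only stands in for a real vector loop, and the direction of the
switch (rep below 512 bytes, block loop from 512 bytes up) is one reading of
the paragraph above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical switch point, taken from the 512 bytes mentioned above. */
#define REP_THRESHOLD 512

/* Copy with rep movsq on x86-64 (GCC/Clang inline asm); on other targets
   the whole copy falls through to the byte loop. */
static void copy_rep(unsigned char *dst, const unsigned char *src, size_t n)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    size_t qwords = n / 8;
    /* rep movsq advances rdi/rsi and counts rcx down to zero. */
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords)
                     :
                     : "memory");
    n %= 8;
#endif
    while (n--)            /* remaining tail bytes */
        *dst++ = *src++;
}

/* Stand-in for a vector loop: 16-byte blocks through a temporary, which
   compilers typically turn into unaligned vector loads and stores. */
static void copy_blocks(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n >= 16) {
        unsigned char tmp[16];
        memcpy(tmp, src, 16);
        memcpy(dst, tmp, 16);
        dst += 16;
        src += 16;
        n -= 16;
    }
    while (n--)            /* remaining tail bytes */
        *dst++ = *src++;
}

/* Dispatch on size; like memcpy, the buffers must not overlap. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    if (n < REP_THRESHOLD)
        copy_rep(dst, src, n);
    else
        copy_blocks(dst, src, n);
    return dst;
}
```

An architecture-specific variant would only change REP_THRESHOLD, or always
take the rep path where profiling favours it.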
One exception is Silvermont, where profiling showed that rep movsq is the
best course of action from 512 bytes up. It is probably always best when used
with the new header.
There is one thing that I do not understand: core2 now seems to pick
unlikely branches as the predicted ones, which leads to bad performance
in the random test.
I decided to split the variants into two classes; the loop improvements are
found here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_loop.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_loop.html
And here is the updated version of the profiler:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile120813.tar.bz2
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile120813.tar.bz2
http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_loop120813.tar.bz2
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_loop120813.tar.bz2
One thing left is to add an ssse3 implementation for bigger sizes. The size
needs to be big to pay for the overhead caused by the computed jump.
The second is playing with unaligned loops. I added memcpy_cntloop, which has
an additional counter that is decremented by 1 in each iteration. It
is the best that I have for i7's so far, as it is predicted up to 512 bytes.
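The counted-loop idea can be sketched in C as follows. This is only an
illustration and not the memcpy_cntloop from the tarballs: the name
cntloop_copy_sketch and the 16-byte step are assumptions, and the memcpy
through a temporary stands in for unaligned vector loads and stores.

```c
#include <stddef.h>
#include <string.h>

/* Copy n bytes; the main loop is driven by an iteration counter that is
   decremented by 1 per iteration, not by comparing pointers against an
   end address. Like memcpy, the buffers must not overlap. */
void *cntloop_copy_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t iters = n / 16;          /* the additional counter */

    while (iters != 0) {
        unsigned char tmp[16];
        memcpy(tmp, s, 16);         /* stand-in for an unaligned load  */
        memcpy(d, tmp, 16);         /* stand-in for an unaligned store */
        s += 16;
        d += 16;
        iters -= 1;
    }

    /* Copy the remaining n % 16 tail bytes. */
    for (size_t i = 0; i < n % 16; i++)
        d[i] = s[i];

    return dst;
}
```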
Comments?
On Mon, Aug 12, 2013 at 02:40:01PM +0400, Liubov Dmitrieva wrote:
That should be fixed now.
--
Traceroute says that there is a routing problem in the backbone. It's not our problem.