This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.
Re: [PATCH] Optimize MIPS memcpy
- From: Maxim Kuvyrkov <maxim_kuvyrkov at mentor dot com>
- To: Steve Ellcey <sellcey at mips dot com>
- Cc: Andrew T Pinski <pinskia at gmail dot com>, "Joseph S. Myers" <joseph at codesourcery dot com>, <libc-ports at sourceware dot org>
- Date: Wed, 5 Sep 2012 12:43:33 +1200
- Subject: Re: [PATCH] Optimize MIPS memcpy
- References: <5044746c.23eb440a.75e2.618f@mx.google.com> <1346771341.14333.20.camel@ubuntu-sellcey>
On 5/09/2012, at 3:09 AM, Steve Ellcey wrote:
> On Mon, 2012-09-03 at 02:12 -0700, Andrew T Pinski wrote:
>> Forgot to CC libc-ports@ .
>> On Sat, 2012-09-01 at 18:15 +1200, Maxim Kuvyrkov wrote:
>>> This patch improves the MIPS assembly implementations of memcpy. Two optimizations are added:
>>> prefetching of data for subsequent iterations of the memcpy loop, and pipelined expansion of
>>> unaligned memcpy. These optimizations speed up MIPS memcpy by about 10%.
>>>
>>> The prefetching part is straightforward: it prefetches one cache line (32 bytes) for the +1
>>> iteration in the unaligned case and for the +2 iteration in the aligned case. The rationale is
>>> that a prefetch takes about as long to acquire its data as 1 iteration of the unaligned loop or
>>> 2 iterations of the aligned loop. Values for these parameters were tuned on a modern MIPS processor.
>>>
>>
>> This might hurt Octeon, as the cache line size there is 128 bytes. Can
>> you say which modern MIPS processor this was tuned on? And is there a
>> way to avoid hard-coding 32 in the assembly and use a macro instead?
>>
>> Thanks,
>> Andrew Pinski
>
> I've been looking at the MIPS memcpy and was planning on submitting a
> new version based on the one that MIPS submitted to Android. It has
> prefetching like Maxim's, though I found that using the 'load' and
> 'prepare for store' hints instead of the 'load streaming' and 'store
> streaming' hints gave me better results on the 74K and 24K that I did
> performance testing on.
I didn't experiment with various prefetching hints, so this very well may be the case.
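
To make the prefetch discussion concrete, here is a rough C sketch of the idea, not code from either patch. The names copy_with_prefetch, CACHE_LINE_SIZE and PREFETCH_LINES_AHEAD are made up for illustration, and the defaults (32 bytes, 2 lines ahead) only mirror the numbers mentioned above. It relies on the GCC/Clang __builtin_prefetch builtin; which MIPS 'pref' hint the compiler emits for a given rw/locality pair is up to the compiler, and as far as I know the builtin has no way to ask for 'prepare for store', so that case would still need hand-written assembly.

```c
#include <stddef.h>
#include <string.h>

/* Assumed tunables for illustration only. Overriding CACHE_LINE_SIZE
   (e.g. -DCACHE_LINE_SIZE=128) is the kind of parameterization asked
   for earlier in the thread. */
#ifndef CACHE_LINE_SIZE
#define CACHE_LINE_SIZE 32
#endif
#ifndef PREFETCH_LINES_AHEAD
#define PREFETCH_LINES_AHEAD 2
#endif

/* Hypothetical helper: copy n bytes one cache line per iteration,
   prefetching the line that a later iteration will touch. */
static void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    const size_t ahead = (size_t)CACHE_LINE_SIZE * PREFETCH_LINES_AHEAD;

    while (n >= CACHE_LINE_SIZE) {
        /* Only prefetch addresses that are still inside the buffers. */
        if (n > ahead) {
            __builtin_prefetch(s + ahead, 0, 3); /* rw=0: will be read */
            __builtin_prefetch(d + ahead, 1, 3); /* rw=1: will be written */
        }
        /* Fixed-size memcpy stands in for the unrolled word loads/stores
           of the real assembly loop. */
        memcpy(d, s, CACHE_LINE_SIZE);
        d += CACHE_LINE_SIZE;
        s += CACHE_LINE_SIZE;
        n -= CACHE_LINE_SIZE;
    }
    memcpy(d, s, n); /* tail shorter than one cache line */
    return dst;
}
```

Passing locality 0 instead of 3 asks for a non-temporal (streaming style) prefetch, which is one way to compare the two families of hints without touching the loop itself.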
>
> This version has more unrolling too, and between that and the
> difference in hints I got a small performance improvement over Maxim's
> version on small memcpys and a fairly substantial improvement on large
> memcpys.
>
> I also merged the 32-bit and 64-bit versions together so we would only
> have one copy to maintain. I haven't tried building it as part of glibc
> yet; I have been testing it standalone first and was going to try to
> integrate it into glibc and submit it this week or next. I'll attach it
> to this email so folks can look at it, and I will see if I can
> parameterize the cache line size. This one also assumes a 32-byte cache
> prefetch.
>
Your version looks quite good. If you could wrap it up into a glibc patch I would test it on our setup to confirm that it indeed provides better performance.
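
For comparing candidates standalone before the glibc integration, something along these lines may be enough. This is only a sketch of a timing harness, not the setup either of us actually uses: time_copy and copy_fn are hypothetical names, and it uses the standard C clock() call, so it measures CPU time at whatever resolution the platform provides.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any function with the memcpy signature can be timed. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical helper: run fn `iters` times on a `size`-byte copy whose
   source and destination both start `offset` bytes into their buffers
   (to exercise unaligned paths). Returns the average time per call in
   nanoseconds, or -1.0 if allocation fails, iters is 0, or the copy
   produced wrong data. */
static double time_copy(copy_fn fn, size_t size, size_t offset, unsigned iters)
{
    unsigned char *src = malloc(size + offset);
    unsigned char *dst = malloc(size + offset);
    double ns = -1.0;
    clock_t c0, c1;

    if (src == NULL || dst == NULL || iters == 0)
        goto out;
    for (size_t i = 0; i < size + offset; i++)
        src[i] = (unsigned char)i;
    memset(dst, 0, size + offset);

    c0 = clock();
    for (unsigned i = 0; i < iters; i++)
        fn(dst + offset, src + offset, size);
    c1 = clock();

    /* A fast but wrong copy is not a result worth reporting. */
    if (memcmp(dst + offset, src + offset, size) != 0)
        goto out;
    ns = (double)(c1 - c0) / CLOCKS_PER_SEC * 1e9 / iters;
out:
    free(src);
    free(dst);
    return ns;
}
```

A call such as time_copy(memcpy, 4096, 1, 100000) gives a baseline to compare against the same call with a candidate implementation. An optimizing compiler may collapse repeated identical copies, so for real measurements the candidate should live in a separate translation unit or the loop should include a compiler barrier.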
Thanks,
--
Maxim Kuvyrkov
Mentor Graphics