This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Don't use SSE4_2 instructions on Intel Silvermont Micro Architecture.
- From: Dmitrieva Liubov <liubov dot dmitrieva at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: "H.J. Lu" <hjl dot tools at gmail dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Wed, 19 Jun 2013 12:17:25 +0400
- Subject: Re: [PATCH] Don't use SSE4_2 instructions on Intel Silvermont Micro Architecture.
- References: <CAHjhQ93=uegeZg9iTqoJ+PFuUrvn8e2mA8tZ96Jy4CaV6aPbWg at mail dot gmail dot com> <20130617163729 dot GA15981 at domone dot kolej dot mff dot cuni dot cz> <CAHjhQ93zmP525hqW-2RnHBREc_949XLnm7sE-CSv3Nj8PQgUig at mail dot gmail dot com> <CAMe9rOqT31AFq1S3V0Krh2CZnHu=FiyXqhg840fimRtfU4_hXQ at mail dot gmail dot com> <20130618064910 dot GA19972 at domone dot kolej dot mff dot cuni dot cz>
Moreover, SSSE3 is not good for Silvermont, and at the moment there are no
sse2 unaligned versions of strcmp and memcmp to switch to. I
think we need unaligned versions for Core i7 as well.
This is another area for optimization.
I will add a new flag, bit_Slow_SSE4_2, and switch some functions as a
short-term solution.
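
A rough sketch of the selection rule I have in mind, not the actual
init-arch code. The DEMO_* bits, the enum, and select_impl_demo are
hypothetical names used only for illustration; the assumption is that a
slow-SSE4_2 CPU falls back to sse2 only where an optimized sse2 variant
exists and otherwise keeps the SSE4_2 version.

```c
/* Hypothetical feature bits named after the flags discussed in this
   thread. They are NOT glibc's real definitions. */
enum {
    DEMO_SSE2        = 1u << 0,
    DEMO_SSE4_2      = 1u << 1,
    DEMO_SLOW_SSE4_2 = 1u << 2   /* assumed to be set on Silvermont */
};

enum demo_impl { DEMO_IMPL_GENERIC, DEMO_IMPL_SSE2, DEMO_IMPL_SSE4_2 };

/* Pick a variant for one string function.
   have_sse2_variant: nonzero if an optimized sse2 version of that
   function exists (true for strcmp, false for strcspn at the moment). */
enum demo_impl select_impl_demo(unsigned cpu, int have_sse2_variant)
{
    int sse2_ok  = have_sse2_variant && (cpu & DEMO_SSE2) != 0;
    int sse42_ok = (cpu & DEMO_SSE4_2) != 0;
    int slow     = (cpu & DEMO_SLOW_SSE4_2) != 0;

    /* Use SSE4_2 unless it is flagged slow AND there is an sse2
       version to switch to. */
    if (sse42_ok && !(slow && sse2_ok))
        return DEMO_IMPL_SSE4_2;
    if (sse2_ok)
        return DEMO_IMPL_SSE2;
    return DEMO_IMPL_GENERIC;
}
```

With this rule, strspn/strcspn/strpbrk would keep their SSE4_2 versions
on Silvermont until optimized sse2 versions exist.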
--
Liubov Dmitrieva
On Tue, Jun 18, 2013 at 10:49 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Mon, Jun 17, 2013 at 11:07:33AM -0700, H.J. Lu wrote:
>> On Mon, Jun 17, 2013 at 10:56 AM, Dmitrieva Liubov
>> <liubov.dmitrieva@gmail.com> wrote:
>> > I checked those functions.
>> > For strspn/strcspn/strpbrk, switching SSE4_2 off is bad because
>> > there are no optimized sse2 versions to call instead.
>> > The default versions there do not use SSE.
>> >
>> > So it seems we need to create a new flag for Silvermont, like
>> > "slowPcmpistri", and fix the switches in functions where optimized sse2
>> > versions exist.
>> >
>> > Or implement optimized sse2 strspn/strcspn/strpbrk and switch SSE4_2 off completely.
>> >
> I asked because these are about the only cases where I cannot get comparable
> results with SSE2. The closest I could get was to split the input into up to
> four character intervals and check these in parallel.
> This has somewhat expensive preprocessing, so I am still looking for a
> better way to do it.
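
One possible reading of the interval idea, as a scalar sketch: it
assumes "input" means the reject set, which is merged into at most four
byte ranges during preprocessing, and each string byte is then tested
against those ranges. An SSE2 version would presumably run the same
range tests on 16 bytes at once. All names here are hypothetical, and
the fallback to plain strcspn is my own assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type, not part of glibc. */
struct demo_interval { unsigned char lo, hi; };

/* Preprocessing: merge the bytes of `set` into sorted, disjoint
   intervals. Returns the interval count, or -1 if more than `max`
   intervals would be needed. */
int build_intervals_demo(const char *set, struct demo_interval *out, int max)
{
    unsigned char present[256] = {0};
    for (const unsigned char *p = (const unsigned char *)set; *p; p++)
        present[*p] = 1;

    int n = 0;
    for (int c = 1; c < 256; c++) {
        if (!present[c])
            continue;
        int start = c;
        while (c + 1 < 256 && present[c + 1])
            c++;                      /* extend the run of adjacent bytes */
        if (n == max)
            return -1;
        out[n].lo = (unsigned char)start;
        out[n].hi = (unsigned char)c;
        n++;
    }
    return n;
}

/* strcspn-like scan using up to four intervals. */
size_t cspn_intervals_demo(const char *s, const char *reject)
{
    struct demo_interval iv[4];
    int n = build_intervals_demo(reject, iv, 4);
    if (n < 0)
        return strcspn(s, reject);    /* too many intervals: fall back */

    const unsigned char *start = (const unsigned char *)s;
    const unsigned char *p = start;
    for (; *p; p++) {
        for (int i = 0; i < n; i++) {
            /* One unsigned compare tests lo <= *p <= hi. */
            unsigned char off  = (unsigned char)(*p - iv[i].lo);
            unsigned char span = (unsigned char)(iv[i].hi - iv[i].lo);
            if (off <= span)
                return (size_t)(p - start);
        }
    }
    return (size_t)(p - start);
}
```

The preprocessing cost is the 256-entry table plus the merge pass, which
is the expensive part for short strings.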
>>
>> We can add bit_Prefer_SSE2_for_stringop. When it is set, we
>> will use SSE2 version if it is available. Otherwise, we use
>> SSE4_2 version if it is available.
>>
>>
> As a short-term solution I would prefer bit_Slow_SSE4_2.
>
> As a long-term solution, I have optimized implementations of other
> functions that do not use SSE4_2 and are faster.
>
>
>
> When I ran `git grep "cmp[ie]str[ie]"` I got
>
> sysdeps/i386/i686/multiarch/strcmp-sse4.S
> sysdeps/x86_64/multiarch/strcmp-sse42.S
>
> I have several ideas but have not gotten to them yet. It has low priority, as
> the hot case is when strings differ within the first 16 characters (for
> example, when you are sorting).
>
>
> sysdeps/x86_64/multiarch/rawmemchr.S
>
> Not our case, as it needs bit_SSE4_2 and not bit_Prefer_PMINUB_for_stringop.
>
> This is false on all Intel processors. Most AMD processors are
> misclassified because we do not set anything at all. They have slower
> SSE4_2, which causes a performance regression.
>
>
> sysdeps/x86_64/multiarch/strchr.S
> sysdeps/x86_64/multiarch/strrchr.S
>
> I have an implementation with faster asymptotic time, but I have not
> tuned it for small cases.
>
>
> sysdeps/x86_64/multiarch/strend-sse4.S
>
> It is a bit weird that we have this. You could definitely improve
> performance by calling strlen and adjusting the return value.
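
A minimal sketch of that suggestion, assuming strend is meant to return
a pointer to the terminating null byte. The name strend_via_strlen_demo
is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Return a pointer to the terminating '\0' of s by reusing strlen,
   so no separate SSE4_2 routine is needed. */
const char *strend_via_strlen_demo(const char *s)
{
    return s + strlen(s);
}
```

Any tuning done for strlen would then carry over to this function.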
>
> sysdeps/x86_64/multiarch/strstr.c
>
> I have a better implementation; I decided to wait for 2.19.