This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.



Re: [ARM] Optimised strchr and strlen


On 21 December 2011 21:20, Richard Henderson <rth@twiddle.net> wrote:
> On 12/21/2011 02:55 AM, David Gilbert wrote:
>> That 'simple' one is showing the benefit at the short lengths,
>> the 'smarter' one I have is doing 8 bytes/loop and is nice on the long
>> strings - but as you can see worse at the short ones.
>
> Having not seen your "smarter" strchr, it's hard to suggest anything
> concrete.  I'd have thought that there's enough slack in load delay
> that one or two arithmetic operations could be done without penalty...

Sure; it's pretty much the same trick as my strlen routine.

> Something like performing a simple compare loop looking for "alignment plus":

<snip>

> or even just
>
>        and     r3, r0, #7
>        and     r1, r1, #255
>        rsb     r3, r3, #32
> 1:
>        ldrb    r2, [r0],#1
>        cmp     r2, r1
>        beq     .Lfound
>        subs    r3, r3, #1
>        cbz     r2, .Lfound_zero
>        bne     1b
>        @ Here, r0 is aligned.  Do something word-based.

OK, so I gave that a go - and the results are:

https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrchr?action=AttachFile&do=get&target=strchr-fiddle20111223-rth-strchr-abs.png

The line labelled 'fiddle' on that graph is the one where I tried that
code. I can't explain the really grim dip at 16 bytes; I checked with a
debugger that it is going around the loop up to 32 times. However, even
if we ignore the dip at 16, it's still not a nice result: it's worse than
the simple routine I posted for strings up to 64 bytes. One guess is that
a loop with 3 branches in it might be too much for the branch predictor
(I remember there are restrictions on how close together they can go).
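
For readers who prefer C to assembly, the quoted loop can be sketched
roughly as below. This is only an illustration of its control flow, not
the code that was benchmarked; the names strchr_prologue and
prologue_result are made up for this sketch. It shows the three exits
per byte (match, NUL, budget exhausted) that the branch-predictor guess
above refers to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, used only in this sketch. */
enum prologue_result {
    PROLOGUE_FOUND,   /* *pos points at the matching byte            */
    PROLOGUE_END,     /* hit the terminating NUL first, *pos is NULL */
    PROLOGUE_ALIGNED  /* budget used up, *pos is 8-byte aligned      */
};

static enum prologue_result
strchr_prologue(const char *s, int c, const char **pos)
{
    const unsigned char *p = (const unsigned char *)s;
    unsigned char ch = (unsigned char)c;              /* and r1, r1, #255 */
    /* and r3, r0, #7 ; rsb r3, r3, #32 : between 25 and 32 bytes */
    unsigned budget = 32u - (unsigned)((uintptr_t)s & 7u);

    assert(s != NULL);
    do {
        unsigned char b = *p++;                       /* ldrb r2, [r0],#1 */
        if (b == ch) {                                /* beq .Lfound      */
            *pos = (const char *)(p - 1);
            return PROLOGUE_FOUND;
        }
        if (b == 0) {                                 /* cbz .Lfound_zero */
            *pos = NULL;
            return PROLOGUE_END;
        }
    } while (--budget != 0);                          /* subs ; bne 1b    */

    *pos = (const char *)p;   /* aligned: a word-based loop would start here */
    return PROLOGUE_ALIGNED;
}
```

Because the budget is 32 minus the misalignment, an aligned string goes
around the loop exactly 32 times before the word-based code would take
over, which matches the debugger observation above.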

I might be able to make that a bit better for the larger cases by using
the clz trick you suggested for strlen in the end-of-fastloop case, and
I'll give that a go, but it's not going to help at the bottom.
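
As a rough illustration of that kind of end-of-fastloop step (not the
actual routine, which isn't shown in this message), the sketch below
locates the first byte in a 4-byte group that is either NUL or equal to
the search character. It uses the well-known zero-byte mask
(w - 0x01010101) & ~w & 0x80808080 and a count-trailing-zeros on a word
packed least-significant-byte first; on ARM the same idea would
typically be done with a byte reverse followed by clz. All function
names here are invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Pack 4 bytes with p[0] in the low byte, independent of host byte order. */
static uint32_t pack_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Sets bit 7 of each zero byte of w.  Bytes above the first zero byte may
   be flagged spuriously, but the lowest flagged byte is always a real zero. */
static uint32_t zero_byte_mask(uint32_t w)
{
    return (w - 0x01010101u) & ~w & 0x80808080u;
}

static unsigned count_trailing_zeros(uint32_t x)
{
    assert(x != 0);
#if defined(__GNUC__)
    return (unsigned)__builtin_ctz(x);   /* GCC/Clang builtin */
#else
    unsigned n = 0;
    while ((x & 1u) == 0) { x >>= 1; n++; }
    return n;
#endif
}

/* Index (0..3) of the first byte equal to c or to NUL, or 4 if none. */
static unsigned first_hit_index(const unsigned char *p, unsigned char c)
{
    uint32_t w = pack_le(p);
    /* XOR with c replicated into every byte turns matches into zero bytes. */
    uint32_t mask = zero_byte_mask(w) | zero_byte_mask(w ^ (0x01010101u * c));
    return mask ? count_trailing_zeros(mask) / 8u : 4u;
}
```

The point of the trick is that the bit scan replaces a byte-by-byte
search after the fast loop exits, so it only pays off once the fast
loop is actually reached, i.e. on the longer strings.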

(OT: Does anyone know how to get R to use a log2 X graph rather than log 10?)

Here are the raw numbers, which may be easier to read than the graphs
(note that the strings tested are aligned):

smarter_strchr_armv7fiddle: ,6553600, loops of ,64, bytes=400.000000 MB, transferred in ,1395904541.000000 ns, giving, 286.552546 MB/s
smarter_strchr_armv7fiddle: ,6553600, loops of ,62, bytes=387.500000 MB, transferred in ,1258880615.000000 ns, giving, 307.813144 MB/s
smarter_strchr_armv7fiddle: ,13107200, loops of ,32, bytes=400.000000 MB, transferred in ,1813507080.000000 ns, giving, 220.567101 MB/s
smarter_strchr_armv7fiddle: ,26214400, loops of ,16, bytes=400.000000 MB, transferred in ,2922332764.000000 ns, giving, 136.876951 MB/s
smarter_strchr_armv7fiddle: ,52428800, loops of ,8, bytes=400.000000 MB, transferred in ,1904724122.000000 ns, giving, 210.004166 MB/s
smarter_strchr_armv7fiddle: ,104857600, loops of ,4, bytes=400.000000 MB, transferred in ,2191650391.000000 ns, giving, 182.510861 MB/s
smarter_strchr_armv7fiddle: ,209715200, loops of ,2, bytes=400.000000 MB, transferred in ,3548339844.000000 ns, giving, 112.728774 MB/s

smarter_strchr_armv7: ,6553600, loops of ,64, bytes=400.000000 MB, transferred in ,726409912.000000 ns, giving, 550.653279 MB/s
smarter_strchr_armv7: ,6553600, loops of ,62, bytes=387.500000 MB, transferred in ,743652344.000000 ns, giving, 521.076822 MB/s
smarter_strchr_armv7: ,13107200, loops of ,32, bytes=400.000000 MB, transferred in ,874023437.000000 ns, giving, 457.653632 MB/s
smarter_strchr_armv7: ,26214400, loops of ,16, bytes=400.000000 MB, transferred in ,1160278321.000000 ns, giving, 344.744871 MB/s
smarter_strchr_armv7: ,52428800, loops of ,8, bytes=400.000000 MB, transferred in ,1617584229.000000 ns, giving, 247.282332 MB/s
smarter_strchr_armv7: ,104857600, loops of ,4, bytes=400.000000 MB, transferred in ,3235260010.000000 ns, giving, 123.637667 MB/s
smarter_strchr_armv7: ,209715200, loops of ,2, bytes=400.000000 MB, transferred in ,7044738769.000000 ns, giving, 56.779962 MB/s

simple_strchr: ,6553600, loops of ,64, bytes=400.000000 MB, transferred in ,1298034668.000000 ns, giving, 308.158179 MB/s
simple_strchr: ,6553600, loops of ,62, bytes=387.500000 MB, transferred in ,1265472413.000000 ns, giving, 306.209757 MB/s
simple_strchr: ,13107200, loops of ,32, bytes=400.000000 MB, transferred in ,1447998047.000000 ns, giving, 276.243467 MB/s
simple_strchr: ,26214400, loops of ,16, bytes=400.000000 MB, transferred in ,1245819092.000000 ns, giving, 321.073904 MB/s
simple_strchr: ,52428800, loops of ,8, bytes=400.000000 MB, transferred in ,1435119629.000000 ns, giving, 278.722409 MB/s
simple_strchr: ,104857600, loops of ,4, bytes=400.000000 MB, transferred in ,1878479004.000000 ns, giving, 212.938233 MB/s
simple_strchr: ,209715200, loops of ,2, bytes=400.000000 MB, transferred in ,3339569092.000000 ns, giving, 119.775932 MB/s
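
For anyone re-deriving the MB/s column: judging by the arithmetic,
"MB" here is 2^20 bytes (6553600 loops of 64 bytes is exactly 400 of
them), and the rate is that total divided by the elapsed time in
seconds. A minimal sketch of the calculation, with a function name
chosen just for this example (the benchmark harness itself isn't shown
in this message):

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in MB/s, where 1 MB = 2^20 bytes, from the three figures
   printed on each line: number of calls, string length, elapsed ns. */
static double throughput_mb_per_s(uint64_t calls, uint64_t bytes_per_call,
                                  double elapsed_ns)
{
    double total_mb;

    assert(elapsed_ns > 0.0);
    total_mb = (double)(calls * bytes_per_call) / (1024.0 * 1024.0);
    return total_mb / (elapsed_ns / 1e9);
}
```

Feeding it the first 'fiddle' line (6553600 calls, 64 bytes,
1395904541 ns) gives roughly 286.55 MB/s, in line with the figure
printed above.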

Dave

