This is the mail archive of the
newlib@sourceware.org
mailing list for the newlib project.
RE: Optimized arm string routines
On Fri, 2009-01-23 at 09:12 +0100, Schwarz, Konrad wrote:
> I'm sorry, I got confused. Eric Blake contributed a sophisticated
> strstr() using the Two Way algorithm, see
> http://sourceware.org/ml/newlib/2008/msg00029.html. I was wondering how
> this coexists with machine-specific implementations.
>
No problem. I don't think I'm going to try and duplicate that in
assembler -- far too complex :-)
R.
> Regards,
>
> Konrad Schwarz
>
> > -----Original Message-----
> > From: Richard Earnshaw [mailto:rearnsha@arm.com]
> > Sent: Thursday, January 22, 2009 19:35
> > To: Jeff Johnston
> > Cc: Schwarz, Konrad; newlib@sourceware.org
> > Subject: Re: Optimized arm string routines
> >
> > On Thu, 2009-01-22 at 13:05 -0500, Jeff Johnston wrote:
> > > Richard Earnshaw wrote:
> > > > On Thu, 2009-01-22 at 08:59 +0100, Schwarz, Konrad wrote:
> > > >
> > > >>> Subject: Re: Optimized arm string routines
> > > >>>
> > > >> How do these fit in with the optimized strcmp() provided
> > by Eric Blake?
> > > >>
> > > >>
> > > >
> > > > I've not seen that. Do you have a link?
> > > >
> > > > R.
> > > >
> > > >
> > > I believe Konrad is referring to the work Eric did for handling
> > > unaligned data to avoid performance penalties. A similar
> > optimization
> > > might make sense in the ASM version.
> > >
> >
> > The strcmp I posted handles all cases (except where termination occurs
> > very early on during the alignment phase) by using word loads (and
> > shifts if needed).
> >
> > > Perhaps if you were to publish some performance numbers (aligned,
> > > unaliged data) of the new code vs the generic, that would probably
> > > help. Do you happen to have any of these numbers handy?
> > >
> >
> > Not at the moment. I'll see if I can find time to knock out the
> > numbers. However, looking at the strcmp code from the C variant when
> > compiled for arm we have:
> >
> > 4c: e3530000 cmp r3, #0 ; 0x0
> > 50: 1a00001e bne d0 <strcmp+0xd0>
> > 54: e5b02004 ldr r2, [r0, #4]!
> > 58: e28234ff add r3, r2, #-16777216 ; 0xff000000
> > 5c: e2433801 sub r3, r3, #65536 ; 0x10000
> > 60: e2433c01 sub r3, r3, #256 ; 0x100
> > 64: e2433001 sub r3, r3, #1 ; 0x1
> > 68: e1c33002 bic r3, r3, r2
> > 6c: e3c3347f bic r3, r3, #2130706432 ; 0x7f000000
> > 70: e5b1c004 ldr ip, [r1, #4]!
> > 74: e3c3387f bic r3, r3, #8323072 ; 0x7f0000
> > 78: e3c33c7f bic r3, r3, #32512 ; 0x7f00
> > 7c: e152000c cmp r2, ip
> > 80: e3c3307f bic r3, r3, #127 ; 0x7f
> > 84: 0afffff0 beq 4c <strcmp+0x4c>
> >
> > for the aligned case. That comes to 2 loads 12 data insns
> > and 1 branch
> > (treating the not-taken branch as a data op) for each pair of 4 bytes.
> > Also, one of the loads is used in the immediately following insn,
> > creating a stall on many cpus. For the unaligned case the c code
> > compiles to
> > a4: e5d12000 ldrb r2, [r1]
> > a8: e1530002 cmp r3, r2
> > ac: e2800001 add r0, r0, #1 ; 0x1
> > b0: 1a000004 bne c8 <strcmp+0xc8>
> > b4: e5d03001 ldrb r3, [r0, #1]
> > b8: e3530000 cmp r3, #0 ; 0x0
> > bc: e2811001 add r1, r1, #1 ; 0x1
> > c0: 1afffff7 bne a4 <strcmp+0xa4>
> >
> > which gives 2 loads, 5 data ops and 1 branch for each pair of BYTES
> > compared. Additionally both loads are used in the following insn,
> > leading to stalls (though both are easily rectifiable in this case by
> > scheduling for a different CPU).
> >
> > The equivalent to the above in the new code is:
> >
> > 50: e04c2004 sub r2, ip, r4
> > 54: e15c0003 cmp ip, r3
> > 58: 0022200c eoreq r2, r2, ip
> > 5c: 01120384 tsteq r2, r4, lsl #7
> > 60: 0490c004 ldreq ip, [r0], #4
> > 64: 04913004 ldreq r3, [r1], #4
> > 68: 0afffff8 beq 50 <strcmp+0x50>
> >
> > which comes to 2 loads, 4 data insns and one branch for every 4 bytes
> > compared.
> >
> > and does the following when the strings do not have a mutual
> > alignment:
> >
> > e4: e3c4c4ff bic ip, r4, #-16777216 ; 0xff000000
> > e8: e15c0425 cmp ip, r5, lsr #8
> > ec: e0443002 sub r3, r4, r2
> > f0: e0233004 eor r3, r3, r4
> > f4: 1a000007 bne 118 <strcmp_unaligned+0x88>
> > f8: e0133382 ands r3, r3, r2, lsl #7
> > fc: 04915004 ldreq r5, [r1], #4
> > 100: 1a000006 bne 120 <strcmp_unaligned+0x90>
> > 104: e02cc004 eor ip, ip, r4
> > 108: e15c0c05 cmp ip, r5, lsl #24
> > 10c: 1a000008 bne 134 <strcmp_unaligned+0xa4>
> > 110: e4904004 ldr r4, [r0], #4
> > 114: eafffff2 b e4 <strcmp_unaligned+0x54>
> >
> > Note there are three possible cases for unaligned comparisons, each of
> > which has it's own variant loop (to avoid the need for expensive
> > variable shifts).
> >
> >
> > R.
> > --
> > Richard Earnshaw Email: Richard.Earnshaw@arm.com
> > ARM Ltd Phone: +44 1223 400569 (Direct +
> > VoiceMail)
> > 110 Fulbourn Road Switchboard: +44 1223 400400
> > Cherry Hinton Fax: +44 1223 400410
> > Cambridge CB1 9NJ Web: http://www.arm.com/
> > UK
> >
> >
--
Richard Earnshaw Email: Richard.Earnshaw@arm.com
ARM Ltd Phone: +44 1223 400569 (Direct + VoiceMail)
110 Fulbourn Road Switchboard: +44 1223 400400
Cherry Hinton Fax: +44 1223 400410
Cambridge CB1 9NJ Web: http://www.arm.com/
UK