This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

RE: Optimized arm string routines

From: Richard Earnshaw <rearnsha at arm dot com>
To: "Schwarz, Konrad" <konrad dot schwarz at siemens dot com>
Cc: Jeff Johnston <jjohnstn at redhat dot com>, newlib at sourceware dot org
Date: Fri, 23 Jan 2009 13:51:27 +0000
Subject: RE: Optimized arm string routines
References: <1232446912.1485.12.camel@pc960.cambridge.arm.com> <4977B998.4080805@redhat.com> <A0725A2C3F6E9A4699A9A416DE9265EF02EADCDA@MCHP7RCA.ww002.siemens.net> <1232617584.6156.0.camel@pc960.cambridge.arm.com> <4978B54D.5060703@redhat.com> <1232649276.8187.19.camel@pc960.cambridge.arm.com> <A0725A2C3F6E9A4699A9A416DE9265EF02EAE013@MCHP7RCA.ww002.siemens.net>

On Fri, 2009-01-23 at 09:12 +0100, Schwarz, Konrad wrote:
> I'm sorry, I got confused.  Eric Blake contributed a sophisticated
> strstr() using the Two Way algorithm, see
> http://sourceware.org/ml/newlib/2008/msg00029.html.  I was wondering how
> this coexists with machine-specific implementations.
> 

No problem.  I don't think I'm going to try and duplicate that in
assembler -- far too complex :-)

R.
> Regards,
> 
> Konrad Schwarz
> 
> > -----Original Message-----
> > From: Richard Earnshaw [mailto:rearnsha@arm.com] 
> > Sent: Thursday, January 22, 2009 19:35
> > To: Jeff Johnston
> > Cc: Schwarz, Konrad; newlib@sourceware.org
> > Subject: Re: Optimized arm string routines
> > 
> > On Thu, 2009-01-22 at 13:05 -0500, Jeff Johnston wrote:
> > > Richard Earnshaw wrote:
> > > > On Thu, 2009-01-22 at 08:59 +0100, Schwarz, Konrad wrote:
> > > >   
> > > >>> Subject: Re: Optimized arm string routines
> > > >>>       
> > > >> How do these fit in with the optimized strcmp() provided 
> > by Eric Blake?
> > > >>
> > > >>     
> > > >
> > > > I've not seen that.  Do you have a link?
> > > >
> > > > R.
> > > >
> > > >   
> > > I believe Konrad is referring to the work Eric did for handling 
> > > unaligned data to avoid performance penalties.  A similar 
> > optimization 
> > > might make sense in the ASM version.
> > > 
> > 
> > The strcmp I posted handles all cases (except where termination occurs
> > very early on during the alignment phase) by using word loads (and
> > shifts if needed).
> > 
> > > Perhaps if you were to publish some performance numbers (aligned, 
> > > unaliged data) of the new code vs the generic, that would probably 
> > > help.  Do you happen to have any of these numbers handy?
> > > 
> > 
> > Not at the moment.  I'll see if I can find time to knock out the
> > numbers.  However, looking at the strcmp code from the C variant when
> > compiled for arm we have:
> > 
> >   4c:   e3530000        cmp     r3, #0  ; 0x0
> >   50:   1a00001e        bne     d0 <strcmp+0xd0>
> >   54:   e5b02004        ldr     r2, [r0, #4]!
> >   58:   e28234ff        add     r3, r2, #-16777216      ; 0xff000000
> >   5c:   e2433801        sub     r3, r3, #65536  ; 0x10000
> >   60:   e2433c01        sub     r3, r3, #256    ; 0x100
> >   64:   e2433001        sub     r3, r3, #1      ; 0x1
> >   68:   e1c33002        bic     r3, r3, r2
> >   6c:   e3c3347f        bic     r3, r3, #2130706432     ; 0x7f000000
> >   70:   e5b1c004        ldr     ip, [r1, #4]!
> >   74:   e3c3387f        bic     r3, r3, #8323072        ; 0x7f0000
> >   78:   e3c33c7f        bic     r3, r3, #32512  ; 0x7f00
> >   7c:   e152000c        cmp     r2, ip
> >   80:   e3c3307f        bic     r3, r3, #127    ; 0x7f
> >   84:   0afffff0        beq     4c <strcmp+0x4c>
> > 
> > for the aligned case.   That comes to 2 loads 12 data insns 
> > and 1 branch
> > (treating the not-taken branch as a data op) for each pair of 4 bytes.
> > Also, one of the loads is used in the immediately following insn,
> > creating a stall on many cpus.  For the unaligned case the c code
> > compiles to 
> >   a4:   e5d12000        ldrb    r2, [r1]
> >   a8:   e1530002        cmp     r3, r2
> >   ac:   e2800001        add     r0, r0, #1      ; 0x1
> >   b0:   1a000004        bne     c8 <strcmp+0xc8>
> >   b4:   e5d03001        ldrb    r3, [r0, #1]
> >   b8:   e3530000        cmp     r3, #0  ; 0x0
> >   bc:   e2811001        add     r1, r1, #1      ; 0x1
> >   c0:   1afffff7        bne     a4 <strcmp+0xa4>
> > 
> > which gives 2 loads, 5 data ops and 1 branch for each pair of BYTES
> > compared.  Additionally both loads are used in the following insn,
> > leading to stalls (though both are easily rectifiable in this case by
> > scheduling for a different CPU).
> > 
> > The equivalent to the above in the new code is:
> > 
> >   50:   e04c2004        sub     r2, ip, r4
> >   54:   e15c0003        cmp     ip, r3
> >   58:   0022200c        eoreq   r2, r2, ip
> >   5c:   01120384        tsteq   r2, r4, lsl #7
> >   60:   0490c004        ldreq   ip, [r0], #4
> >   64:   04913004        ldreq   r3, [r1], #4
> >   68:   0afffff8        beq     50 <strcmp+0x50>
> > 
> > which comes to 2 loads, 4 data insns and one branch for every 4 bytes
> > compared.
> > 
> > and does the following when the strings do not have a mutual 
> > alignment:
> > 
> >   e4:   e3c4c4ff        bic     ip, r4, #-16777216      ; 0xff000000
> >   e8:   e15c0425        cmp     ip, r5, lsr #8
> >   ec:   e0443002        sub     r3, r4, r2
> >   f0:   e0233004        eor     r3, r3, r4
> >   f4:   1a000007        bne     118 <strcmp_unaligned+0x88>
> >   f8:   e0133382        ands    r3, r3, r2, lsl #7
> >   fc:   04915004        ldreq   r5, [r1], #4
> >  100:   1a000006        bne     120 <strcmp_unaligned+0x90>
> >  104:   e02cc004        eor     ip, ip, r4
> >  108:   e15c0c05        cmp     ip, r5, lsl #24
> >  10c:   1a000008        bne     134 <strcmp_unaligned+0xa4>
> >  110:   e4904004        ldr     r4, [r0], #4
> >  114:   eafffff2        b       e4 <strcmp_unaligned+0x54>
> > 
> > Note there are three possible cases for unaligned comparisons, each of
> > which has it's own variant loop (to avoid the need for expensive
> > variable shifts).
> > 
> > 
> > R.
> > -- 
> > Richard Earnshaw             Email: Richard.Earnshaw@arm.com
> > ARM Ltd                      Phone: +44 1223 400569 (Direct + 
> > VoiceMail)
> > 110 Fulbourn Road            Switchboard: +44 1223 400400
> > Cherry Hinton                Fax: +44 1223 400410
> > Cambridge CB1 9NJ            Web: http://www.arm.com/
> > UK
> > 
> > 
-- 
Richard Earnshaw             Email: Richard.Earnshaw@arm.com
ARM Ltd                      Phone: +44 1223 400569 (Direct + VoiceMail)
110 Fulbourn Road            Switchboard: +44 1223 400400
Cherry Hinton                Fax: +44 1223 400410
Cambridge CB1 9NJ            Web: http://www.arm.com/
UK

References:
- Optimized arm string routines
  - From: Richard Earnshaw
- Re: Optimized arm string routines
  - From: Jeff Johnston
- RE: Optimized arm string routines
  - From: Schwarz, Konrad
- RE: Optimized arm string routines
  - From: Richard Earnshaw
- Re: Optimized arm string routines
  - From: Jeff Johnston
- Re: Optimized arm string routines
  - From: Richard Earnshaw
- RE: Optimized arm string routines
  - From: Schwarz, Konrad

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]