This is the mail archive of the
libc-alpha@sources.redhat.com
mailing list for the glibc project.
Re: PATCH: Optimize memcmp for ia32
On Tue, Feb 10, 2004 at 09:18:30AM -0800, H. J. Lu wrote:
> On Tue, Feb 10, 2004 at 03:48:19PM +0100, Jakub Jelinek wrote:
> > On Wed, Feb 04, 2004 at 04:11:26PM -0800, H. J. Lu wrote:
> > > This patch optimizes memcmp for ia32. I got an average speedup of
> > > around 400%.
> >
> > If not anything else, you should certainly handle PIC vs. !PIC differently
> > (for !PIC you don't need to call thunk etc.).
>
> I can change it.
>
> > Also, why do you need to use %ebx register when for example %eax is always
> > available?
>
> I will take a look.
>
> > Why do you need 4 separate L(Nbytes) sequences, the only difference between
> > them is in the last few instructions? The bigger the routine is, the more
> > other instructions will be kicked out of the caches (especially for a
> > routine which is not the topmost in the benchmarks).
> > I'd say avoiding the table_32bytes table altogether, using just one of the
> > 4 sequences (with adjusted start) and computing the jump destination in
> > registers shouldn't slow things down.
>
> The adjustment may cause a slowdown. With the jump table, we don't
> need to adjust anything at all for memory blocks smaller than 32 bytes.
> That is where the speedup comes from.

I meant instead of
	addl %ecx, %edx
	addl %ecx, %esi
do:
	andl $-4, %ecx
	addl %ecx, %edx
	addl %ecx, %esi
or something like that (then you'd just start with -28(%esi) -> for 4
cases). The %ecx & 3 previous value would need to be preserved till
the end, e.g. in the %ebx register which could be replaced with %eax
and you could hardcode that it jumps to L(28bytes) + 14 * (INDEX / 4).
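
For illustration only, here is a rough C model of the adjustment described above. It is a sketch, not code from the patch. It assumes the length is below 32, the function name small_cmp_sketch is hypothetical, and the loop merely stands in for entering a single unrolled sequence at a computed point. The length is rounded down to a multiple of 4 before both pointers are advanced, and the low two bits are kept aside for the trailing bytes.

```c
#include <stddef.h>

/* Hypothetical model of the suggested pointer adjustment; assumes n < 32.
   Returns <0, 0 or >0 like memcmp.  Not taken from the actual patch.  */
static int
small_cmp_sketch (const void *s1, const void *s2, size_t n)
{
  size_t words = n & ~(size_t) 3;	/* andl $-4, %ecx */
  size_t tail = n & 3;			/* value preserved till the end */
  const unsigned char *p1 = (const unsigned char *) s1 + words; /* addl */
  const unsigned char *p2 = (const unsigned char *) s2 + words; /* addl */
  size_t off, i;

  /* One shared sequence addressed with negative offsets from the
     advanced pointers; the starting offset plays the role of the
     computed jump destination.  */
  for (off = words; off > 0; off -= 4)
    for (i = 0; i < 4; i++)
      if ((p1 - off)[i] != (p2 - off)[i])
	return (int) (p1 - off)[i] - (int) (p2 - off)[i];

  /* The remaining 0..3 bytes start exactly at the advanced pointers.  */
  for (i = 0; i < tail; i++)
    if (p1[i] != p2[i])
      return (int) p1[i] - (int) p2[i];

  return 0;
}
```

With n = 7, for example, words is 4 and tail is 3, so one 4-byte step at offset -4 is followed by three trailing bytes at offsets 0..2.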
Jakub