Re: [PATCH] x86_32: memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk optimized with SSE2 unaligned loads/stores
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Andrew Senkevich <andrew dot n dot senkevich at gmail dot com>
- Cc: libc-alpha <libc-alpha at sourceware dot org>
- Date: Tue, 8 Jul 2014 20:54:33 +0200
- Subject: Re: [PATCH] x86_32: memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk optimized with SSE2 unaligned loads/stores
On Tue, Jul 08, 2014 at 03:25:26PM +0400, Andrew Senkevich wrote:
> >
> > Does that prefetch improve performance? On x64 it harmed performance and 128 bytes looks too small to matter.
> > +
> > + prefetcht0 -128(%edi, %esi)
> > +
> > + movdqu -64(%edi, %esi), %xmm0
> > + movdqu -48(%edi, %esi), %xmm1
> > + movdqu -32(%edi, %esi), %xmm2
> > + movdqu -16(%edi, %esi), %xmm3
> > + movdqa %xmm0, -64(%edi)
> > + movdqa %xmm1, -48(%edi)
> > + movdqa %xmm2, -32(%edi)
> > + movdqa %xmm3, -16(%edi)
> > + leal -64(%edi), %edi
> > + cmp %edi, %ebx
> > + jb L(mm_main_loop_backward)
> > +L(mm_main_loop_backward_end):
> > + POP (%edi)
> > + POP (%esi)
> > + jmp L(mm_recalc_len)
>
> Disabling the prefetch here and in the case below leads to a 10% degradation
> on Silvermont in 3 tests. On Haswell performance is almost the same.
>
I have a Silvermont optimization on my todo list; it needs a separate
implementation since that core behaves differently from most other
microarchitectures. From the testing I did it looks like simply using
rep movsq is faster for strings up to around 1024 bytes.
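As a rough standalone sketch of what I mean (64-bit only, since movsq does
not exist in 32-bit mode; the name and the movsq/movsb split are purely
illustrative, this is not from any patch):

	.text
	.globl	memcpy_rep_sketch
memcpy_rep_sketch:
	movq	%rdi, %rax	/* memcpy returns dst.  */
	movq	%rdx, %rcx
	shrq	$3, %rcx	/* copy the bulk as 8-byte words...  */
	rep movsq
	movq	%rdx, %rcx
	andq	$7, %rcx	/* ...then the 0-7 remaining bytes.  */
	rep movsb
	ret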
> > +L(mm_recalc_len):
> > +/* Compute in %ecx how many bytes are left to copy after
> > + the main loop stops. */
> > + movl %ebx, %ecx
> > + subl %edx, %ecx
> > + jmp L(mm_len_0_or_more_backward)
> > +
> > That also looks slow, as it adds an unpredictable branch. On x64 we read the start and end into registers before the loop starts and write those registers out when it ends.
> > If you align to 16 bytes instead of 64 you need only 4 registers to save the end, 4 working registers, and you save the 16 bytes at the start on the stack.
>
> It is not very clear what you mean. On x64 there is no memmove and no
> backward case...
>
I was referring to memcpy, as the same trick could be applied to the
memmove backward case. A forward loop looks like this, though there could
be a problem that you run out of registers.
/* Save the last 64 and the first 16 bytes of the source in registers so
   the unaligned edges can be stored after the loop finishes.  */
movdqu -16(%rsi,%rdx), %xmm4
movdqu -32(%rsi,%rdx), %xmm5
movdqu -48(%rsi,%rdx), %xmm6
movdqu -64(%rsi,%rdx), %xmm7
lea (%rdi, %rdx), %r10	/* %r10 = end of destination.  */
movdqu (%rsi), %xmm8
/* If dst - src < len a forward copy would clobber the source,
   so take the backward path.  */
movq %rdi, %rcx
subq %rsi, %rcx
cmpq %rdx, %rcx
jb .Lbwd
/* Round the destination up to a 16-byte boundary, advance the source by
   the same amount and compute the number of 64-byte iterations.  */
leaq 16(%rdi), %rdx
andq $-16, %rdx
movq %rdx, %rcx
subq %rdi, %rcx
addq %rcx, %rsi
movq %r10, %rcx
subq %rdx, %rcx
shrq $6, %rcx
.p2align 4
.Lloop:
/* 64 bytes per iteration: unaligned loads, aligned stores.  */
movdqu (%rsi), %xmm0
movdqu 16(%rsi), %xmm1
movdqu 32(%rsi), %xmm2
movdqu 48(%rsi), %xmm3
movdqa %xmm0, (%rdx)
addq $64, %rsi
movdqa %xmm1, 16(%rdx)
movdqa %xmm2, 32(%rdx)
movdqa %xmm3, 48(%rdx)
addq $64, %rdx
sub $1, %rcx
jnz .Lloop
/* Finally store the saved head and tail; these may overlap the loop's
   stores, which is harmless.  */
movdqu %xmm8, (%rdi)
movdqu %xmm4, -16(%r10)
movdqu %xmm5, -32(%r10)
movdqu %xmm6, -48(%r10)
movdqu %xmm7, -64(%r10)
ret
> >
> > + movdqu %xmm0, (%edx)
> > + movdqu %xmm1, 16(%edx)
> > + movdqu %xmm2, 32(%edx)
> > + movdqu %xmm3, 48(%edx)
> > + movdqa %xmm4, (%edi)
> > + movaps %xmm5, 16(%edi)
> > + movaps %xmm6, 32(%edi)
> > + movaps %xmm7, 48(%edi)
> > Why did you add floating point moves here?
>
> Because movaps with an offset is 4 bytes long, which improves
> instruction alignment and code size.
> I also inserted it in more places.
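(For reference, the byte counts come from the opcodes: the movaps store is
0F 29 /r, while movdqa needs a 66 prefix, 66 0F 7F /r, so with a base
register and a disp8 the same aligned 16-byte store is one byte shorter:

movdqa %xmm5, 16(%edi)	/* 66 0f 7f 6f 10 -> 5 bytes */
movaps %xmm5, 16(%edi)	/* 0f 29 6f 10    -> 4 bytes */

Both perform the same 16-byte aligned store; the substitution is purely
for size.)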
>
> > +
> > +/* We should stop two iterations before the termination
> > + (in order not to misprefetch). */
> > + subl $64, %ecx
> > + cmpl %ebx, %ecx
> > + je L(main_loop_just_one_iteration)
> > +
> > + subl $64, %ecx
> > + cmpl %ebx, %ecx
> > + je L(main_loop_last_two_iterations)
> > +
> > Same comment: prefetching is unlikely to help here, so you need to show that it helps versus a variant where you omit it.
>
> Disabling prefetching here gives a degradation of up to 11% on Silvermont;
> on Haswell there are no significant changes.
>
> > +
> > + .p2align 4
> > +L(main_loop_large_page):
> >
> > However, here prefetching should help as the copy is sufficiently large; also the loads could be nontemporal.
>
> Prefetching here gives no significant performance change (all results attached).
>
> Could you clarify which nontemporal loads you mean? Unaligned loads are
> needed here, but I only know of aligned nontemporal loads.
> It is also not clear why prefetch should help with nontemporal access...
>
Try prefetchnta.
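Something along these lines is what I mean; a rough sketch only, with a
made-up prefetch distance and assuming %ecx already holds the remaining
length as a multiple of 64 and %edi is 16-byte aligned. SSE2 has no
unaligned nontemporal load, so the movdqu loads stay as they are; the
prefetchnta hint just asks the core to fetch the source with minimal
cache pollution, which is what you want for data that is only copied once:

L(large_page_loop_sketch):
	prefetchnta 512(%esi)	/* nontemporal hint, well ahead of the loads */
	movdqu (%esi), %xmm0
	movdqu 16(%esi), %xmm1
	movdqu 32(%esi), %xmm2
	movdqu 48(%esi), %xmm3
	movdqa %xmm0, (%edi)
	movdqa %xmm1, 16(%edi)
	movdqa %xmm2, 32(%edi)
	movdqa %xmm3, 48(%edi)
	leal 64(%esi), %esi
	leal 64(%edi), %edi
	subl $64, %ecx
	jnz L(large_page_loop_sketch)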
> Attaching the edited patch.
> The 32-bit build was tested with no new regressions.