This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] x86_32: memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk optimized with SSE2 unaligned loads/stores


On Tue, Jul 08, 2014 at 03:25:26PM +0400, Andrew Senkevich wrote:
> >
> > Does that prefetch improve performance? On x64 it harmed performance, and 128 bytes looks like too short a prefetch distance to matter.
> > +
> > +       prefetcht0 -128(%edi, %esi)
> > +
> > +       movdqu  -64(%edi, %esi), %xmm0
> > +       movdqu  -48(%edi, %esi), %xmm1
> > +       movdqu  -32(%edi, %esi), %xmm2
> > +       movdqu  -16(%edi, %esi), %xmm3
> > +       movdqa  %xmm0, -64(%edi)
> > +       movdqa  %xmm1, -48(%edi)
> > +       movdqa  %xmm2, -32(%edi)
> > +       movdqa  %xmm3, -16(%edi)
> > +       leal    -64(%edi), %edi
> > +       cmp     %edi, %ebx
> > +       jb      L(mm_main_loop_backward)
> > +L(mm_main_loop_backward_end):
> > +       POP (%edi)
> > +       POP (%esi)
> > +       jmp     L(mm_recalc_len)
> 
> Disabling prefetch here and in the case below leads to a 10% degradation on
> Silvermont in 3 tests. On Haswell performance is almost the same.
>
I have a Silvermont optimization on my todo list; it needs a separate
implementation, as Silvermont behaves differently from most other architectures.

From the testing that I have done, it looks like simply using rep movsq is
faster for strings up to around 1024 bytes.
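For illustration, a minimal sketch of that rep-movs approach (not from the
patch; x86_64 calling convention assumed, %rdi = dst, %rsi = src, %rdx = len):

 movq	%rdi, %rax		# memcpy returns dst
 movq	%rdx, %rcx
 shrq	$3, %rcx		# copy len / 8 qwords first
 rep movsq
 movl	%edx, %ecx
 andl	$7, %ecx		# then the remaining 0..7 bytes
 rep movsb
 ret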

 
> > +L(mm_recalc_len):
> > +/* Compute in %ecx how many bytes are left to copy after
> > +       the main loop stops.  */
> > +       movl    %ebx, %ecx
> > +       subl    %edx, %ecx
> > +       jmp     L(mm_len_0_or_more_backward)
> > +
> > That also looks slow as it adds an unpredictable branch. On x64 we read the start and end into registers before the loop starts, and write these registers when it ends.
> > If you align to 16 bytes instead of 64, you need only 4 registers to save the end, 4 working registers, and 16 bytes of the start saved on the stack.
> 
> Not very clear what you mean. On x64 there is no memmove and no
> backward case...
>
I was referring to memcpy, as the same trick could be applied to the memmove
backward case. A forward loop looks like this, though there could be a problem
with running out of registers.

 # The unaligned head (%xmm8) and the last 64 bytes (%xmm4-%xmm7) are
 # loaded before the loop and stored only after it, so no extra branch
 # is needed for the remainder.
 movdqu -16(%rsi,%rdx), %xmm4
 movdqu -32(%rsi,%rdx), %xmm5
 movdqu -48(%rsi,%rdx), %xmm6
 movdqu -64(%rsi,%rdx), %xmm7
 lea 	(%rdi, %rdx), %r10	# %r10 = end of destination
 movdqu (%rsi), %xmm8		# first 16 bytes of source

 movq 	%rdi, %rcx
 subq 	%rsi, %rcx
 cmpq 	%rdx, %rcx		# if dst - src < len the regions overlap,
 jb 	.Lbwd			# so copy backward

 leaq 	16(%rdi), %rdx		# %rdx = destination rounded up to 16 bytes
 andq 	$-16, %rdx
 movq 	%rdx, %rcx
 subq 	%rdi, %rcx		# alignment adjustment
 addq 	%rcx, %rsi		# advance the source by the same amount
 movq 	%r10, %rcx
 subq 	%rdx, %rcx
 shrq 	$6, %rcx		# %rcx = number of 64-byte iterations

 .p2align 4
.Lloop:
 movdqu (%rsi), %xmm0
 movdqu 16(%rsi), %xmm1
 movdqu 32(%rsi), %xmm2
 movdqu 48(%rsi), %xmm3
 movdqa %xmm0, (%rdx)		# stores to the destination are aligned
 addq	$64, %rsi
 movdqa %xmm1, 16(%rdx)
 movdqa %xmm2, 32(%rdx)
 movdqa %xmm3, 48(%rdx)
 addq	$64, %rdx
 subq	$1, %rcx
 jnz	.Lloop
 movdqu %xmm8, (%rdi)		# finally write the saved head and tail
 movdqu %xmm4, -16(%r10)
 movdqu %xmm5, -32(%r10)
 movdqu %xmm6, -48(%r10)
 movdqu %xmm7, -64(%r10)
 ret

 
> >
> > +       movdqu  %xmm0, (%edx)
> > +       movdqu  %xmm1, 16(%edx)
> > +       movdqu  %xmm2, 32(%edx)
> > +       movdqu  %xmm3, 48(%edx)
> > +       movdqa  %xmm4, (%edi)
> > +       movaps  %xmm5, 16(%edi)
> > +       movaps  %xmm6, 32(%edi)
> > +       movaps  %xmm7, 48(%edi)
> > Why did you add floating point moves here?
> 
> Because movaps with an offset has a 4-byte encoding, which improves
> instruction alignment and code size.
> I also inserted it in more places.
> 
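For reference, the size difference is the mandatory 0x66 prefix on the integer
form; for example, with a one-byte displacement:

 66 0f 6f 47 10    movdqa 0x10(%edi), %xmm0    # 5 bytes
 0f 28 47 10       movaps 0x10(%edi), %xmm0    # 4 bytes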
> > +
> > +/* We should stop two iterations before the termination
> > +       (in order not to misprefetch).  */
> > +       subl    $64, %ecx
> > +       cmpl    %ebx, %ecx
> > +       je      L(main_loop_just_one_iteration)
> > +
> > +       subl    $64, %ecx
> > +       cmpl    %ebx, %ecx
> > +       je      L(main_loop_last_two_iterations)
> > +
> > Same comment: prefetching is unlikely to help, so you need to show that it helps versus a variant where you omit it.
> 
> Disabling prefetching here gives a degradation of up to 11% on Silvermont;
> on Haswell there are no significant changes.
> 
> > +
> > +       .p2align 4
> > +L(main_loop_large_page):
> >
> > However, here prefetching should help as the copy is sufficiently large; the loads could also be nontemporal.
> 
> Prefetching here gives no significant performance change (all results attached).
> 
> Could you clarify which nontemporal loads you mean? Unaligned loads are
> needed here, but I only know of aligned nontemporal loads.
> It is also not clear why prefetch should help with nontemporal access...
>
Try prefetchnta.
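Something along these lines for the large-page loop (only a sketch; the
registers, the prefetch distance, and the assumption that the destination was
already aligned to 16 bytes are illustrative, not taken from the patch):

 	.p2align 4
.Llarge_page:
 	prefetchnta	512(%esi)	# nontemporal hint for the incoming source
 	movdqu	(%esi), %xmm0
 	movdqu	16(%esi), %xmm1
 	movdqu	32(%esi), %xmm2
 	movdqu	48(%esi), %xmm3
 	movdqa	%xmm0, (%edi)		# assumes %edi was aligned beforehand
 	movdqa	%xmm1, 16(%edi)
 	movdqa	%xmm2, 32(%edi)
 	movdqa	%xmm3, 48(%edi)
 	leal	64(%esi), %esi
 	leal	64(%edi), %edi
 	subl	$1, %ecx		# %ecx = remaining 64-byte blocks
 	jnz	.Llarge_page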
 
> Attaching the edited patch.
> The 32-bit build was tested with no new regressions.
