Re: [PATCH] x86_32: memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk optimized with SSE2 unaligned loads/stores


Hi, Ondřej!

>
> +
> +ENTRY (MEMCPY)
> +       ENTRANCE
> +       movl    LEN(%esp), %ecx
> +       movl    SRC(%esp), %eax
> +       movl    DEST(%esp), %edx
> +
> +       cmp     %edx, %eax
> +       je      L(return)
> +
> As the case src==dest is quite rare, this check would only slow down the implementation; drop it.

Ok.

> +# ifdef USE_AS_MEMMOVE
> +       jg      L(check_forward)
> +
> +       add     %ecx, %eax
> +       cmp     %edx, %eax
> +       movl    SRC(%esp), %eax
> +       jle     L(forward)
> +
> Also you do not need this check here: up to 128 bytes we first read the entire src and only after that do the writes.

Thank you, ok.
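
For the record, the reason the check can go is that the small-size paths are written in a read-everything-then-write-everything style, which is overlap-safe by construction. A rough C-intrinsics sketch of the idea for a 32-byte chunk (illustrative only, not the patch code):

#include <emmintrin.h>
#include <stdint.h>

/* Illustrative only: a 32-byte move that needs no src/dst comparison,
   because both source chunks are loaded into registers before either
   store is issued, so it is safe even for overlapping buffers.  */
void
move32_read_then_write (uint8_t *dst, const uint8_t *src)
{
  __m128i lo = _mm_loadu_si128 ((const __m128i *) src);
  __m128i hi = _mm_loadu_si128 ((const __m128i *) (src + 16));
  _mm_storeu_si128 ((__m128i *) dst, lo);
  _mm_storeu_si128 ((__m128i *) (dst + 16), hi);
}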

> snip
>
> Does that prefetch improve performance? On x64 it harmed performance and 128 bytes looks too small to matter.
> +
> +       prefetcht0 -128(%edi, %esi)
> +
> +       movdqu  -64(%edi, %esi), %xmm0
> +       movdqu  -48(%edi, %esi), %xmm1
> +       movdqu  -32(%edi, %esi), %xmm2
> +       movdqu  -16(%edi, %esi), %xmm3
> +       movdqa  %xmm0, -64(%edi)
> +       movdqa  %xmm1, -48(%edi)
> +       movdqa  %xmm2, -32(%edi)
> +       movdqa  %xmm3, -16(%edi)
> +       leal    -64(%edi), %edi
> +       cmp     %edi, %ebx
> +       jb      L(mm_main_loop_backward)
> +L(mm_main_loop_backward_end):
> +       POP (%edi)
> +       POP (%esi)
> +       jmp     L(mm_recalc_len)

Disabling the prefetch here and in the case below leads to a 10% degradation on
Silvermont in 3 tests. On Haswell performance is almost the same.
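
For reference, the quoted backward loop corresponds roughly to the C-intrinsics sketch below (the names and the exact prefetch distance are illustrative, not taken from the patch); the measurement above compares this loop with and without the _mm_prefetch line:

#include <emmintrin.h>
#include <xmmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Copy 64 bytes per iteration from the end of the region down towards
   dst_stop, prefetching the source cache line needed two iterations
   later.  dst_end is assumed 16-byte aligned at this point, as in the
   assembly.  */
void
copy_backward_64 (uint8_t *dst_end, const uint8_t *src_end, uint8_t *dst_stop)
{
  ptrdiff_t off = src_end - dst_end;   /* like keeping src - dst in %esi */
  while (dst_end > dst_stop)
    {
      _mm_prefetch ((const char *) dst_end + off - 128, _MM_HINT_T0);
      __m128i x0 = _mm_loadu_si128 ((const __m128i *) (dst_end + off - 64));
      __m128i x1 = _mm_loadu_si128 ((const __m128i *) (dst_end + off - 48));
      __m128i x2 = _mm_loadu_si128 ((const __m128i *) (dst_end + off - 32));
      __m128i x3 = _mm_loadu_si128 ((const __m128i *) (dst_end + off - 16));
      _mm_store_si128 ((__m128i *) (dst_end - 64), x0);
      _mm_store_si128 ((__m128i *) (dst_end - 48), x1);
      _mm_store_si128 ((__m128i *) (dst_end - 32), x2);
      _mm_store_si128 ((__m128i *) (dst_end - 16), x3);
      dst_end -= 64;
    }
}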

> +L(mm_recalc_len):
> +/* Compute in %ecx how many bytes are left to copy after
> +       the main loop stops.  */
> +       movl    %ebx, %ecx
> +       subl    %edx, %ecx
> +       jmp     L(mm_len_0_or_more_backward)
> +
> That also looks slow as it adds an unpredictable branch. On x64 we read the start and end into registers before the loop starts, and write these registers out when it ends.
> If you align to 16 bytes instead of 64 you need only 4 registers to save the end, 4 working registers, and you save the 16 bytes at the start onto the stack.

It is not quite clear what you mean. On x64 there is no memmove and no
backward case...
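
If you mean something like the following forward-direction scheme (a rough sketch only, assuming n >= 32 and made-up names), then it is not obvious how to apply it to the backward/overlapping case:

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the x64-style idea for a plain forward copy: the unaligned
   first and last 16-byte chunks are kept in registers across the aligned
   main loop and written when it ends, so no leftover-length
   recalculation or extra branch is needed for the tail.  */
void
copy_forward_head_tail (uint8_t *dst, const uint8_t *src, size_t n)
{
  __m128i head = _mm_loadu_si128 ((const __m128i *) src);
  __m128i tail = _mm_loadu_si128 ((const __m128i *) (src + n - 16));

  /* Aligned main loop; whatever it does not reach is covered by the
     saved head and tail chunks.  */
  size_t skew = (16 - ((uintptr_t) dst & 15)) & 15;
  uint8_t *d = dst + skew;
  const uint8_t *s = src + skew;
  uint8_t *d_stop = dst + n - 16;
  while (d < d_stop)
    {
      _mm_store_si128 ((__m128i *) d,
                       _mm_loadu_si128 ((const __m128i *) s));
      d += 16;
      s += 16;
    }

  /* Write the saved edges when the loop ends.  */
  _mm_storeu_si128 ((__m128i *) dst, head);
  _mm_storeu_si128 ((__m128i *) (dst + n - 16), tail);
}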

>
> +       movdqu  %xmm0, (%edx)
> +       movdqu  %xmm1, 16(%edx)
> +       movdqu  %xmm2, 32(%edx)
> +       movdqu  %xmm3, 48(%edx)
> +       movdqa  %xmm4, (%edi)
> +       movaps  %xmm5, 16(%edi)
> +       movaps  %xmm6, 32(%edi)
> +       movaps  %xmm7, 48(%edi)
> Why did you add floating point moves here?

Because movaps with an offset is 4 bytes long, which leads to an improvement
in instruction alignment and code size.
I also inserted it in more places.

> +
> +/* We should stop two iterations before the termination
> +       (in order not to misprefetch).  */
> +       subl    $64, %ecx
> +       cmpl    %ebx, %ecx
> +       je      L(main_loop_just_one_iteration)
> +
> +       subl    $64, %ecx
> +       cmpl    %ebx, %ecx
> +       je      L(main_loop_last_two_iterations)
> +
> Same comment: prefetching is unlikely to help, so you need to show that it helps versus a variant where you omit it.

Disabling prefetching here gives a degradation of up to 11% on Silvermont;
on Haswell there are no significant changes.
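
For clarity, the "stop two iterations before the termination" structure corresponds roughly to this sketch (illustrative names and prefetch distance; handling of the remaining bytes is omitted):

#include <emmintrin.h>
#include <xmmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Prefetch two 64-byte iterations ahead; purely illustrative.  */
#define PREFETCH_DIST 128

void
copy_forward_prefetched (uint8_t *dst, const uint8_t *src, size_t n)
{
  size_t i = 0;

  /* Prefetching loop: run only while at least two more full 64-byte
     iterations remain, so the prefetch never points past src + n.  */
  for (; i + 3 * 64 <= n; i += 64)
    {
      _mm_prefetch ((const char *) src + i + PREFETCH_DIST, _MM_HINT_T0);
      for (size_t j = 0; j < 64; j += 16)
        _mm_storeu_si128 ((__m128i *) (dst + i + j),
                          _mm_loadu_si128 ((const __m128i *) (src + i + j)));
    }

  /* Last one or two full iterations, without prefetch; a real
     implementation would also copy the remaining n % 64 bytes.  */
  for (; i + 64 <= n; i += 64)
    for (size_t j = 0; j < 64; j += 16)
      _mm_storeu_si128 ((__m128i *) (dst + i + j),
                        _mm_loadu_si128 ((const __m128i *) (src + i + j)));
}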

> +
> +       .p2align 4
> +L(main_loop_large_page):
>
> However, here prefetching should help as the copy is sufficiently large; also, the loads could be nontemporal.

Prefetching here gives no significant performance change (all results attached).

Could you clarify which nontemporal loads you mean? Unaligned loads are
needed here, and I know of only aligned nontemporal loads.
It is also not clear why prefetch should help with nontemporal access...
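
(As far as I know the only nontemporal load, movntdqa, is SSE4.1-only, requires an aligned operand and is mainly useful on write-combining memory, hence the question.) If nontemporal stores are meant instead, the usual large-copy pattern would be something like this sketch (illustrative name; assumes a 16-byte-aligned destination and a length that is a multiple of 64):

#include <emmintrin.h>
#include <xmmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* For very large copies, movntdq (_mm_stream_si128) bypasses the cache
   on the store side so the destination does not evict useful data.  */
void
copy_large_streaming (uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i += 64)
    {
      __m128i x0 = _mm_loadu_si128 ((const __m128i *) (src + i));
      __m128i x1 = _mm_loadu_si128 ((const __m128i *) (src + i + 16));
      __m128i x2 = _mm_loadu_si128 ((const __m128i *) (src + i + 32));
      __m128i x3 = _mm_loadu_si128 ((const __m128i *) (src + i + 48));
      _mm_stream_si128 ((__m128i *) (dst + i), x0);
      _mm_stream_si128 ((__m128i *) (dst + i + 16), x1);
      _mm_stream_si128 ((__m128i *) (dst + i + 32), x2);
      _mm_stream_si128 ((__m128i *) (dst + i + 48), x3);
    }
  _mm_sfence ();   /* order the streaming stores before returning */
}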

Attaching the edited patch.
The 32-bit build was tested with no new regressions.

Attachment: results_memcpy_slm.tar.bz2
Description: BZip2 compressed data

Attachment: results_memcpy_hsw.tar.bz2
Description: BZip2 compressed data

Attachment: results_memmove_slm.tar.bz2
Description: BZip2 compressed data

Attachment: results_memmove_hsw.tar.bz2
Description: BZip2 compressed data

Attachment: memcpy_mempcpy_memmove_with_chk_sse2_unaligned_i386_v2.patch
Description: Binary data

