This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH RFC] Improve 64-bit memset performance for Haswell CPU with AVX2 instructions


On Thu, Jun 5, 2014 at 9:32 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Wed, Jun 04, 2014 at 03:00:05PM +0800, Ling Ma wrote:
>> H.J
>>
>> The website changed IP, now the code is available again:
>>  http://www.yunos.org/tmp/memset-avx2.patch ,
>> and also gziped as attachment in this mail.
>>
>> Thanks
>> Ling
>>
>
> Now the performance looks OK to me, but there are a few formatting
> problems.  With those fixed I would be satisfied.  H.J., do you have
> comments?

I don't have any additional comments.  Thanks.

> There is a possible followup to also optimize __bzero the way we do
> in the general case.
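>
> For illustration, a minimal sketch of that followup (the entry name
> and the L(entry_from_bzero) label follow the pattern of the generic
> sysdeps/x86_64/memset.S; treat the exact label as an assumption):
>
> 	/* __bzero (dst, len): reuse the memset body by loading a zero
> 	   fill value and jumping past memset's argument setup.  */
> ENTRY (__bzero)
> 	movq	%rdi, %rax	/* memset returns DST.              */
> 	movq	%rsi, %rdx	/* LEN goes where memset keeps it.  */
> 	xorl	%esi, %esi	/* fill value 0.                    */
> 	jmp	L(entry_from_bzero)
> END (__bzero)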
>
> A second followup would be to decrease the function size by
> reshuffling blocks; in several places there are 15/16 free bytes
> due to alignment.
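>
> To make that concrete (an illustration, not code from the patch):
> every .p2align 4 rounds the location counter up to a 16-byte
> boundary, so a block whose code ends just past a boundary wastes
> almost a whole line:
>
> 	.p2align 4	/* advance to the next 16-byte boundary.    */
> L(a):	ret		/* 1 byte of code...                        */
> 	.p2align 4	/* ...so 15 NOP bytes of padding land here; */
> L(b):	ret		/* a block of up to 15 bytes moved between  */
> 			/* L(a) and L(b) would cost no extra size.  */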
>
> The formatting problems are here:
>
> +       vpxor   %xmm0, %xmm0, %xmm0
> +       vmovd %esi, %xmm1
> +       mov     %rdi, %rsi
> +       mov     %rdi, %rax
>
> here
>
> +L(less_16bytes):
> +       vmovd %xmm0, %rcx
> +       cmp     $8, %dl
> +       jb      L(less_8bytes)
> +       mov %rcx, (%rdi)
> +       mov %rcx, -0x08(%rsi)
> +       ret
> +
> +       .p2align 4
> +L(less_8bytes):
> +       cmp     $4, %dl
> +       jb      L(less_4bytes)
> +       mov %ecx, (%rdi)
> +       mov %ecx, -0x04(%rsi)
> +       ret
>
> and here
>
> +       mov     %rax, %rsi
> +       vmovd %xmm0, %eax
> +       mov     %rdx, %rcx
>
> As I mentioned regarding code size, one trick is that instructions
> with a -128 immediate are shorter than those with 128.  You could
> save 16 bytes with the following modification; however, it must be
> tested whether it improves performance.
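>
> Concretely (standard x86-64 instruction encodings; the byte counts
> in the comments come from the ISA, not from the patch):
>
> 	add	$128, %edx	/* 81 C2 80 00 00 00: imm32, 6 bytes.     */
> 	add	$-128, %edx	/* 83 C2 80: sign-extended imm8, 3 bytes. */
> 	sub	$128, %rdi	/* 48 81 EF 80 00 00 00: 7 bytes.         */
> 	sub	$-128, %rdi	/* 48 83 EF 80: 4 bytes; -128 fits in an  */
> 				/* imm8, +128 does not.                   */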
>
>
> --- x   2014-06-05 18:20:35.313645591 +0200
> +++ sysdeps/x86_64/multiarch/memset-avx2.S      2014-06-05 18:22:25.068642767 +0200
> @@ -95,7 +95,6 @@
>         .p2align 4
>  L(256bytesormore):
>         vinserti128 $1, %xmm0, %ymm0, %ymm0
> -       mov     $0x80, %rcx
>         add     %rdx, %rsi
>         mov     %rdi, %r9
>         vmovdqu %ymm0, (%rdi)
> @@ -105,15 +104,15 @@
>         add     %r9, %rdx
>         cmp     $4096, %rdx
>         ja      L(gobble_data)
> -       sub     %ecx, %edx
> +       add     $-128, %edx
>  L(gobble_128_loop):
>         vmovdqa %ymm0, (%rdi)
>         vmovdqa %ymm0, 0x20(%rdi)
>         vmovdqa %ymm0, 0x40(%rdi)
>         vmovdqa %ymm0, 0x60(%rdi)
> -       add     %rcx, %rdi
> -       sub     %ecx, %edx
> -       jae     L(gobble_128_loop)
> +       sub     $-128, %rdi
> +       add     $-128, %edx
> +       jb      L(gobble_128_loop)
>         vmovdqu %ymm0, -0x80(%rsi)
>         vmovdqu %ymm0, -0x60(%rsi)
>         vmovdqu %ymm0, -0x40(%rsi)
>



-- 
H.J.

