This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- From: Ling Ma <ling dot ma dot program at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org, aj at suse dot com, liubov dot dmitrieva at gmail dot com, Ma Ling <ling dot ml at alibaba-inc dot com>
- Date: Tue, 30 Jul 2013 21:13:28 +0800
- Subject: Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- References: <1375090922-8418-1-git-send-email-ling dot ma dot program at gmail dot com> <20130730121907 dot GA2881 at domone dot kolej dot mff dot cuni dot cz>
>> + mov __x86_shared_cache_size_half(%rip), %r9
Ling: our machine has 4 cores, each running 2 hyperthreads, with an 8M
last-level cache, so __x86_shared_cache_size_half(%rip) should give us
512k (8M / 8 logical cores / 2). When the data size is over the 8M LLC
(512k * 16), we use non-temporal stores.
Thanks
Ling
2013/7/30, Ondřej Bílka <neleai@seznam.cz>:
> On Mon, Jul 29, 2013 at 05:42:02AM -0400, ling.ma.program@gmail.com wrote:
>> From: Ma Ling <ling.ml@alibaba-inc.com>
>>
>> In this patch we use an approach similar to memcpy's to
>> avoid branch instructions and to force the destination to be aligned
>> with AVX instructions. In the gcc.403 benchmark we find memset
>> spends 5~20 times more time than memcpy.
>>
> Another issue is whether the big loop is really needed. I tested a variant
> with the big loop disabled on Ivy Bridge: for sizes up to 262144 performance
> is about the same, but beyond that rep movsb becomes 20% faster.
>
> Ljuba, could you test also this case?
>
> size: 262144
> 0.44 0.45 0.44
> 0.46 0.44 0.43
> 0.44 0.44 0.44
> 0.45 0.45 0.45
> 0.46 0.44 0.45
> 0.44 0.44 0.46
> 0.45 0.44 0.46
> 0.44 0.44 0.44
> 0.44 0.45 0.45
> 0.48 0.44 0.44
> size: 524288
> 0.54 0.47 0.45
> 0.55 0.45 0.45
> 0.55 0.44 0.46
> 0.53 0.45 0.46
> 0.52 0.45 0.44
> 0.54 0.45 0.44
> 0.54 0.44 0.45
> 0.55 0.44 0.45
> 0.52 0.44 0.46
> 0.54 0.45 0.45
>
>> + ALIGN(4)
>> +L(gobble_data):
>> +#ifdef SHARED_CACHE_SIZE_HALF
>> + mov $SHARED_CACHE_SIZE_HALF, %r9
>> +#else
>> + mov __x86_shared_cache_size_half(%rip), %r9
>> +#endif
>> + shl $4, %r9
> Getting half of the cache size, then multiplying it by 16?
>> + cmp %r9, %rdx
>> + ja L(gobble_big_data)
>> + mov %rax, %r9
>> + mov %esi, %eax
>> + mov %rdx, %rcx
>> + rep stosb
>> + mov %r9, %rax
>> + vzeroupper
>> + ret
>> +
>> + ALIGN(4)
>> +L(gobble_big_data):
>> + sub $0x80, %rdx
>> +L(gobble_big_data_loop):
>> + vmovntdq %ymm0, (%rdi)
>> + vmovntdq %ymm0, 0x20(%rdi)
>> + vmovntdq %ymm0, 0x40(%rdi)
>> + vmovntdq %ymm0, 0x60(%rdi)
>> + lea 0x80(%rdi), %rdi
>> + sub $0x80, %rdx
>> + jae L(gobble_big_data_loop)
>> + vmovups %ymm0, -0x80(%r8)
>> + vmovups %ymm0, -0x60(%r8)
>> + vmovups %ymm0, -0x40(%r8)
>> + vmovups %ymm0, -0x20(%r8)
>> + vzeroupper
>> + sfence
>> + ret
>> +
>> +END (MEMSET)
>> +#endif
>
>