This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction


>> +	mov	__x86_shared_cache_size_half(%rip), %r9
Ling: Our machine has 4 cores, each running 2 hyper-threads, and an 8 MB
last-level cache (LLC). So __x86_shared_cache_size_half(%rip) should give
us 512 KB (8 MB / 8 logical cores / 2). When the data size exceeds the
8 MB LLC (512 KB * 16), we use non-temporal stores.
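
The arithmetic above can be sketched in C. This is only an illustration of the numbers quoted in this message: the helper names are hypothetical, and the division by the logical core count is the description given here, not a statement of how glibc actually initializes __x86_shared_cache_size_half.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: half of the per-logical-core share of the LLC,
   following the description in this message (LLC / logical cores / 2). */
static size_t shared_cache_size_half(size_t llc_bytes, unsigned logical_cpus)
{
    assert(logical_cpus > 0);
    return llc_bytes / logical_cpus / 2;
}

/* Mirrors "shl $4, %r9" in the patch: threshold = half * 16. */
static size_t nontemporal_threshold(size_t half)
{
    return half << 4;
}

/* Mirrors "cmp %r9, %rdx; ja L(gobble_big_data)": an unsigned
   "strictly above" comparison selects the non-temporal path. */
static int use_nontemporal(size_t n, size_t threshold)
{
    return n > threshold;
}
```

With an 8 MB LLC and 8 logical cores the half-share is 512 KB, and the shift brings the threshold back to 8 MB, so only buffers larger than the whole LLC take the non-temporal path.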

Thanks
Ling

2013/7/30, Ondřej Bílka <neleai@seznam.cz>:
> On Mon, Jul 29, 2013 at 05:42:02AM -0400, ling.ma.program@gmail.com wrote:
>> From: Ma Ling <ling.ml@alibaba-inc.com>
>>
>> In this patch we use an approach similar to the one for memcpy to
>> avoid branch instructions and force the destination to be aligned
>> for AVX instructions. With the gcc.403 benchmark we find that memset
>> takes 5~20 times longer than memcpy.
>>
> Another issue is whether the big loop is really needed. I tested a
> variant with the big loop disabled on Ivy Bridge; for sizes up to 262144
> performance is about the same, but beyond that rep movsb becomes 20% faster.
>
> Ljuba, could you also test this case?
>
> size: 262144
> 0.44	0.45	0.44
> 0.46	0.44	0.43
> 0.44	0.44	0.44
> 0.45	0.45	0.45
> 0.46	0.44	0.45
> 0.44	0.44	0.46
> 0.45	0.44	0.46
> 0.44	0.44	0.44
> 0.44	0.45	0.45
> 0.48	0.44	0.44
> size: 524288
> 0.54	0.47	0.45
> 0.55	0.45	0.45
> 0.55	0.44	0.46
> 0.53	0.45	0.46
> 0.52	0.45	0.44
> 0.54	0.45	0.44
> 0.54	0.44	0.45
> 0.55	0.44	0.45
> 0.52	0.44	0.46
> 0.54	0.45	0.45
>
>> +	ALIGN(4)
>> +L(gobble_data):
>> +#ifdef SHARED_CACHE_SIZE_HALF
>> +	mov	$SHARED_CACHE_SIZE_HALF, %r9
>> +#else
>> +	mov	__x86_shared_cache_size_half(%rip), %r9
>> +#endif
>> +	shl	$4, %r9
> Getting half of the cache size and then multiplying it by 16?
>> +	cmp	%r9, %rdx
>> +	ja	L(gobble_big_data)
>> +	mov	%rax, %r9
>> +	mov	%esi, %eax
>> +	mov	%rdx, %rcx
>> +	rep	stosb
>> +	mov	%r9, %rax
>> +	vzeroupper
>> +	ret
>> +
>> +	ALIGN(4)
>> +L(gobble_big_data):
>> +	sub	$0x80, %rdx
>> +L(gobble_big_data_loop):
>> +	vmovntdq	%ymm0, (%rdi)
>> +	vmovntdq	%ymm0, 0x20(%rdi)
>> +	vmovntdq	%ymm0, 0x40(%rdi)
>> +	vmovntdq	%ymm0, 0x60(%rdi)
>> +	lea	0x80(%rdi), %rdi
>> +	sub	$0x80, %rdx
>> +	jae	L(gobble_big_data_loop)
>> +	vmovups	%ymm0, -0x80(%r8)
>> +	vmovups	%ymm0, -0x60(%r8)
>> +	vmovups	%ymm0, -0x40(%r8)
>> +	vmovups	%ymm0, -0x20(%r8)
>> +	vzeroupper
>> +	sfence
>> +	ret
>> +
>> +END (MEMSET)
>> +#endif
>
>
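
The control flow of L(gobble_big_data) in the quoted patch can be modeled in portable C. This is a rough sketch, not the patch itself: plain memset calls stand in for the vmovntdq and vmovups stores, the function name fill_big is hypothetical, and it assumes the size is at least 128 bytes and that %r8 holds the end of the destination (the quoted fragment does not show where %r8 is set).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the big-data path: 128-byte blocks in a loop,
   then one overlapping 128-byte store that ends exactly at the end of
   the buffer, so no separate tail loop is needed. */
static void fill_big(unsigned char *dst, int c, size_t n)
{
    assert(n >= 128);                 /* assumption: only large sizes reach here */
    unsigned char *end = dst + n;     /* assumed role of %r8 in the patch */
    size_t left = n - 128;            /* sub $0x80, %rdx */

    for (;;) {
        memset(dst,      c, 32);      /* stands in for vmovntdq %ymm0, (%rdi) */
        memset(dst + 32, c, 32);      /* ... 0x20(%rdi) */
        memset(dst + 64, c, 32);      /* ... 0x40(%rdi) */
        memset(dst + 96, c, 32);      /* ... 0x60(%rdi) */
        dst += 128;                   /* lea 0x80(%rdi), %rdi */
        if (left < 128)               /* sub $0x80, %rdx would borrow: jae falls through */
            break;
        left -= 128;
    }

    /* Stands in for the four vmovups stores at -0x80(%r8) .. -0x20(%r8):
       covers the remaining 0..127 bytes, possibly rewriting bytes already set. */
    memset(end - 128, c, 128);
}
```

The loop never writes past the end because it stores at most n rounded down to a multiple of 128 bytes; the final overlapping store handles whatever is left.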

