This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction


On Mon, Jul 29, 2013 at 05:42:02AM -0400, ling.ma.program@gmail.com wrote:
> From: Ma Ling <ling.ml@alibaba-inc.com>
> 
> In this patch we use an approach similar to memcpy to
> avoid branch instructions and force the destination to be aligned
> for avx instructions. With the gcc.403 benchmark we find that memset
> takes 5~20 times longer than memcpy.
> 
Another question is whether the big loop is really needed. I tested a
variant with the big loop disabled on Ivy Bridge: for sizes up to 262144
the performance is about the same, but above that size rep stosb becomes
about 20% faster.

Ljuba, could you also test this case?

size: 262144
0.44	0.45	0.44
0.46	0.44	0.43
0.44	0.44	0.44
0.45	0.45	0.45
0.46	0.44	0.45
0.44	0.44	0.46
0.45	0.44	0.46
0.44	0.44	0.44
0.44	0.45	0.45
0.48	0.44	0.44
size: 524288
0.54	0.47	0.45
0.55	0.45	0.45
0.55	0.44	0.46
0.53	0.45	0.46
0.52	0.45	0.44
0.54	0.45	0.44
0.54	0.44	0.45
0.55	0.44	0.45
0.52	0.44	0.46
0.54	0.45	0.45
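
For anyone who wants to reproduce this kind of per-size comparison
without the attachment, here is a minimal sketch of a timing helper. It
is not the benchmark that produced the numbers above; the name
time_memset, the use of clock(), and the iteration count are all
assumptions made for illustration. It times whichever memset the
program is linked against.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not from the attached benchmark): run memset
   `iters` times over a `size`-byte buffer and return the CPU time used
   in seconds, or -1.0 if allocation or the sanity check fails.  */
static double time_memset(size_t size, int iters)
{
    assert(size > 0 && iters > 0);

    unsigned char *buf = malloc(size);
    if (buf == NULL)
        return -1.0;

    clock_t start = clock();
    for (int i = 0; i < iters; i++) {
        memset(buf, i & 0xff, size);
        /* Compiler barrier (GCC/Clang syntax) so the stores are kept.  */
        __asm__ volatile("" : : "r"(buf) : "memory");
    }
    clock_t stop = clock();

    /* Sanity check: first and last byte hold the last fill value.  */
    unsigned char expect = (unsigned char)((iters - 1) & 0xff);
    int ok = buf[0] == expect && buf[size - 1] == expect;

    free(buf);
    return ok ? (double)(stop - start) / CLOCKS_PER_SEC : -1.0;
}
```

Calling it as time_memset(262144, 10000) and time_memset(524288, 10000)
for each memset variant would give one row per run, comparable in shape
to the tables above.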
 
> +	ALIGN(4)
> +L(gobble_data):
> +#ifdef SHARED_CACHE_SIZE_HALF
> +	mov	$SHARED_CACHE_SIZE_HALF, %r9
> +#else
> +	mov	__x86_shared_cache_size_half(%rip), %r9
> +#endif
> +	shl	$4, %r9
Why take half of the shared cache size and then multiply it by 16?
> +	cmp	%r9, %rdx
> +	ja	L(gobble_big_data)
> +	mov	%rax, %r9
> +	mov	%esi, %eax
> +	mov	%rdx, %rcx
> +	rep	stosb
> +	mov	%r9, %rax
> +	vzeroupper
> +	ret
> +
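
To make the threshold arithmetic in the block above concrete, here is a
small C model of the shl/cmp/ja sequence. The function names are
hypothetical, and the 8 MiB shared cache used in the usage note is an
assumed figure, not a measured one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models `mov __x86_shared_cache_size_half, %r9; shl $4, %r9`:
   half the shared cache size times 16 is 8 times the full size.  */
static size_t nt_threshold(size_t shared_cache_size_half)
{
    assert(shared_cache_size_half <= SIZE_MAX >> 4); /* no overflow */
    return shared_cache_size_half << 4;
}

/* Models `cmp %r9, %rdx; ja L(gobble_big_data)`: an unsigned
   "strictly above" test on the length n.  */
static int takes_big_data_path(size_t n, size_t shared_cache_size_half)
{
    return n > nt_threshold(shared_cache_size_half);
}
```

With an assumed 8 MiB shared cache the half size is 4 MiB and the
threshold is 64 MiB, so under that assumption both sizes measured
above (262144 and 524288) would stay on the rep stosb path.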
> +	ALIGN(4)
> +L(gobble_big_data):
> +	sub	$0x80, %rdx
> +L(gobble_big_data_loop):
> +	vmovntdq	%ymm0, (%rdi)
> +	vmovntdq	%ymm0, 0x20(%rdi)
> +	vmovntdq	%ymm0, 0x40(%rdi)
> +	vmovntdq	%ymm0, 0x60(%rdi)
> +	lea	0x80(%rdi), %rdi
> +	sub	$0x80, %rdx
> +	jae	L(gobble_big_data_loop)
> +	vmovups	%ymm0, -0x80(%r8)
> +	vmovups	%ymm0, -0x60(%r8)
> +	vmovups	%ymm0, -0x40(%r8)
> +	vmovups	%ymm0, -0x20(%r8)
> +	vzeroupper
> +	sfence
> +	ret
> +
> +END (MEMSET)
> +#endif
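
For reference, the control flow of L(gobble_big_data) can be modelled
in plain C as below. This is only a sketch of the loop and tail
structure: it assumes %r8 holds the end of the destination (that setup
is not in the quoted excerpt), uses ordinary memset calls in place of
the vmovntdq and vmovups stores, and ignores alignment and the sfence.
The name memset_big_model is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* C model of the big-data path: whole 128-byte chunks from the start,
   then one possibly overlapping 128-byte store ending at dst + n.  */
static void *memset_big_model(void *dst, int c, size_t n)
{
    assert(n >= 128);                /* path is only taken for large n */

    unsigned char *p = dst;
    unsigned char *end = p + n;      /* assumed role of %r8 */
    size_t left = n - 128;           /* sub $0x80, %rdx */

    for (;;) {
        memset(p, c, 128);           /* stands in for 4 x vmovntdq */
        p += 128;                    /* lea 0x80(%rdi), %rdi */
        if (left < 128)              /* sub would borrow: jae not taken */
            break;
        left -= 128;                 /* sub $0x80, %rdx */
    }

    memset(end - 128, c, 128);       /* stands in for 4 x vmovups */
    return dst;
}
```

The loop writes floor(n / 128) chunks and never passes the end of the
buffer; the final store covers the remaining n mod 128 bytes by
overlapping bytes that were already written, which avoids a separate
remainder loop.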

Attachment: memset_big.tar.bz2
Description: Binary data

