This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH RFC] Improve 64bit memset for Corei7 with avx2 instruction


On Wed, Jul 10, 2013 at 07:31:16AM -0400, ling.ma.program@gmail.com wrote:
> From: Ma Ling <ling.ml@alibaba-inc.com>
> 
> In this patch we use an approach similar to memcpy: avoid branch
> instructions and force the destination to be aligned, using AVX2 instructions.
> Measured with the gcc.403 benchmark, memset spends 5~10 times more time than memcpy.
> The benchmark also indicates this patch improves performance by 30% to 100%
> compared with the original __memset_sse2.
>
Please provide a link to the benchmark and describe how it is run. Without
that you have only your word, and I could say, for example:

I ran bogusbench on my hard disk: not only is the AVX version 3 times
slower than __memset_sse2, but I also checked __memset_sse2 against a
byte-by-byte version, and the byte-by-byte version wins by a factor of four.

and my words would carry equal weight.

Also, your implementation gives wrong answers; please fix that before we
can seriously consider it.
 
> Thanks
> Ling
> ---

> +ENTRY (MEMSET)
> +	vzeroupper
> +	mov	$0x01010101, %ecx
> +	mov	%rdi, %rax
> +	cmp	$256, %rdx
> +	jae	L(256bytesormore)
> +	cmp	$1, %rdx
> +	ja	L(more_1_bytes)
> +	jb	L(less_1_bytes)

Do you think that using memset to write a single byte is a likely case?
If not, then you have slowed memset down by always doing an unnecessary
check that could instead be done only once we know that n < 4.
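
Something along these lines would keep the common sizes away from the
tiny-size tests (just a sketch of the dispatch, untested; the return-value
setup is omitted and the label names are only illustrative):

	cmp	$256, %rdx
	jae	L(256bytesormore)
	cmp	$4, %rdx
	ja	L(more_4bytes)		/* common sizes skip the n <= 1 tests */
/* Rare path: only here, with n <= 4, pay for the byte-granularity checks.  */
	cmp	$1, %rdx
	jbe	L(1_byte_or_less)
	movzbl	%sil, %ecx
	movb	%cl, %ch
	mov	%cx, (%rdi)		/* 2 <= n <= 4: two overlapping 2-byte stores */
	mov	%cx, -0x02(%rdi, %rdx)
	ret
L(1_byte_or_less):
	jb	L(return)		/* flags from the cmp above: n == 0 */
	mov	%sil, (%rdi)		/* n == 1 */
L(return):
	ret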

> +	mov	%sil, (%rdi)
> +L(less_1_bytes):
> +	ret
> +	ALIGN(4)
> +L(more_1_bytes):
> +	movzbl	%sil, %ecx
> +	movb	%cl, %ch
> +	cmp	$4, %edx
> +	ja	L(more_3bytes)
> +	mov	%cx, (%rdi)
> +	mov	%cx, -0x02(%rdi, %rdx)
> +	ret
> +	ALIGN(4)
> +L(more_3bytes):
> +	mov	%ecx, %r9d
> +	shl	$16, %ecx
> +	or	%r9d, %ecx
> +	cmp	$8, %edx
> +	ja	L(more_8bytes)
> +	mov %ecx, (%rdi)
> +	mov %ecx, -0x04(%rdi, %rdx)
> +	ret
snip
> +	ALIGN(4)
> +L(256bytesormore):
> +	lea	(%rdi, %rdx), %r8
> +	
> +#ifndef USE_AS_BZERO
By duplicating the bzero and memset code you have doubled the icache
pressure and made branch prediction weaker. In bzero you should only set
up the arguments and jump to memset.
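
Something like this (a sketch only; I am assuming the shared memset body is
reachable under the name __memset_avx2, adjust to whatever the real entry is
called):

ENTRY (__bzero_avx2)
	mov	%rsi, %rdx	/* bzero's length becomes memset's third argument */
	xor	%esi, %esi	/* the fill value is zero */
	jmp	__memset_avx2	/* reuse the single memset body */
END (__bzero_avx2)

That keeps one copy of the code in the icache and one set of branches for
the predictor to learn.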
 
> +	and	$0xff, %esi
> +	imul %esi, %ecx
That is slower than the code GCC generates for the _mm_set1_epi8
intrinsic. Also, pshufb is available here (any AVX2 machine has SSSE3),
so you can broadcast the byte with it directly.
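
For example (an untested sketch; %esi still holds the fill value here, and
pshufb only reads byte 0 of its source, so the and/imul pair above is not
needed at all):

	vmovd	%esi, %xmm0		/* fill byte sits in byte 0 of %xmm0 */
	vpxor	%xmm1, %xmm1, %xmm1	/* all-zero shuffle control */
	vpshufb	%xmm1, %xmm0, %xmm0	/* broadcast byte 0 to all 16 lanes */
	vinserti128 $1, %xmm0, %ymm0, %ymm0	/* replicate into the upper half */

Since this is an AVX2 function you could also do the whole broadcast with a
single vpbroadcastb %xmm0, %ymm0 after the vmovd.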
> +	vmovd %ecx, %xmm0
> +	vpshufd	$0, %xmm0, %xmm0
> +	vzeroupper
> +	vinserti128 $1, %xmm0, %ymm0, %ymm0
> +#else
> +	vpxor %xmm0, %xmm0, %xmm0
> +#endif

> +	sub	$0x80, %rdx
> +L(gobble_128_loop):
> +	prefetcht0 0x1c0(%rdi)
> +	vmovaps	%ymm0, (%rdi)
> +	prefetcht0 0x280(%rdi)
You should not be so aggressive with prefetches when you know how much data
you will use. This fetches unnecessary data, which can double cache usage
and generally slow us down; see the sketch after the quoted code below.
> +	vmovaps	%ymm0, 0x20(%rdi)
> +	vmovaps	%ymm0, 0x40(%rdi)
> +	vmovaps	%ymm0, 0x60(%rdi)
> +	lea	0x80(%rdi), %rdi
> +	sub	$0x80, %rdx
> +	jae	L(gobble_128_loop)
> +L(gobble_exit):
> +	vmovups	%ymm0, -0x80(%r8)
> +	vmovups	%ymm0, -0x60(%r8)
> +	vmovups	%ymm0, -0x40(%r8)
> +	vmovups	%ymm0, -0x20(%r8)
> +	ret
You added vzeroupper in places where it is unnecessary, but here, where it
is needed before the return, you did not.
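
Roughly what I have in mind for the loop and the exit path, covering both
the prefetching and the missing vzeroupper (a sketch only, not benchmarked;
the 0x180 prefetch distance is a guess, and whether a store-only loop wants
a prefetch at all is something to measure):

	sub	$0x80, %rdx
L(gobble_128_loop):
	prefetcht0 0x180(%rdi)		/* at most one prefetch per iteration */
	vmovaps	%ymm0, (%rdi)
	vmovaps	%ymm0, 0x20(%rdi)
	vmovaps	%ymm0, 0x40(%rdi)
	vmovaps	%ymm0, 0x60(%rdi)
	lea	0x80(%rdi), %rdi
	sub	$0x80, %rdx
	jae	L(gobble_128_loop)
L(gobble_exit):
	vmovups	%ymm0, -0x80(%r8)
	vmovups	%ymm0, -0x60(%r8)
	vmovups	%ymm0, -0x40(%r8)
	vmovups	%ymm0, -0x20(%r8)
	vzeroupper			/* clear upper ymm state before returning */
	ret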

