This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: ling dot ma dot program at gmail dot com
- Cc: libc-alpha at sourceware dot org, rth at twiddle dot net, aj at suse dot com, liubov dot dmitrieva at gmail dot com, hjl dot tools at gmail dot com, Ling Ma <ling dot ml at alibaba-inc dot com>
- Date: Tue, 13 May 2014 19:36:16 +0200
- Subject: Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction
- Authentication-results: sourceware.org; auth=none
- References: <1396850238-29041-1-git-send-email-ling dot ma at alipay dot com>
On Mon, Apr 07, 2014 at 01:57:18AM -0400, ling.ma.program@gmail.com wrote:
> From: Ling Ma <ling.ml@alibaba-inc.com>
>
> In this patch we take advantage of HSW memory bandwidth, and manage
> to reduce branch mispredictions by avoiding branch instructions and
> by forcing the destination to be aligned using AVX instructions.
>
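For readers: the branchless alignment trick described above looks
roughly like this in C. This is my sketch of the general technique,
not the patch code, and it assumes len >= 32.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* One unaligned 32-byte store covers the head, then the pointer is
   rounded up to a 32-byte boundary and the body is written with
   aligned stores; the overlap is harmless for memset.  A final
   unaligned store ending exactly at dst + len covers the tail, so no
   branches are needed for the misaligned parts.  */
static void
align_without_branches (char *dst, int c, size_t len)
{
  __m256i v = _mm256_set1_epi8 ((char) c);
  char *end = dst + len;

  _mm256_storeu_si256 ((__m256i *) dst, v);		/* unaligned head */
  char *p = (char *) (((uintptr_t) dst + 0x20) & ~(uintptr_t) 0x1f);
  while (p + 0x20 <= end)
    {
      _mm256_store_si256 ((__m256i *) p, v);		/* aligned body */
      p += 0x20;
    }
  _mm256_storeu_si256 ((__m256i *) (end - 0x20), v);	/* unaligned tail */
}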
Now that we have a Haswell machine in our department, I tested this
implementation. The benchmark used and the results are here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_avx130514.tar.bz2
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_avx.html
This patch improves large inputs and does not regress small inputs
much, which gives a total 10% improvement on the gcc test. It could be
improved further, but it looks good enough for now.
I tried two alternatives. The first is using avx2 in the header
(memset_fuse). It looks like it helps, adding an additional 0.5% of
performance. However, when I tried to cross-check this with the bash
shell, the comparison went in the opposite direction, so I am not
entirely sure yet; see
http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memset_profile_avx/results_bash/result.html
The second is checking whether the rep stosb threshold is the best one
(the memset_rep and memset_avx_v2 variants). This depends on the
application cache layout, and I do not have a definite answer yet; when
the data is in the L2 cache we could lower the threshold to 1024 bytes,
but for some reason that slows down real inputs.
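For concreteness, the dispatch being tuned is essentially the
following. This is only my sketch: REP_THRESHOLD is a made-up macro
standing in for the real cutoff, and plain memset stands in for the
small-size AVX code.

#include <stddef.h>
#include <string.h>

#ifndef REP_THRESHOLD
/* Hypothetical default; 1024 is the aggressive value tried above.  */
# define REP_THRESHOLD 2048
#endif

void *
memset_dispatch_sketch (void *dst, int c, size_t len)
{
  if (len >= REP_THRESHOLD)
    {
      /* Large sizes: let the microcoded rep stosb do the fill.  */
      void *d = dst;
      size_t n = len;
      __asm__ volatile ("rep stosb"
			: "+D" (d), "+c" (n)
			: "a" (c)
			: "memory");
      return dst;
    }
  /* Small sizes: ordinary stores; plain memset stands in for the
     AVX small-size code here.  */
  return memset (dst, c, len);
}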
> The CPU2006 403.gcc benchmark also indicates that this patch improves
> performance by 22.9% to 59% compared with the original memset
> implemented with SSE2.
>
I inspected that benchmark with my profiler; it is not that
representative, as it covers only a simple part of gcc and two thirds
of the total time is spent on 240-byte inputs. A large part of the
speedup can be explained by the avx2 implementation having a
special-case branch for the 128-256 byte range where the current one
uses a loop. These input distributions differ from other programs, and
from running gcc itself, where short inputs are more common.
> + ALIGN(4)
> +L(gobble_data):
> +#ifdef SHARED_CACHE_SIZE_HALF
> + mov $SHARED_CACHE_SIZE_HALF, %r9
> +#else
> + mov __x86_shared_cache_size_half(%rip), %r9
> +#endif
> + shl $4, %r9
> + cmp %r9, %rdx
> + ja L(gobble_big_data)
> + mov %rax, %r9
> + mov %esi, %eax
> + mov %rdx, %rcx
> + rep stosb
> + mov %r9, %rax
> + vzeroupper
> + ret
> +
> + ALIGN(4)
> +L(gobble_big_data):
> + sub $0x80, %rdx
> +L(gobble_big_data_loop):
> + vmovntdq %ymm0, (%rdi)
> + vmovntdq %ymm0, 0x20(%rdi)
> + vmovntdq %ymm0, 0x40(%rdi)
> + vmovntdq %ymm0, 0x60(%rdi)
> + lea 0x80(%rdi), %rdi
> + sub $0x80, %rdx
> + jae L(gobble_big_data_loop)
> + vmovups %ymm0, -0x80(%r8)
> + vmovups %ymm0, -0x60(%r8)
> + vmovups %ymm0, -0x40(%r8)
> + vmovups %ymm0, -0x20(%r8)
> + vzeroupper
> + sfence
> + ret
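For reference, the quoted L(gobble_big_data) path is roughly
equivalent to this C with intrinsics. My sketch, not code from the
patch; it assumes len >= 128 and that dst was already 32-byte aligned
by the earlier, unquoted part of the function (vmovntdq faults on
unaligned addresses).

#include <immintrin.h>
#include <stddef.h>

static void
gobble_big_data_sketch (char *dst, int c, size_t len)
{
  __m256i v = _mm256_set1_epi8 ((char) c);
  char *end = dst + len;

  /* 128 bytes of non-temporal stores per iteration.  */
  while (len >= 0x80)
    {
      _mm256_stream_si256 ((__m256i *) dst, v);
      _mm256_stream_si256 ((__m256i *) (dst + 0x20), v);
      _mm256_stream_si256 ((__m256i *) (dst + 0x40), v);
      _mm256_stream_si256 ((__m256i *) (dst + 0x60), v);
      dst += 0x80;
      len -= 0x80;
    }
  /* Tail: four overlapping unaligned stores ending exactly at
     dst + len cover the last <= 128 bytes without a branch.  */
  _mm256_storeu_si256 ((__m256i *) (end - 0x80), v);
  _mm256_storeu_si256 ((__m256i *) (end - 0x60), v);
  _mm256_storeu_si256 ((__m256i *) (end - 0x40), v);
  _mm256_storeu_si256 ((__m256i *) (end - 0x20), v);
  _mm_sfence ();
}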
That loop does not seem to help on Haswell at all; it is
indistinguishable from the rep stosb path above. I used the following
benchmark to check that with different sizes, but performance stayed
the same.
#include <stdlib.h>
#include <string.h>

/* MEMSET is set on the compiler command line to the variant under
   test.  */
void *MEMSET (void *, int, size_t);

int
main (void)
{
  int i;
  char *x = malloc (100000000);	/* 100 MB, well beyond any cache.  */
  for (i = 0; i < 100; i++)
    MEMSET (x, 0, 100000000);
  free (x);
  return 0;
}
The driver script:

for I in `seq 1 10`; do
  echo avx
  gcc -L. -DMEMSET=__memset_avx2 big.c -lc_profile
  time LD_LIBRARY_PATH=. ./a.out
  echo rep
  gcc -L. -DMEMSET=__memset_rep big.c -lc_profile
  time LD_LIBRARY_PATH=. ./a.out
done