This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: ling dot ma dot program at gmail dot com
- Cc: libc-alpha at sourceware dot org, rth at twiddle dot net, aj at suse dot com, liubov dot dmitrieva at gmail dot com, hjl dot tools at gmail dot com, Ling Ma <ling dot ml at alibaba-inc dot com>
- Date: Tue, 13 May 2014 19:36:16 +0200
- Subject: Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction
- Authentication-results: sourceware.org; auth=none
- References: <1396850238-29041-1-git-send-email-ling dot ma at alipay dot com>
On Mon, Apr 07, 2014 at 01:57:18AM -0400, ling.ma.program@gmail.com wrote:
> From: Ling Ma <ling.ml@alibaba-inc.com>
>
> In this patch we take advantage of HSW memory bandwidth, and manage
> to reduce branch mispredictions by avoiding branch instructions and
> by forcing the destination to be aligned using AVX instructions.
>
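For readers: the branchless alignment trick described above looks
roughly like this in C. This is my sketch of the general technique,
not the patch code, and it assumes len >= 32.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* One unaligned 32-byte store covers the head, then the pointer is
   rounded up to a 32-byte boundary and the body is written with
   aligned stores; the overlap is harmless for memset.  A final
   unaligned store ending exactly at dst + len covers the tail, so no
   branches are needed for the misaligned parts.  */
static void
align_without_branches (char *dst, int c, size_t len)
{
  __m256i v = _mm256_set1_epi8 ((char) c);
  char *end = dst + len;

  _mm256_storeu_si256 ((__m256i *) dst, v);		/* unaligned head */
  char *p = (char *) (((uintptr_t) dst + 0x20) & ~(uintptr_t) 0x1f);
  while (p + 0x20 <= end)
    {
      _mm256_store_si256 ((__m256i *) p, v);		/* aligned body */
      p += 0x20;
    }
  _mm256_storeu_si256 ((__m256i *) (end - 0x20), v);	/* unaligned tail */
}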
Now that we have a Haswell machine in our department, I tested this
implementation. The benchmark used and the results are here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_avx130514.tar.bz2
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_avx.html
This patch improves large inputs and does not regress small inputs
much, which gives a total 10% improvement on the gcc test. It could be
improved further, but it looks good enough for now.
I tried two alternatives. The first is using avx2 in the header
(memset_fuse). It looks like it helps, adding an additional 0.5% of
performance. However, when I tried to cross-check this with the bash
shell, the comparison went in the opposite direction, so I am not
entirely sure yet; see
http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memset_profile_avx/results_bash/result.html
The second is checking whether the rep stosb threshold is the best one
(the memset_rep and memset_avx_v2 variants). This depends on the
application cache layout, and I do not have a definite answer yet; when
the data is in the L2 cache we could lower the threshold to 1024 bytes,
but for some reason that slows down real inputs.
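For concreteness, the dispatch being tuned is essentially the
following. This is only my sketch: REP_THRESHOLD is a made-up macro
standing in for the real cutoff, and plain memset stands in for the
small-size AVX code.

#include <stddef.h>
#include <string.h>

#ifndef REP_THRESHOLD
/* Hypothetical default; 1024 is the aggressive value tried above.  */
# define REP_THRESHOLD 2048
#endif

void *
memset_dispatch_sketch (void *dst, int c, size_t len)
{
  if (len >= REP_THRESHOLD)
    {
      /* Large sizes: let the microcoded rep stosb do the fill.  */
      void *d = dst;
      size_t n = len;
      __asm__ volatile ("rep stosb"
			: "+D" (d), "+c" (n)
			: "a" (c)
			: "memory");
      return dst;
    }
  /* Small sizes: ordinary stores; plain memset stands in for the
     AVX small-size code here.  */
  return memset (dst, c, len);
}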
> The CPU2006 403.gcc benchmark also indicates that this patch improves
> performance by 22.9% to 59% compared with the original memset
> implemented with SSE2.
>
I inspected that benchmark with my profiler; it is not that
representative, as it covers only a simple part of gcc and two thirds
of the total time is spent on 240-byte inputs. A large part of the
speedup can be explained by the avx2 implementation having a
special-case branch for the 128-256 byte range where the current one
uses a loop. These input distributions differ from other programs, and
from running gcc itself, where short inputs are more common.
> + ALIGN(4)
> +L(gobble_data):
> +#ifdef SHARED_CACHE_SIZE_HALF
> + mov $SHARED_CACHE_SIZE_HALF, %r9
> +#else
> + mov __x86_shared_cache_size_half(%rip), %r9
> +#endif
> + shl $4, %r9
> + cmp %r9, %rdx
> + ja L(gobble_big_data)
> + mov %rax, %r9
> + mov %esi, %eax
> + mov %rdx, %rcx
> + rep stosb
> + mov %r9, %rax
> + vzeroupper
> + ret
> +
> + ALIGN(4)
> +L(gobble_big_data):
> + sub $0x80, %rdx
> +L(gobble_big_data_loop):
> + vmovntdq %ymm0, (%rdi)
> + vmovntdq %ymm0, 0x20(%rdi)
> + vmovntdq %ymm0, 0x40(%rdi)
> + vmovntdq %ymm0, 0x60(%rdi)
> + lea 0x80(%rdi), %rdi
> + sub $0x80, %rdx
> + jae L(gobble_big_data_loop)
> + vmovups %ymm0, -0x80(%r8)
> + vmovups %ymm0, -0x60(%r8)
> + vmovups %ymm0, -0x40(%r8)
> + vmovups %ymm0, -0x20(%r8)
> + vzeroupper
> + sfence
> + ret
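For reference, the quoted L(gobble_big_data) path is roughly
equivalent to this C with intrinsics. My sketch, not code from the
patch; it assumes len >= 128 and that dst was already 32-byte aligned
by the earlier, unquoted part of the function (vmovntdq faults on
unaligned addresses).

#include <immintrin.h>
#include <stddef.h>

static void
gobble_big_data_sketch (char *dst, int c, size_t len)
{
  __m256i v = _mm256_set1_epi8 ((char) c);
  char *end = dst + len;

  /* 128 bytes of non-temporal stores per iteration.  */
  while (len >= 0x80)
    {
      _mm256_stream_si256 ((__m256i *) dst, v);
      _mm256_stream_si256 ((__m256i *) (dst + 0x20), v);
      _mm256_stream_si256 ((__m256i *) (dst + 0x40), v);
      _mm256_stream_si256 ((__m256i *) (dst + 0x60), v);
      dst += 0x80;
      len -= 0x80;
    }
  /* Tail: four overlapping unaligned stores ending exactly at
     dst + len cover the last <= 128 bytes without a branch.  */
  _mm256_storeu_si256 ((__m256i *) (end - 0x80), v);
  _mm256_storeu_si256 ((__m256i *) (end - 0x60), v);
  _mm256_storeu_si256 ((__m256i *) (end - 0x40), v);
  _mm256_storeu_si256 ((__m256i *) (end - 0x20), v);
  _mm_sfence ();
}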
That loop does not seem to help on Haswell at all; it is
indistinguishable from the rep stosb path above. I used the following
benchmark to check that with different sizes, but performance stayed
the same.
#include <stdlib.h>
#include <string.h>

/* MEMSET is set on the compiler command line to the variant under
   test.  */
void *MEMSET (void *, int, size_t);

int
main (void)
{
  int i;
  char *x = malloc (100000000);	/* 100 MB, well beyond any cache.  */
  for (i = 0; i < 100; i++)
    MEMSET (x, 0, 100000000);
  free (x);
  return 0;
}
The driver script:

for I in `seq 1 10`; do
  echo avx
  gcc -L. -DMEMSET=__memset_avx2 big.c -lc_profile
  time LD_LIBRARY_PATH=. ./a.out
  echo rep
  gcc -L. -DMEMSET=__memset_rep big.c -lc_profile
  time LD_LIBRARY_PATH=. ./a.out
done