This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- From: Ling Ma <ling dot ma dot program at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org, aj at suse dot com, liubov dot dmitrieva at gmail dot com, Ma Ling <ling dot ml at alibaba-inc dot com>
- Date: Tue, 30 Jul 2013 21:13:28 +0800
- Subject: Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- References: <1375090922-8418-1-git-send-email-ling dot ma dot program at gmail dot com> <20130730121907 dot GA2881 at domone dot kolej dot mff dot cuni dot cz>
>> + mov __x86_shared_cache_size_half(%rip), %r9
Ling: our machine has 4 cores, each running 2 hyperthreads, with an 8M
last-level cache, so __x86_shared_cache_size_half(%rip) should give us
512k (8M / 8 logical cores / 2). When the data size is over the 8M LLC
(512k * 16), we use non-temporal stores.
Thanks
Ling
2013/7/30, Ondřej Bílka <neleai@seznam.cz>:
> On Mon, Jul 29, 2013 at 05:42:02AM -0400, ling.ma.program@gmail.com wrote:
>> From: Ma Ling <ling.ml@alibaba-inc.com>
>>
>> In this patch we use an approach similar to memcpy's to
>> avoid branch instructions and to force the destination to be aligned
>> with AVX instructions. In the gcc.403 benchmark we find memset
>> spends 5~20 times more time than memcpy.
>>
> Another issue is whether the big loop is really needed. I tested a variant
> with the big loop disabled on Ivy Bridge: for sizes up to 262144 performance
> is about the same, but beyond that rep movsb becomes 20% faster.
>
> Ljuba, could you test also this case?
>
> size: 262144
> 0.44 0.45 0.44
> 0.46 0.44 0.43
> 0.44 0.44 0.44
> 0.45 0.45 0.45
> 0.46 0.44 0.45
> 0.44 0.44 0.46
> 0.45 0.44 0.46
> 0.44 0.44 0.44
> 0.44 0.45 0.45
> 0.48 0.44 0.44
> size: 524288
> 0.54 0.47 0.45
> 0.55 0.45 0.45
> 0.55 0.44 0.46
> 0.53 0.45 0.46
> 0.52 0.45 0.44
> 0.54 0.45 0.44
> 0.54 0.44 0.45
> 0.55 0.44 0.45
> 0.52 0.44 0.46
> 0.54 0.45 0.45
>
>> + ALIGN(4)
>> +L(gobble_data):
>> +#ifdef SHARED_CACHE_SIZE_HALF
>> + mov $SHARED_CACHE_SIZE_HALF, %r9
>> +#else
>> + mov __x86_shared_cache_size_half(%rip), %r9
>> +#endif
>> + shl $4, %r9
> Getting half of the cache size, then multiplying it by 16?
>> + cmp %r9, %rdx
>> + ja L(gobble_big_data)
>> + mov %rax, %r9
>> + mov %esi, %eax
>> + mov %rdx, %rcx
>> + rep stosb
>> + mov %r9, %rax
>> + vzeroupper
>> + ret
>> +
>> + ALIGN(4)
>> +L(gobble_big_data):
>> + sub $0x80, %rdx
>> +L(gobble_big_data_loop):
>> + vmovntdq %ymm0, (%rdi)
>> + vmovntdq %ymm0, 0x20(%rdi)
>> + vmovntdq %ymm0, 0x40(%rdi)
>> + vmovntdq %ymm0, 0x60(%rdi)
>> + lea 0x80(%rdi), %rdi
>> + sub $0x80, %rdx
>> + jae L(gobble_big_data_loop)
>> + vmovups %ymm0, -0x80(%r8)
>> + vmovups %ymm0, -0x60(%r8)
>> + vmovups %ymm0, -0x40(%r8)
>> + vmovups %ymm0, -0x20(%r8)
>> + vzeroupper
>> + sfence
>> + ret
>> +
>> +END (MEMSET)
>> +#endif
>
>