This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction


On Wed, Jun 04, 2014 at 03:00:05PM +0800, Ling Ma wrote:
> H.J
> 
> The website changed its IP; the code is now available again at
>  http://www.yunos.org/tmp/memset-avx2.patch
> and is also gzipped as an attachment to this mail.
> 
> Thanks
> Ling
>
 
Now the performance looks OK to me, but there are a few formatting problems.
With those fixed I would be satisfied. H.J., do you have comments?

There is a possible followup to also optimize __bzero, as we do in the
general case.
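
A rough sketch of what such an entry point could look like (the names
__bzero_avx2 and L(memset_entry) below are only placeholders, not taken
from the patch):

ENTRY (__bzero_avx2)
	mov	%rdi, %rax	/* memset's return value; harmless for bzero.  */
	mov	%rsi, %rdx	/* bzero's n becomes memset's length.  */
	xor	%esi, %esi	/* Fill byte c = 0.  */
	jmp	L(memset_entry)	/* Assumed label at the top of the memset body.  */
END (__bzero_avx2)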

A second followup would be to decrease the function size by reshuffling
blocks; in several places there are 15/16 free bytes due to alignment.
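
For instance, the L(less_8bytes) block quoted in the formatting hunks
below is only reached through a branch, so it does not strictly need its
own .p2align 4; dropping that directive and emitting the block in the
slack before an already aligned label would turn NOP padding into real
code (where exactly it would land is just for illustration here):

	/* These bytes would otherwise be NOP padding.  */
L(less_8bytes):
	cmp	$4, %dl
	jb	L(less_4bytes)
	mov	%ecx, (%rdi)
	mov	%ecx, -0x04(%rsi)
	ret
	.p2align 4
L(gobble_128_loop):
	vmovdqa	%ymm0, (%rdi)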

The formatting problems (a space instead of a tab after the instruction
mnemonic) are here:

+	vpxor	%xmm0, %xmm0, %xmm0
+	vmovd %esi, %xmm1
+	mov	%rdi, %rsi
+	mov	%rdi, %rax

here

+L(less_16bytes):
+	vmovd %xmm0, %rcx
+	cmp	$8, %dl
+	jb	L(less_8bytes)
+	mov %rcx, (%rdi)
+	mov %rcx, -0x08(%rsi)
+	ret
+
+	.p2align 4
+L(less_8bytes):
+	cmp	$4, %dl
+	jb	L(less_4bytes)
+	mov %ecx, (%rdi)
+	mov %ecx, -0x04(%rsi)
+	ret

and here

+	mov	%rax, %rsi
+	vmovd %xmm0, %eax
+	mov	%rdx, %rcx
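
For reference, the first of these hunks would read as follows with a tab
after each mnemonic:

+	vpxor	%xmm0, %xmm0, %xmm0
+	vmovd	%esi, %xmm1
+	mov	%rdi, %rsi
+	mov	%rdi, %rax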

As I mentioned regarding code size, one trick is that instructions with
a -128 immediate are shorter than those with 128. You could save 16
bytes with the following modification; however, it must be tested
whether it improves performance.


--- x	2014-06-05 18:20:35.313645591 +0200
+++ sysdeps/x86_64/multiarch/memset-avx2.S	2014-06-05 18:22:25.068642767 +0200
@@ -95,7 +95,6 @@
 	.p2align 4
 L(256bytesormore):
 	vinserti128 $1, %xmm0, %ymm0, %ymm0
-	mov	$0x80, %rcx
 	add	%rdx, %rsi
 	mov	%rdi, %r9
 	vmovdqu	%ymm0, (%rdi)
@@ -105,15 +104,15 @@
 	add	%r9, %rdx
 	cmp	$4096, %rdx
 	ja	L(gobble_data)
-	sub	%ecx, %edx
+	add	$-128, %edx
 L(gobble_128_loop):
 	vmovdqa	%ymm0, (%rdi)
 	vmovdqa	%ymm0, 0x20(%rdi)
 	vmovdqa	%ymm0, 0x40(%rdi)
 	vmovdqa	%ymm0, 0x60(%rdi)
-	add	%rcx, %rdi
-	sub	%ecx, %edx
-	jae	L(gobble_128_loop)
+	sub	$-128, %rdi
+	add	$-128, %edx
+	jb	L(gobble_128_loop)
 	vmovdqu	%ymm0, -0x80(%rsi)
 	vmovdqu	%ymm0, -0x60(%rsi)
 	vmovdqu	%ymm0, -0x40(%rsi)

