Re: [PATCH v1.2] Improve unaligned memcpy and memmove.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Liubov Dmitrieva <liubov dot dmitrieva at gmail dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Fri, 4 Oct 2013 14:52:48 +0200
- Subject: Re: [PATCH v1.2] Improve unaligned memcpy and memmove.
- References: <20130819085220 dot GB19541 at domone> <20130829153829 dot GA6105 at domone dot kolej dot mff dot cuni dot cz> <20131003220926 dot GA12203 at domone dot podge> <CAHjhQ93gDTLC9jh56PPXPf0DndUBxVd371Xpw1+vPM9HVnHHfw at mail dot gmail dot com>
On Fri, Oct 04, 2013 at 03:14:04PM +0400, Liubov Dmitrieva wrote:
> I don't understand why you use the HAS_SLOW_SSE4_2 flag for the
> Silvermont version. It should be named "Fast_Rep" or something like
> that to make the core feature of the version clear.
> There is already HAS_FAST_REP_STRING; maybe it could be reused.
> --
> Liubov
>
It was the simplest way to identify Silvermont. Silvermont is
exceptional in that rep movsq is faster even for data in the L1 cache
once sizes exceed 4096 bytes. On Core 2 the situation is the opposite:
rep movsq looks fastest for small sizes (up to 256 bytes), until the
ssse3 loop pays for itself.
It might make sense to add Silvermont-specific casing, as in the patch
below. A second possibility is to drive the switch to rep from a
processor-specific table; for Silvermont the threshold would be 4096
bytes.
On Nehalem and Ivy Bridge a loop is faster when data are in the L1
cache, nearly identical for the L2 cache, and by far the best possible
for the L3 cache and beyond, so we could use a threshold of 65536. On
fx10 a rep implementation is always slower, so we would need to
disable it there.
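
To illustrate the second possibility, here is a rough sketch of a
threshold-driven dispatch in plain C. The names __rep_movsq_threshold
and memcpy_dispatch are made up for the example, and a real version
would be wired into the multiarch machinery instead:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-processor threshold, set once at startup:
   4096 on Silvermont, 65536 on Nehalem and Ivy Bridge, SIZE_MAX
   on fx10 where rep movsq always loses.  */
static size_t __rep_movsq_threshold = SIZE_MAX;

static void *
memcpy_dispatch (void *dst, const void *src, size_t n)
{
  if (n >= __rep_movsq_threshold)
    {
      void *ret = dst;
      size_t qwords = n / 8, tail = n % 8;
      /* The direction flag is clear per the ABI; rep movsq
         advances both pointers by 8 * qwords.  */
      asm volatile ("rep movsq"
                    : "+D" (dst), "+S" (src), "+c" (qwords)
                    :: "memory");
      memcpy (dst, src, tail);   /* Copy the remaining n % 8 bytes.  */
      return ret;
    }
  return memcpy (dst, src, n);   /* Vector loop for smaller sizes.  */
}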
---
sysdeps/x86_64/multiarch/init-arch.c | 3 ++-
sysdeps/x86_64/multiarch/init-arch.h | 6 ++++++
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/sysdeps/x86_64/multiarch/init-arch.c b/sysdeps/x86_64/multiarch/init-arch.c
index 5583961..b80d9f2 100644
--- a/sysdeps/x86_64/multiarch/init-arch.c
+++ b/sysdeps/x86_64/multiarch/init-arch.c
@@ -90,7 +90,8 @@ __init_cpu_features (void)
__cpu_features.feature[index_Fast_Unaligned_Load]
|= (bit_Fast_Unaligned_Load
| bit_Prefer_PMINUB_for_stringop
- | bit_Slow_SSE4_2);
+ | bit_Slow_SSE4_2
+ | bit_Is_Silvermont);
break;
default:
diff --git a/sysdeps/x86_64/multiarch/init-arch.h b/sysdeps/x86_64/multiarch/init-arch.h
index 0cb5f5b..36ec445 100644
--- a/sysdeps/x86_64/multiarch/init-arch.h
+++ b/sysdeps/x86_64/multiarch/init-arch.h
@@ -24,6 +24,8 @@
#define bit_FMA_Usable (1 << 7)
#define bit_FMA4_Usable (1 << 8)
#define bit_Slow_SSE4_2 (1 << 9)
+#define bit_Is_Silvermont (1 << 10)
+
/* CPUID Feature flags. */
@@ -64,6 +66,7 @@
# define index_FMA_Usable FEATURE_INDEX_1*FEATURE_SIZE
# define index_FMA4_Usable FEATURE_INDEX_1*FEATURE_SIZE
# define index_Slow_SSE4_2 FEATURE_INDEX_1*FEATURE_SIZE
+# define index_Is_Silvermont FEATURE_INDEX_1*FEATURE_SIZE
#else /* __ASSEMBLER__ */
@@ -163,6 +166,8 @@ extern const struct cpu_features *__get_cpu_features (void)
# define index_FMA_Usable FEATURE_INDEX_1
# define index_FMA4_Usable FEATURE_INDEX_1
# define index_Slow_SSE4_2 FEATURE_INDEX_1
+# define index_Is_Silvermont FEATURE_INDEX_1
+
# define HAS_ARCH_FEATURE(name) \
((__get_cpu_features ()->feature[index_##name] & (bit_##name)) != 0)
@@ -174,5 +179,6 @@ extern const struct cpu_features *__get_cpu_features (void)
# define HAS_AVX HAS_ARCH_FEATURE (AVX_Usable)
# define HAS_FMA HAS_ARCH_FEATURE (FMA_Usable)
# define HAS_FMA4 HAS_ARCH_FEATURE (FMA4_Usable)
+# define IS_SILVERMONT HAS_ARCH_FEATURE (Is_Silvermont)
#endif /* __ASSEMBLER__ */
--
1.8.4.rc3
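
For reference, the C side of the header makes the new bit testable the
same way as the existing flags, so an ifunc resolver written in C
could select a variant with it. A sketch only; __memcpy_silvermont and
__memcpy_ssse3 are placeholder names, not existing routines:

#include <stddef.h>
#include "init-arch.h"

extern void *__memcpy_silvermont (void *, const void *, size_t);
extern void *__memcpy_ssse3 (void *, const void *, size_t);

typedef void *(*memcpy_fn) (void *, const void *, size_t);

/* IS_SILVERMONT expands to HAS_ARCH_FEATURE (Is_Silvermont) and
   tests the feature bit set in __init_cpu_features above.  */
static memcpy_fn
memcpy_resolver (void)
{
  if (IS_SILVERMONT)
    return __memcpy_silvermont;   /* Switch to rep movsq at 4096.  */
  return __memcpy_ssse3;
}

void *memcpy (void *, const void *, size_t)
  __attribute__ ((ifunc ("memcpy_resolver")));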