This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.
On Tue, 2008-10-28 at 17:00 -0700, Min Zhang wrote:
I am using ARMv6 on an omap2430 board. I believe the architecture reference is http://www.arm.com/miscPDFs/14128.pdf. Here is the /proc/cpuinfo:
This patch improves the execution time of memset. Tested with the "time" shell utility on the following test program; the patch reduced execution time by 50%. I also sanity-tested memset with lengths from 0 to 1000 bytes, just to make sure it doesn't set any extra or missing bytes.
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *p = malloc(4096);
    for (int i = 0; i < 100000; i++)
        memset(p, 0, 4096);
    return 0;
}
Note: this patch sort of undoes the rev 1.5 change (http://sources.redhat.com/cgi-bin/cvsweb.cgi/ports/sysdeps/arm/memset.S.diff?r1=1.4&r2=1.5&cvsroot=glibc) by reverting the "str" stores back to the more efficient block-store "stm" instruction. I am not sure of the reason behind the rev 1.5 change.
What CPU are you benchmarking on? I think the reason for the rev 1.5
change was that, on some processors (particularly StrongARM and/or
XScale), a two-word STM is slower than two STRs under most common
circumstances. If I remember right, STM is 50% slower than STR+STR
on XScale if the writes hit in the cache.
Same result: STM is faster. I reduced the length to memset(p, 0, 128), assuming 128 bytes should be small enough to stay in the dcache if I run it in a while(1) loop. I also tried a bigger length, memset(p, 0, 32k*16), assuming none of it will fit in a 32KB dcache; STM is still faster.

The two circumstances I can think of where your change might be a win are:
- cpus with no icache, where reducing the number of i-fetches is important (presumably not the case for you); or
- cpus whose dcache allocates only on reads (most ARMs are like this) and where STM gives you better external bus utilisation than STR+STR in the case of a miss (I'm not sure offhand on what processors this is true).
Can you try re-benchmarking your change against cached data to see what happens there?
p.