
newlib/libc/machine/sh/memcpy.S pessimization


I took a close look at newlib/libc/machine/sh/memcpy.S recently,
and I noticed it has been pessimized for the SH4.

For this comparison, the relevant SH4 architectural details are:

o In-order dual-issue processor
o Conceptually has 6 functional units (MT, EX, BR, LS, FE, and CO)
o Two MT group instructions can be issued simultaneously
o CO group instructions cannot pair with other instructions
o Other instruction combinations pair as long as the two insns
  do not use the same functional unit (illustrated just after
  this list)
o Has separate direct-mapped I & D caches (8k and 16k respectively)
   with no victim cache
o Has no snooping logic on the write queue
  (i.e. must flush write queue prior to a read)
o Embedded DRAM controller supports RAS-down mode
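
As a quick, hypothetical illustration of the pairing rules (these
fragments are made up for this post, not taken from memcpy.S, and
assume the usual SH7750 grouping tables):

	mov.l	@r5+,r1		! LS group \ different units, so both
	add	#8,r4		! EX group / can issue in the same clock

	mov	r2,r0		! MT group \ MT is the exception: two
	mov	r7,r3		! MT group / MT insns can still pair

	add	#8,r4		! EX group \ same unit, so the xtrct
	xtrct	r1,r0		! EX group / has to wait one clock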

This is the original code in newlib which I wrote about two years ago
for the little-endian x10 -> x00 copy loop:

copy_10_00_loop:
	mov.l	@r5+,r1	
	dt	r7
	mov.l	@r5+,r2
	xtrct	r1,r0
	mov.l	r0,@r4
	xtrct	r2,r1
	mov	r2,r0
	mov.l	r1,@(4,r4)
	bf/s	copy_10_00_loop
	add	#8,r4

Here is a quick analysis of the pipelines while this loop is executing:

CLK 1:	mov.l	@r5+,r1		; LS group, issue 1, latency 2
	dt	r7		; EX group, issue 1, latency 1

CLK 2:	mov.l	@r5+,r2		; LS group, issue 1, latency 2
	(xtrct	r1,r0)		; Stalled on r1 data dependency

CLK 3:	(stalled)
	(stalled)

CLK 4:	(stalled)
	xtrct	r1,r0		; EX group, issue 1, latency 1 

CLK 5:	mov.l	r0,@r4		; LS group, issue 1, latency 1
	xtrct	r2,r1		; EX group, issue 1, latency 1 

CLK 6:	mov	r2,r0		; MT group, issue 1, latency 1
	mov.l	r1,@(4,r4)	; LS group, issue 1, latency 1

CLK 7:	bf/s	copy_10_00_loop	; BR group, issue 1, latency 2
	add	#8,r4		; EX group, issue 1, latency 1 (delay slot)

So, this loop requires 7 clock cycles per iteration to execute; with
the delay slot of the bf/s filled, the taken branch does not cost an
extra issue cycle here.

This is the code currently in newlib:

	mov.l	@r5+,r2
	xtrct	r2,r1
	mov.l	r1,@(r0,r5)
	cmp/hs	r7,r5
	mov.l	@r5+,r1
	xtrct	r1,r2
	mov.l	r2,@(r0,r5)
	bf	L_cleanup

Here is a quick analysis of the pipelines while this loop is executing:

CLK 1:	mov.l	@r5+,r2		; LS group, issue 1, latency 2
	(xtrct	r2,r1)		; Stalled on r2 data dependency

CLK 2: 	(stalled)
	(stalled)

CLK 3:	(stalled)
	xtrct	r2,r1		; EX group, issue 1, latency 1

CLK 4:	mov.l	r1,@(r0,r5)	; LS group, issue 1, latency 1
	cmp/hs	r7,r5		; MT group, issue 1, latency 1

CLK 5:	mov.l	@r5+,r1		; LS group, issue 1, latency 2
	(xtrct	r1,r2)		; stalled - data dependency on r1

CLK 6:	(stalled)
	(stalled)

CLK 7:	(stalled)
	xtrct	r1,r2		; EX group, issue 1, latency 1

CLK 8:	mov.l	r2,@(r0,r5)	; LS group, issue 1, latency 1
	bf	L_cleanup	; BR group, issue 1, latency 2

CLK 9:	(branching)
	(branching)

This code takes 9 clocks per loop iteration.
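
If the cycle counts above are accurate, then since each iteration of
either loop copies 8 bytes, inner-loop throughput drops from 8/7
(about 1.14) bytes/clock to 8/9 (about 0.89) bytes/clock; in other
words, the new loop spends roughly 29% more clocks per byte copied,
before any of the memory-system effects below are even considered.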

Also, this new code performs read/write/read/write instead of
read/read/write/write (sketched after the list below). This is
extremely bad because:

o The SH4 has a non-snooped write queue.

  This means that on a read, the SH4 cannot check the write
  queue to see whether the read address has a pending write.

  So, on a read instruction, the SH4 must ensure the write
  queue is empty by forcing a flush of the write queue.

  This can cause the read instruction to stall more than
  the raw specified latency, if I'm not mistaken.

  The new code forces a write queue flush twice per loop
  iteration, whereas the old code only forced it once
  per loop iteration.

o The SH4 has a direct-mapped data cache.

  If the source and destination map to the same cache line
  (i.e. the addresses are a multiple of 16k apart) and the caches are
  in write-back mode, then the new loop will force a cache line
  thrash twice per loop instead of once.

  Note that this is not a problem in write-through mode, but
  since write-back mode is typically 2x faster for applications
  than write-through mode, I don't think anyone is using the
  write-through cache-mode.

o The SH2/3/4 embedded DRAM controllers support RAS-down mode

  Electrically, DRAM is organized in a two-dimensional matrix of bits.
  To reduce the number of address pins required to address the large 
  number of bits, the address is loaded in two stages.

  The /RAS line (Row Address Strobe) loads the row from the address
  pins, and the /CAS line (Column Address Strobe) loads the column
  from the address pins.

  If the current access to DRAM is in the same row as the previous
  access, then the DRAM controller can strobe /CAS only and avoid
  reloading the row address. This is called "RAS-down mode" in Hitachi
  parlance (I think everyone else calls it "page mode").

  Typically, each row is 1kbyte - 8kbytes long.

  Essentially, if the source and destination addresses of the memcpy
  are on the same DRAM device but in different rows (pages), then the
  new code will cause DRAM RAS thrashing twice per iteration instead
  of once.
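
To make the access-ordering point concrete, here is a schematic of
the two patterns (hypothetical register usage, not the actual code in
memcpy.S), taking the write-queue behavior described above at face
value:

	! read/read/write/write (old loop's pattern):
	mov.l	@r5+,r1		! read  - write queue drained here, at most once
	mov.l	@r5+,r2		! read  - queue already empty
	mov.l	r1,@r4		! write - enters the write queue
	mov.l	r2,@(4,r4)	! write - enters the write queue

	! read/write/read/write (new loop's pattern):
	mov.l	@r5+,r1		! read  - drain the write queue
	mov.l	r1,@r4		! write - refill the write queue
	mov.l	@r5+,r2		! read  - drain the write queue again
	mov.l	r2,@(4,r4)	! write

The same doubling applies to the cache-line and DRAM-row effects
described above.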

I haven't checked, but I believe both the new code and the old code
execute in 11 clocks on the SH2 and SH3, so I don't believe the new
code has any benefit for the SH2 or SH3 either.

If there is a request for a clock-by-clock analysis on the SH2/SH3,
I can download the databooks and post one.

Toshi

