Optimising std::find on x86 and PPC

Tue Dec 14 14:39:00 GMT 2004

Hello,

I recently tried changing the main loop of the std::find random-access 
overload from:

difference_type __trip_count = (__last - __first) >> 2;
for ( ; __trip_count > 0; --__trip_count)
  { if (*__first == __val) return __first; ++__first; (4 times) }

to:

Iterator __newlast = __last - (__last - __first) % 4;
for ( ; __first < __newlast; )
  { if (*__first == __val) return __first; ++__first; (4 times) }
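
For anyone who wants to try this outside the library, here is a minimal 
standalone sketch of the two variants. The names find_trip_count and 
find_newlast are made up for this illustration, and the leftover 0-3 
elements are handled with a plain loop for brevity, which is an assumption 
of the sketch rather than a copy of the library code.

```cpp
#include <iterator>

// Variant 1: loop on a precomputed trip count (number of groups of four).
template <typename RandomIt, typename T>
RandomIt find_trip_count(RandomIt first, RandomIt last, const T& val)
{
    typename std::iterator_traits<RandomIt>::difference_type
        trip_count = (last - first) >> 2;
    for (; trip_count > 0; --trip_count) {
        if (*first == val) return first; ++first;
        if (*first == val) return first; ++first;
        if (*first == val) return first; ++first;
        if (*first == val) return first; ++first;
    }
    // Remaining 0-3 elements.
    for (; first != last; ++first)
        if (*first == val) return first;
    return last;
}

// Variant 2: loop until an end iterator rounded down to a multiple of four.
template <typename RandomIt, typename T>
RandomIt find_newlast(RandomIt first, RandomIt last, const T& val)
{
    RandomIt newlast = last - (last - first) % 4;
    for (; first < newlast;) {
        if (*first == val) return first; ++first;
        if (*first == val) return first; ++first;
        if (*first == val) return first; ++first;
        if (*first == val) return first; ++first;
    }
    // Remaining 0-3 elements.
    for (; first != last; ++first)
        if (*first == val) return first;
    return last;
}
```

Both should return the same iterator for any valid range; the only 
difference is which quantity controls the unrolled loop.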

This knocked about 30% off the time taken on x86. (Note that in a final 
version I'd replace the % 4 with some kind of masking and/or shifting.)
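
As a rough sketch of the masking idea: for a non-negative distance n, 
n % 4 is the same as n & 3, so n - n % 4 can be written as n & ~3. The 
helper name round_down_to_4 is hypothetical, and the sketch assumes the 
distance is never negative, which holds for a valid [first, last) range.

```cpp
#include <cstddef>

// Round a non-negative distance down to a multiple of four by clearing
// the two low bits. Equivalent to n - n % 4 when n >= 0.
inline std::ptrdiff_t round_down_to_4(std::ptrdiff_t n)
{
    return n & ~static_cast<std::ptrdiff_t>(3);
}
```

The rounded end iterator would then be first + round_down_to_4(last - first).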

Unfortunately, a quick test on Mac OS X by Andrew Pinski (thank you!) 
found that this slightly worsened performance in terms of both space 
and time on the PPC, as this new version will no longer use the 
specialised "count" operator.

Seeing as so many functions use find internally, it seems silly not to 
include some kind of improvement. At the same time, of course, we 
don't want to damage the OS X performance. I suspect the reason the loop 
optimisers can't deal with this by themselves is that we manually unroll 
this loop four times, which I suspect confuses them.

If anyone with more knowledge than me knows either how I could tweak 
this code so it optimises well on both x86 and PPC, or how hard it might 
be to poke the optimiser so that one version of this code can be efficient 
on both processors, I'd be glad to hear it.

The last option, of course, is to start having different code for 
different processors. I'd imagine doing so would only be a route of 
last resort, however.

Chris