This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp


On 01/31/2014 06:19 PM, Ondřej Bílka wrote:

I sent a new implementation three months ago but did not have time to
ping it, because I had posted a replacement here:

https://sourceware.org/ml/libc-alpha/2013-09/msg00556.html

Could you also test which of these is faster?

The main benefit comes from the lack of length-specific special cases. This means it's considerably faster on my /usr/share/dict/words sorting benchmark and sorting random binary blobs of random length. If the length is constant across calls and the indirect jump is predicted correctly, it is less clear which version is faster. A pairwise t-test against the BSWAP version on the benchtest output does not yield conclusive results, although looking at the data suggests that yours is faster.

I didn't try to exercise the page-crossing code path, although its byte-wise comparisons would be problematic from a timing oracle point of view.

In case BSWAP turns out to be too slow for some CPUs, it could be
replaced with BSF and a shift (both constant time in current
processors), like we already do in some of the existing memcmp
implementations.
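As a rough illustration of the two strategies being compared (a C sketch on an assumed little-endian machine, not the actual glibc assembly): once a differing 64-bit word is found, BSWAP restores memory order so a plain unsigned compare gives the result, while BSF locates the first differing byte directly.

```c
#include <stdint.h>

/* BSWAP approach: byte-swap both words so that memory order becomes
   numeric order, then compare as unsigned integers.  */
static int cmp_bswap (uint64_t a, uint64_t b)
{
  a = __builtin_bswap64 (a);
  b = __builtin_bswap64 (b);
  return a < b ? -1 : a > b;
}

/* BSF approach: bit-scan the XOR to find the first differing byte
   (lowest bit on little-endian = first byte in memory), then shift it
   down and subtract.  Both BSWAP and BSF run in constant time on
   current processors.  */
static int cmp_bsf (uint64_t a, uint64_t b)
{
  uint64_t diff = a ^ b;
  if (diff == 0)
    return 0;
  unsigned int shift = __builtin_ctzll (diff) & ~7u;
  return (int) ((a >> shift) & 0xff) - (int) ((b >> shift) & 0xff);
}
```

Both return the same sign as memcmp would on the underlying bytes; the question in this thread is only which is cheaper on a given CPU.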

In my implementation I use a test of the first 16 bytes.

It really helps when you can read a few bytes past the end and then test
whether the match lies within bounds. I originally did that, but it did
not improve performance because of the mispredictions those checks
caused. What I have now comes from studying the input. In practice, a
memcmp (x, y, n) that ends early typically has n < 16, so in the bitmask
of mismatches you can just set the (n-1)-th bit to one.
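My reading of that trick, sketched with SSE2 intrinsics (not the posted patch): force bit n-1 on in the pmovmskb mismatch mask, so a single bit scan returns either the first real mismatch or the last byte of the short buffer, with no separate length check. It assumes n is between 1 and 16 and that 16 bytes are readable at both pointers, which is exactly the over-read question raised in the reply.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* For 1 <= n <= 16: index of the first differing byte, or n - 1 if the
   first n bytes are equal.  Assumes 16 bytes are readable at x and y.  */
static size_t
first_mismatch_or_end (const unsigned char *x, const unsigned char *y,
                       size_t n)
{
  __m128i a = _mm_loadu_si128 ((const __m128i *) x);
  __m128i b = _mm_loadu_si128 ((const __m128i *) y);
  /* Bit i is set iff byte i differs.  */
  unsigned int mask
    = ~(unsigned int) _mm_movemask_epi8 (_mm_cmpeq_epi8 (a, b)) & 0xffffu;
  /* Pretend byte n-1 differs, so the scan never runs past the end.  */
  mask |= 1u << (n - 1);
  return (size_t) __builtin_ctz (mask);
}
```

The caller then only has to compare the two bytes at the returned index; equality there means the buffers matched for all n bytes.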

Reading past the end is impossible because of mmap. If we're willing to run with a libc-supplied SIGSEGV handler, we could do it, but that would be kind of surprising to application code (even if the handler passed on unrecognized SIGSEGV to handlers). We already tweak sigprocmask etc. when threads are enabled, so there is some precedent, but all this seems rather brittle to me.

(If it weren't for mmap, we could make sure that we never hand out memory close to the end of an allocated arena on the heap.)

When you use memcmp in a search and 90% of the time the mismatch is in
the first byte, then adding that test will improve the total time.
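For illustration, the kind of first-byte fast path being described might look like this hypothetical wrapper (not proposed code):

```c
#include <string.h>

/* Early-bailout sketch: when most calls in a workload mismatch in the
   first byte, one cheap compare settles them before the general
   routine is entered.  */
static int
memcmp_first_byte_fast (const void *a, const void *b, size_t n)
{
  const unsigned char *x = a, *y = b;
  if (n != 0 && x[0] != y[0])
    return x[0] - y[0];
  return memcmp (a, b, n);
}
```

Whether this pays off depends entirely on the mismatch distribution of the workload, and it leaks the mismatch position through timing.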

I tried to add such an early bailout to the BSWAP version, and it wasn't a win. Obviously, such an early check is totally incompatible with my goal of addressing timing oracles.

If such a change is acceptable in principle, I will work on similar
adjustments to the other x86 implementations.

I would rather improve the new implementation.

Your version definitely looks promising. I think I could live with it if we got rid of the bytewise comparisons in the cross-page case.

Do you have an idea how to benchmark the cross-page case?

--
Florian Weimer / Red Hat Product Security Team

