
Re: DT_GNU_HASH: reducing working set ...


Hi there,

On Thu, 2006-06-29 at 21:39 +0200, Jakub Jelinek wrote:
> More comments later, but as you can see in the patch, undefined
> symbols are never added to .gnu.hash chains, only defined ones

	Ah - cool :-) thanks for that.

> Jakub wrote in reply to:
> On Thu, Jun 29, 2006 at 07:27:06PM +0100, Michael Meeks wrote:
> >       1. Extensibility
>
> On the other side, it is IMHO much better to keep using one
> section for one purpose, as ELF has always been doing.

	Yes - you're quite right - it's crack smoking to want to put -Bdirect
stuff in that section anyway, as you say: a few minutes' thought in a
less excited state convinced me of that [ sadly while off-line ].

	OTOH - I do think that allowing the stealing of some bits from the hash
code [ for other purposes ] in future is prudent, and need not perform
badly:

	if (((new_hash ^ l->chain[foo]) & l->new_hash_mask) == 0)
		... hit.

	etc. shouldn't be much slower than a straight comparison, surely?

> Applications keep growing every year and while I currently counted
> ~ 500000 different symbol names, it can soon double, triple and keep
> increasing. 

	From what I can see, your method was to grab ~all the symbols on your
system and compare hashes for all of them - and we only got <100
collisions? I take Ulrich's point that perhaps many of these are in the
same module [ it would be interesting to see ] due to highly similar
names etc., but even so this is a tiny collision rate.

>	And, all the linker can do for this is try to keep all the chains
> short enough to fit into cache lines.

	So - perhaps you can help me here; I've been working with a mental
model of several levels of cache - whereby the more you can compress the
working set, the faster access will be. Now I appreciate the cache-line
argument - that after the first access, reads inside the same cache line
are really fast [ and (I infer from Ulrich's post) adjacent cache-line
reads are free ].

On Fri, 2006-06-30 at 12:36 -0700, Ulrich Drepper wrote:
> On the other hand, what is gained?  You already said we have rather
> short hash chains.  Most likely not longer than 5 or 6.  I can count the
> number of DSOs with longer chains on my hands and there is not more than
> one longer chain in each DSO.

	Jakub recently posted this sort of thing:

[snip]
libstdc++.so.6.0.8

Histogram for `.gnu.hash' bucket list length (total of 1022 buckets):
 Length  Number     % of total  Coverage
      0  52         (  5.1%)
      1  131        ( 12.8%)      4.1%
      2  225        ( 22.0%)     18.4%
      3  233        ( 22.8%)     40.5%
      4  171        ( 16.7%)     62.2%
      5  115        ( 11.3%)     80.4%
      6  64         (  6.3%)     92.5%
      7  18         (  1.8%)     96.5%
      8  8          (  0.8%)     98.5%
      9  4          (  0.4%)     99.7%
     10  1          (  0.1%)    100.0%
...
For libc-2.4.90.so:
...
Histogram for `.gnu.hash' bucket list length (total of 1011 buckets):
 Length  Number     % of total  Coverage
      0  124        ( 12.3%)
      1  254        ( 25.1%)     12.4%
      2  296        ( 29.3%)     41.2%
      3  198        ( 19.6%)     70.2%
      4  98         (  9.7%)     89.3%
      5  29         (  2.9%)     96.4%
      6  10         (  1.0%)     99.3%
      7  2          (  0.2%)    100.0%
[snip]

	But sure - long chains should be unusual. It's perhaps interesting to
consider where the performance balance lies between expanding the base
.hash [ to get more 0xffffffff empty-bucket hits ] and shrinking it so
that more of it fits into L1 cache.

>   Now do the math: even assuming even distribution of the chain we have
> on average 6 chain elements per cache line.  If the linker sorts them
> cache lines we can fit in 14.  That's enough even for the worst chain
> length.

	By "sorts them cache lines" I assume you mean adding padding [perhaps undef]
records / alignment to .dynsym to maximise the length of chain inside
cache lines ? that could certainly be done later; interesting.

> So, on the one hand you save one 4-byte word which in almost no case
> will cause an additional cache line to be loaded.  And even if it would,
> the cache lines are consecutive and the processor's prefetching takes
> care of it.  On the other hand you more than double the likelihood to
> have to compare strings.  This is IMO reason enough to not change the
> format.

	Nah - let me do the maths another way and perhaps I'll get the result
I'm looking for ;-) Crunching the above numbers that Jakub provided, I
see an avg. no. of comparisons per chain of:

libstdc++	3.09		# sum(count*length)/sum(count)
glibc		2.03

	So a finger-in-the-air avg. chain length is perhaps ~2.5. [1]
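	[ For reference - those averages are just sum(count*length)/sum(count)
over the histogram buckets; a quick stand-alone check, nothing more: ]

	#include <stdio.h>

	/* recompute the averages from Jakub's bucket-length histograms */
	static double
	avg (const int *count, int nlengths)
	{
	  long sum = 0, n = 0;
	  int len;
	  for (len = 0; len < nlengths; len++)
	    {
	      sum += (long) count[len] * len;
	      n += count[len];
	    }
	  return (double) sum / (double) n;
	}

	int
	main (void)
	{
	  static const int libstdcxx[] = { 52, 131, 225, 233, 171, 115,
	                                   64, 18, 8, 4, 1 };
	  static const int libc[] = { 124, 254, 296, 198, 98, 29, 10, 2 };
	  printf ("libstdc++ %.2f\n", avg (libstdcxx, 11));	/* 3.09 */
	  printf ("glibc     %.2f\n", avg (libc, 8));		/* 2.03 */
	  return 0;
	}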

	My proposed alternate design sacrifices 1 bit of the hash; let's (for
the sake of argument) assume this has the most awful performance
degradation of a factor of 10 (instead of perhaps ~2-4).

	The proposed alternate design ('stolen bit') is engineered thus [ see
the sketch below ]:

	* By sorting ~all not-hashed .dynsym entries to the 'end' we end
	  up with contiguous runs of .dynsym entries [ as now ] in
	  chains, adjacent to each other. Thus - by stealing a
	  terminator bit we avoid needing a chain length. [ the 'offset'
	  in the .hash can then just be a .dynsym index ]
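	[ To make that concrete - a minimal sketch of how a lookup might walk
such a chain. The layout and names (bucket[], chain[], symoffset, the
low bit as the stolen terminator, 0xffffffff for an empty bucket) are
hypothetical; nothing here is pinned down yet: ]

	#include <elf.h>
	#include <stddef.h>
	#include <string.h>

	static const Elf32_Sym *
	lookup (Elf32_Word hash, const char *name,
	        const Elf32_Word *bucket, Elf32_Word nbuckets,
	        const Elf32_Word *chain, Elf32_Word symoffset,
	        const Elf32_Sym *symtab, const char *strtab)
	{
	  Elf32_Word symidx = bucket[hash % nbuckets];
	  if (symidx == 0xffffffff)	/* empty bucket: symbol not present */
	    return NULL;
	  for (;;)
	    {
	      Elf32_Word c = chain[symidx - symoffset];
	      /* compare every hash bit except the stolen terminator bit */
	      if (((c ^ hash) & ~1u) == 0
	          && strcmp (name, strtab + symtab[symidx].st_name) == 0)
	        return &symtab[symidx];		/* hit */
	      if (c & 1)			/* stolen bit set: end of chain */
	        return NULL;
	      symidx++;				/* entries are contiguous */
	    }
	}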

	So - let's also assume we have 10^6 unique symbols across a few
libraries - just for argument.

	It seems then that the likelihood (with a purely random access pattern)
of a cache hit [ ie. not taking an L1/L2 cache miss ;-] is ~=
sizeof(cache)/sizeof(data_set).

	With a 64KB L1 cache and a 4MB working set (stolen-bit) this gives a
1.6% hit rate. [ for a large L2 of course it's higher, but let's look at
the worst case, the smallest cache, where this effect is lowest ]

	If we add a 4-byte chain-length word per ~2.5 records [ as per the
existing 'chain-len' design ], our ~400k chains add 1.6MB (a factor of
1.4) for a 5.6MB working set. This gives a lower cache hit rate of 1.1%.
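	[ A quick back-of-the-envelope check of those two percentages, using
decimal K/M units to match the rounding above: ]

	#include <stdio.h>

	int
	main (void)
	{
	  double l1 = 64e3;			/* 64KB L1 */
	  double ws_stolen = 1e6 * 4;		/* 4MB: one hash word / symbol */
	  double ws_chain = ws_stolen
	    + (1e6 / 2.5) * 4;			/* + one length word / chain */
	  printf ("stolen-bit %.1f%%\n", 100 * l1 / ws_stolen);	/* 1.6% */
	  printf ("chain-len  %.1f%%\n", 100 * l1 / ws_chain);	/* 1.1% */
	  return 0;
	}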

	But - of course, as you point out there is a penalty for hash
collisions - by reducing precision we lose on comparison:

	32 bits	- 30/530000 = 0.005% false positives
	31 bits	- 10x worse = 0.05%

	So - taking our 1 million record hypothetical set:

		L1 hits		false positives
chain-len	11k		56
stolen-bit	16k		560
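	[ The false-positive column is just the measured 32-bit rate scaled up
by my pessimistic 10x assumption for the 31-bit case - rounded to the
56/560 above: ]

	#include <stdio.h>

	int
	main (void)
	{
	  double rate32 = 30.0 / 530000.0;	/* measured collision rate */
	  printf ("chain-len  (32 bits) ~%.0f\n",
	          1e6 * rate32);		/* ~57, i.e. the 56 above */
	  printf ("stolen-bit (31 bits) ~%.0f\n",
	          1e6 * rate32 * 10);		/* ~566, i.e. the 560 above */
	  return 0;
	}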

	So - if we assume a (conservative) cost of ~10 L2 cache misses per
false positive [ it's more like 3 ] then this seems reasonable: the ~5k
extra L1 hits roughly pay for the ~500 extra false positives even at
that pessimistic cost.

	And of course each false positive probably costs rather less than the
10 L2 cache misses needed to do the string compare (right?). As the
cache grows the assumptions start to fail, but with a 2MB L2 cache:

            L1 hits (64KB)   L2 hits (2MB)   false positives
chain-len   11k              ~360k           56
stolen-bit  16k              ~500k           560

	So now a false positive needs to generate ~280 L2 misses [ ~(500k -
360k) / ~500 ] before the (existing) chain-len solution reaches
performance parity.
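	[ Checking that parity figure the same way - decimal units again, so
the numbers land a shade off the rounded ones above: ]

	#include <stdio.h>

	int
	main (void)
	{
	  double l2 = 2e6;			/* 2MB L2 */
	  double hit_stolen = 1e6 * l2 / 4e6;	/* ~500k L2 hits */
	  double hit_chain = 1e6 * l2 / 5.6e6;	/* ~357k, the ~360k above */
	  double parity = (hit_stolen - hit_chain) / (560 - 56);
	  printf ("L2 hits: stolen-bit ~%.0fk, chain-len ~%.0fk\n",
	          hit_stolen / 1e3, hit_chain / 1e3);
	  printf ("parity: ~%.0f L2 misses per false positive\n",
	          parity);			/* ~283, i.e. the ~280 above */
	  return 0;
	}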

	Of course, it's entirely probable that my numbers, assumptions,
calculations and raison d'être are entirely incorrect :-) And wrt. data
set size, much more than just searching .dynsym goes on at link time,
which dilutes the argument [etc.] Is any of it even slightly convincing?

	Either way - since I now have time, if you're interested I'm happy to
try to hack this up so it can be profiled both ways; presumably
cachegrind can give a nice repeatable answer. If cachegrind shows it
being more cache efficient - would you consider a rework / simplification?

	Thanks,

		Michael.

[1] - of course assuming that we almost never hit - which I believe is
the common case.
-- 
 michael.meeks@novell.com  <><, Pseudo Engineer, itinerant idiot


