This is the mail archive of the guile@cygnus.com mailing list for the guile project.
> >          m       m           k-m
> > p(m) =  ( ) (1/n)   (1 - 1/n)
> >          k
>
> That should be k choose m, not m choose k, i.e. -
>
>          / k \      m         k-m
> p(m) =  |   | (1/n)   (1-1/n)
>          \ m /
>
> I assume it's just a typo.

Actually I coded it both ways, found that one gave numbers that looked
like what I wanted, then typed the wrong one into the report.  I have
trouble remembering phone numbers too.

> These numbers are off.  They should all be <= 1.

They're close enough; the Stirling formula is not a genuine
factorial <shrug>

> but for the hell of it I computed the above directly in guile:

Good to see.  I did mine in C using doubles, so guile bignums beat C
doubles for both accuracy and usable range.  I'll bet that my program
got the answer quicker though...

Actually, I'm glad that you posted the guile code because it is really
short and neat and a good example of solving math problems using
scheme.  Now I can check that when utilisation is constant, the
cumulative probability curve doesn't change with bucket sizes up to 100.

> > Given the low utilisation that such a table would have, the probability
> > of this happening with a hash function that approximates a random
> > variable is vanishingly small.  If it happened, I would be looking for
> > bugs elsewhere.  Much the same as when I hear a loud explosion I don't
> > think, ``what a surprise, by some amazing fluke a huge number of randomly
> > moving air molecules all struck my eardrum at exactly the same moment''.
>
> The two key assumptions are that the hash function approximates a
> random variable and that the input data is randomly distributed.
> Neither is really the case.

I see only one assumption here: the idea of a good hash algorithm is
that it converts non-random input data into what looks like a random
variable.  The distribution of the input data really should not matter.

> Have some pity on the poor guy who's hacking away & has a bug in his
> hashing function.
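The guile code being praised above isn't quoted in this message.  A
minimal sketch of computing p(m) with exact rationals (the names
`factorial', `choose' and `p' are mine, not from the original posting)
might look like:

```scheme
;; Probability that exactly m of k keys land in a given bucket of an
;; n-bucket table, assuming the hash function behaves like a uniform
;; random variable:
;;
;;   p(m) = C(k,m) (1/n)^m (1 - 1/n)^(k-m)
;;
;; Exact rationals throughout, so none of the Stirling/double rounding
;; mentioned above -- the probabilities really do sum to exactly 1.

(define (factorial x)
  (if (<= x 1)
      1
      (* x (factorial (- x 1)))))

(define (choose k m)
  (/ (factorial k)
     (* (factorial m) (factorial (- k m)))))

(define (p k n m)
  (* (choose k m)
     (expt (/ 1 n) m)
     (expt (- 1 (/ 1 n)) (- k m))))

;; e.g. 100 keys in 100 buckets (table utilisation 1.0): probability
;; that a given bucket holds exactly 2 keys -- roughly 0.185.
(display (exact->inexact (p 100 100 2)))
(newline)
```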
> Do you want to blow him out of the water with seg faults because of
> your resizing policy or should he just see linear performance?  I
> think it's important to use a method which is robust in the face of
> all inputs.
>
> > I think that growing based on the maximum bucket size should be OK
> > but attempting to keep constant utilisation is easier to implement
> > and makes the behaviour easier to understand.
>
> The fact is that they'll all be ok on average in practice because
> they'll all exhibit similar behavior on average.  Given that, it makes
> sense to choose the one which has the most robust worst case behavior,
> even if it's unlikely to occur in practice.

OK, how about this proposal (for those who panic about the worst case):

Define ``bucket vector utilisation'' as the number of non-empty buckets
divided by the total number of buckets (not a difficult number to keep
track of).  Determine an upper and a lower bound for bucket vector
utilisation: grow when you get bigger than the upper bound, shrink when
you get smaller than the lower bound (thus still giving some
hysteresis).

In the average case (based on the probability stuff) bucket vector
utilisation will be 1 minus the probability of an empty bucket.  From
Harvey's figures:

    table utilisation    bucket vector utilisation
          1.0                     0.635
          2.0                     0.867
          3.0                     0.952

So on average, choosing upper and lower bounds for bucket vector
utilisation is much the same as choosing upper and lower bounds for
table utilisation.  You have to make sure that your upper bound is less
than 1.0!  Finding really good values may take some fiddling, but you
have the average case covered and you have some theory to make you feel
righteous.

On the other hand, in the worst case of all keys falling into the same
bucket, the bucket vector utilisation won't change, so the table will
never grow (i.e. linear performance with no memory loss).

It should be pointed out that bucket vector utilisation can never be
greater than table utilisation.
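As a sketch of the average case and of the proposed policy (the value
of n behind Harvey's figures isn't stated, so n = 100 buckets is an
assumption that comes close to the table above, and the 1/2 and 9/10
bounds are made-up values, not a recommendation):

```scheme
;; Expected bucket vector utilisation for k keys in n buckets is
;; 1 minus the probability that a given bucket is empty:
;;   1 - (1 - 1/n)^k
;; which for table utilisation u = k/n tends to 1 - e^(-u) as n grows.

(define (bucket-vector-utilisation k n)
  (exact->inexact (- 1 (expt (- 1 (/ 1 n)) k))))

;; With n = 100 this gives roughly 0.634, 0.866, 0.951 for table
;; utilisations 1, 2, 3 -- close to Harvey's figures above.
(for-each
 (lambda (u)
   (display u)
   (display " -> ")
   (display (bucket-vector-utilisation (* u 100) 100))
   (newline))
 '(1 2 3))

;; The proposed policy then just compares the measured ratio against
;; fixed bounds, with hysteresis because the bounds differ.
(define (resize-action non-empty total)
  (let ((bvu (/ non-empty total)))
    (cond ((> bvu 9/10) 'grow)
          ((< bvu 1/2)  'shrink)
          (else         'leave))))
```

Note that in the all-keys-in-one-bucket worst case, non-empty stays at
1, so `resize-action' never says grow, as described above.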
They will be equal in the rather amazing case where every key falls
into a different bucket, which will result in a slightly under-utilised
table with very fast lookup.

Note that it may require more than one grow to GUARANTEE a reduction of
bucket vector utilisation, but limiting the system to a single grow for
each insert should still be sufficient.

- Tel