


Re: hash table subtleties


On 29 Oct 1998, Harvey J. Stein wrote:

> Jay Glascoe <jglascoe@jay.giss.nasa.gov> writes:
> 
>  > The magic ratio is chosen so that the expected mean nonempty bucket size
>  > after upsize is the same as it is after downsize.  (and then, e.g. if a
>  > user inserts just enough entries for the table to grow to 4096 buckets, he
>  > must delete half of them before it will shrink back down to 2048). 
> 
> Yes, this is the property you want, which would be that the
> min_avg_bucket_size = max_avg_bucket_size/growth_factor^2.  So why do
> you say 3/4 is too small?  If the avg bucket size is 3/4 & you halve
> the table size, then the mean bucket size will be 3/2 = what you have
> after doubling when you've hit the max_avg_bucket_size of 3.

I think my magic ratio is basically an estimate of

3 * 1/2 * (number nonempty buckets) / (number buckets)

at the time when

(number entries) / (number nonempty buckets) exceeds 3
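For concreteness, here's the general shape of the thing in C.  The names
are mine and this isn't my actual patch; the thresholds are the ones from
Harvey's relation min_avg = max_avg/growth_factor^2, and my variant
measures entries/nonempty instead of entries/buckets, which is what the
magic ratio above tries to compensate for:

#include <stddef.h>

#define GROWTH_FACTOR 2
#define MAX_AVG 3.0                          /* grow past this */
#define MIN_AVG (MAX_AVG / (GROWTH_FACTOR * GROWTH_FACTOR))  /* 3/4 */

/* min = max/growth^2 makes the post-resize averages match:
   doubling at avg 3 leaves 3/2, halving at avg 3/4 leaves 3/2. */
static size_t
next_size (size_t buckets, size_t entries)
{
  double avg = (double) entries / buckets;

  if (avg > MAX_AVG)
    return buckets * GROWTH_FACTOR;
  if (avg < MIN_AVG && buckets > 1)
    return buckets / GROWTH_FACTOR;
  return buckets;                /* no rehash needed */
}

Either way the table's mean bucket size lands at 3/2 right after every
resize, so growing and shrinking can't thrash against each other.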

> The thing I'm most concerned about is averaging the nonempty bucket
> sizes instead of all the bucket sizes.  If you resize on average
> bucket size then in the worst case (where everything hashes to the
> same bucket) you'll have 1 bucket with 3*N items & a hash table of
> size N, yielding O(N) lookup times.  If you resize on average
> *nonempty* bucket size, then your worst case is O(2^N)!!!  This is the
> difference btw bad performance in the worst case & crashing with out
> of memory in the worst case.  If the input data all lands in one
> bucket, then once that bucket passes max_avg_bucket_size, you'll
> double the hash table for every insertion.

Well, *sigh*, maybe you're right.  It's a good thing I haven't etched my
C code in stone  ;)   I must admit, that worst-of-the-worst case has
happened to me a number of times when I was using the Guile hasher on
anything other than numbers.  The string hasher does this:

h = ((h << 8) + ((unsigned) (scm_downcase (str[h % len])))) % n;

that "h << 8" is very unfriendly to hash tables with sizes powers of 2.
I don't even want to talk about how Guile hashes real numbers.  (HINT: it
converts them to strings...!)
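To see just how bad it gets, here's a toy version of that update rule,
applied to every character in order (not Guile's actual loop, which
samples str[h % len], and with tolower standing in for scm_downcase).
With n = 256 everything but the final character is shifted clean out:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

static unsigned
toy_hash (const char *str, unsigned n)
{
  size_t len = strlen (str);
  unsigned h = 0;
  for (size_t i = 0; i < len; i++)
    h = ((h << 8) + (unsigned) tolower ((unsigned char) str[i])) % n;
  return h;
}

int
main (void)
{
  const char *words[] = { "alpha", "beta", "gamma", "delta", "coda" };
  for (int i = 0; i < 5; i++)
    printf ("%-6s -> bucket %u\n", words[i], toy_hash (words[i], 256));
  return 0;
}

All five strings land in bucket 97 ('a').  Feed keys like these to a
table that resizes on the *nonempty* average and you get exactly the
double-on-every-insert behavior you describe above.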

> This will happen (slightly slower) whenever you end up with a small
> bounded number of hash values.
> 
> When the hash table is well behaved it won't make a difference.  When
> the hash table is missing buckets (probably common), using
> nonempty_bucket_size_average will resize a little sooner, but it
> probably doesn't make much of a difference.  When the hash function is
> clustering, then using nonempty_bucket_size_average will rehash much
> sooner, keeping the alists shorter, but at the expense of having a
> larger vector.  In the worst cases, using nonempty_bucket_size_average
> causes catastrophic failure whereas using bucket_size_average will be
> slow but not catastrophic.  Is the probably marginal speedup worth the
> risk of catastrophic failure?

*sigh*

	Jay
	jglascoe@jay.giss.nasa.gov