
Re: avl-trees vs. hashes


Maciej Stachowiak <mstachow@mit.edu> writes:

 > If you grow by a factor of K>1 only whenever the hashtable has more
 > than E/F (where E is the maximum number of elements and F >= 1)
 > elements, and shrink by a factor of M>1 whenever the hashtable has
 > fewer than E/G elements where G>F, I believe you will get amortized
 > constant time (assuming your hash function distributes well enough
 > that collisions are not a significant factor). F and G should
 > obviously be picked such that the difference between thresholds is
 > considerably less than one element for common sizes, or you will get
 > crappy performance anyway for the sizes you care about, regardless of
 > the asymptotic complexity.

I take it you meant to grow the table by a factor of K > 1 when

   N_elements/N_buckets = avg_bucket_size > MAX_AVG_BUCKET_SIZE

and to shrink it by K when

   N_elements/N_buckets = avg_bucket_size < MIN_AVG_BUCKET_SIZE,

and that we want (obviously):

   MIN_AVG_BUCKET_SIZE < MAX_AVG_BUCKET_SIZE

It looks like one should have:

   MIN_AVG_BUCKET_SIZE * K^2 = MAX_AVG_BUCKET_SIZE

otherwise, since shrinking by K multiplies the average bucket size by
K (leaving it near MIN_AVG_BUCKET_SIZE * K), the average might land
close enough to MAX_AVG_BUCKET_SIZE that inserting a small number of
items makes the table grow again.  With the equality above, a shrink
leaves the average at MAX_AVG_BUCKET_SIZE / K - exactly the state the
table is in just after it grows.
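
To make the thresholds concrete, here's a minimal sketch in Python
(hypothetical names - it models only the resize decision, not the
hashing itself).  The shrink test uses <= so that an average sitting
exactly at MIN_AVG_BUCKET_SIZE triggers a resize, as in the examples
below:

    class HashTableSizer:
        # Models only the resize policy; we want max_avg = min_avg * k**2.
        def __init__(self, min_avg, max_avg, k, n_buckets):
            self.min_avg, self.max_avg, self.k = min_avg, max_avg, k
            self.n_buckets, self.n_items, self.resizes = n_buckets, 0, 0

        def insert(self):
            self.n_items += 1
            if self.n_items / self.n_buckets > self.max_avg:    # grow by K
                self.n_buckets *= self.k
                self.resizes += 1

        def delete(self):
            self.n_items -= 1
            if (self.n_buckets > 1
                    and self.n_items / self.n_buckets <= self.min_avg):
                self.n_buckets //= self.k                       # shrink by K
                self.resizes += 1

    def churn(sizer, pairs=10):
        # Alternate a delete with an insert at the boundary; count resizes.
        before = sizer.resizes
        for _ in range(pairs):
            sizer.delete()
            sizer.insert()
        return sizer.resizes - before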

Example 1 - MIN_AVG_BUCKET_SIZE = 2, MAX_AVG_BUCKET_SIZE = 4, K = 2.
Suppose a deletion has caused the hash table to have 256 items & 128
buckets yielding an average bucket size of 2.  The hash table will
resize to 64 buckets and the average bucket size will be 4.  The next
insertion will make the average bucket size greater than
MAX_AVG_BUCKET_SIZE, causing the hash table to double in size again,
growing to 128 buckets.  This can repeat forever, causing two resizes
(a shrink and a grow) for every delete/insert pair at any given hash
table size.  This is bad.
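
Replaying Example 1 against the sketch above (again hypothetical
code, just counting resizes):

    # One deletion away from the shrink threshold: 257 items, 128 buckets.
    bad = HashTableSizer(min_avg=2, max_avg=4, k=2, n_buckets=128)
    bad.n_items = 257
    print(churn(bad, pairs=10))   # => 20; each pair costs a shrink + a grow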

Example 2 - MIN_AVG_BUCKET_SIZE = 1, MAX_AVG_BUCKET_SIZE = 4, K = 2.
Now when the MIN_AVG_BUCKET_SIZE is hit, halving the size of the hash
table will yield an average bucket size of 2.  The hash table is in
the same state as after the previous doubling of the table size.  The
hash table won't resize again until roughly (2 x number_of_buckets)
further insertions.  This is good.
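
Running the same churn with Example 2's thresholds, the table shrinks
once and then stays put:

    good = HashTableSizer(min_avg=1, max_avg=4, k=2, n_buckets=128)
    good.n_items = 129
    print(churn(good, pairs=10))  # => 1; only the initial shrink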

And, yes - when resizing based on the maximum bucket size (rather
than the average) it's probably much easier to get into a bad state
like the one in Example 1.

-- 
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il