This is the mail archive of the guile@cygnus.com mailing list for the guile project.
On 29 Oct 1998, Harvey J. Stein wrote:

> Jay Glascoe <jglascoe@jay.giss.nasa.gov> writes:
>
> > The magic ratio is chosen so that the expected mean nonempty bucket
> > size after upsize is the same as it is after downsize. (And then,
> > e.g., if a user inserts just enough entries for the table to grow to
> > 4096 buckets, he must delete half of them before it will shrink back
> > down to 2048.)
>
> Yes, this is the property you want, which would be that
> min_avg_bucket_size = max_avg_bucket_size/growth_factor^2. So why do
> you say 3/4 is too small? If the avg bucket size is 3/4 and you halve
> the table size, then the mean bucket size will be 3/2 = what you have
> after doubling when you've hit the max_avg_bucket_size of 3.

I think my magic ratio is basically an estimate of

    3 * 1/2 * (number nonempty buckets) / (number buckets)

at the time when (number entries) / (number nonempty buckets) exceeds 3.

> The thing I'm most concerned about is averaging the nonempty bucket
> sizes instead of all the bucket sizes. If you resize on average
> bucket size, then in the worst case (where everything hashes to the
> same bucket) you'll have 1 bucket with 3*N items and a hash table of
> size N, yielding O(N) lookup times. If you resize on average
> *nonempty* bucket size, then your worst case is O(2^N)!!! This is the
> difference between bad performance in the worst case and crashing with
> out of memory in the worst case. If the input data all lands in one
> bucket, then once that bucket passes max_avg_bucket_size, you'll
> double the hash table for every insertion.

Well, *sigh*, maybe you're right. It's a good thing I haven't etched my
C code in stone ;)

I must admit, that worst worst case has happened to me a number of
times when I was using the Guile hasher on anything other than numbers.
The string hasher does this:

    h = ((h << 8) + ((unsigned) (scm_downcase (str[h % len])))) % n;

That "h << 8" is very unfriendly to hash tables whose sizes are powers
of 2.
I don't even want to talk about how Guile hashes real numbers. (HINT:
it converts them to strings...!)

> This will happen (slightly slower) whenever you end up with a small
> bounded number of hash values.
>
> When the hash table is well behaved it won't make a difference. When
> the hash table is missing buckets (probably common), using
> nonempty_bucket_size_average will resize a little sooner, but it
> probably doesn't make much of a difference. When the hash function is
> clustering, then using nonempty_bucket_size_average will rehash much
> sooner, keeping the alists shorter, but at the expense of having a
> larger vector. In the worst cases, using nonempty_bucket_size_average
> causes catastrophic failure whereas using bucket_size_average will be
> slow but not catastrophic. Is the probably marginal speedup worth the
> risk of catastrophic failure?

*sigh*

Jay
jglascoe@jay.giss.nasa.gov