This is the mail archive of the guile@cygnus.com mailing list for the guile project.



Re: Scheme style auto-resizing hashtable


On Fri, 23 Oct 1998, Maciej Stachowiak wrote:

> 
> This isn't directly on-topic, but I was going to suggest earlier that
> you change the header alist to a vector despite your readability
> concerns; it should be trivial to write a hashtab-header-alist or
> hashtab-properties in Scheme to translate it, and the more I think
> about it, the more convinced I am it would be a noticeable performance
> win.
> 
yeah, I think you're right.  I replied earlier that I like being able to do
(car myhashtab) to see a table's stats, but I could just as easily do
(hashtab-stats myhashtab) and maybe throw in some real stats like the
mean/median bucket size, standard deviation, etc.

I think you're right about making the header a vector.  With the alist
header, I'm currently calling scm_sloppy_assq four times upon insertion of
a new key-value pair (five times if the current bucket size exceeds the
largest bucket size seen so far).
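
For concreteness, here's a rough sketch (not the actual code) of what the
vector header might look like, with a hashtab-stats procedure translating
it back into something readable.  The slot layout and procedure names are
made up for illustration:

  ;; slot layout (hypothetical):
  ;;   0: number of entries    1: number of buckets
  ;;   2: largest bucket size  3: the bucket vector itself
  (define (make-hashtab size)
    (vector 0 size 0 (make-vector size '())))

  ;; translate the vector header back into a readable alist
  (define (hashtab-stats h)
    (list (cons 'num-entries     (vector-ref h 0))
          (cons 'num-buckets     (vector-ref h 1))
          (cons 'max-bucket-size (vector-ref h 2))
          (cons 'mean-bucket-size
                (/ (vector-ref h 0) (vector-ref h 1)))))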

> Regarding this proposal, basically nothing can save you from a hash
> function that hashes a lot of things to the exact same value; no
> number of resizes will help. The only hope for salvation here is for
> hashing to significantly change when the size of the range changes, so
> that resizing will actually solve such collisions.

I've found that the expected maximum bucket size seems to grow like
log(number-entries).  I can't prove that, but I'm currently letting
max-bucket-size grow at that rate and it does seem to work well.
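
Roughly, the resize check I'm trying looks like this (same hypothetical
slot layout as the sketch above; the exact threshold is a guess):

  ;; resize when the largest bucket outgrows ~log(number-entries)
  (define (needs-resize? h)
    (let ((entries (vector-ref h 0))
          (biggest (vector-ref h 2)))
      (> biggest
         (+ 1 (inexact->exact (floor (log (+ entries 1))))))))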

> Another note: I use both eq? and equal? hashing in my code a fair bit.
> Also it would be nice if your code were able to hash by any of eq?,
> eqv? or equal? (the hashers for the first two should not be hard).  At
> that point, I would suggest that a suitably cleaned up version of your
> code should replace Guile hash tables entirely; even a 3% or 5%
> performance overhead is a small price to pay for auto-resizing.
> 

well, all we have to do there is rewrite the basic procedures to accept
different equality predicates, then write a bunch of wrapper functions,
e.g.
  hashtab_setqx(table, key, value)
would call
  hashtab_basic_setx(table, key, value, scm_eq_p, my_hasherq)

of course, the hasher functions would all be quite different.
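
In Scheme the idea might look like this; hashtab-basic-set! and the slot
layout are made up for illustration, and I'm assuming hashers with Guile's
(hashq key size) calling convention (resize bookkeeping omitted):

  ;; look up key in an alist bucket with an arbitrary predicate
  (define (bucket-assoc pred key bucket)
    (cond ((null? bucket) #f)
          ((pred key (caar bucket)) (car bucket))
          (else (bucket-assoc pred key (cdr bucket)))))

  (define (hashtab-basic-set! h key value pred hasher)
    (let* ((buckets (vector-ref h 3))
           (i       (hasher key (vector-length buckets)))
           (bucket  (vector-ref buckets i))
           (pair    (bucket-assoc pred key bucket)))
      (if pair
          (set-cdr! pair value)
          (vector-set! buckets i (cons (cons key value) bucket)))))

  ;; each wrapper just fixes a predicate/hasher pair
  (define (hashtab-setq! h key value)
    (hashtab-basic-set! h key value eq? hashq))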

> Another reason that not considering the ultimate hash range in the
> hash function may be a lose: the obvious way to implement `hashq'
> with no mod is to cast the SCM value to an int; this will never
> collide in a word-sized space but almost certainly will in many
> common hashtable sizes.

My current tests use a big file of unique, very large random numbers.
Even so, the bigger the hashtable, the bigger the expected largest
bucket size (relative to the average bucket size).
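
To make that concrete: heap-allocated SCM values are word aligned, so a
raw-address hash reduced into a table size that shares a factor with the
alignment can only reach a fraction of the buckets.  A toy illustration
(the numbers are just examples):

  ;; with 8-byte-aligned addresses, (modulo address size) can only
  ;; land on buckets whose index is a multiple of 8
  (define (reachable-buckets size alignment)
    (quotient size (gcd size alignment)))

  (reachable-buckets 1024 8)  ; => 128, so 7/8 of the buckets go unused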

	Jay
	jglascoe@jay.giss.nasa.gov