This is the mail archive of the guile@sourceware.cygnus.com mailing list for the Guile project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: hash?


Lars Arvestad <arve@nada.kth.se> writes:

> I am surprised only four characters are used. It can't be that
> expensive to use more. In fact, if I just picked a ready-made
> hash-function, I would be disappointed if not the whole string was
> used to compute the key. You want a fast hash function? Then make one
> yourself, because you know the data! 

scm_strhash is used for hashing guile's symbol tables as well, so
there is a motivation for wanting it to be really fast. It likely
shouldn't be the function used for arbitrary string hashes, though.

> 
> >> Looking at the code I'm curious about the selection of initial hash as
> >> 264 % table size; I'd think it'd be better to make the generated hash
> >> totally dependant on the value being hashed, rather than a fixed start
> >> value for each hash for a particular size table (264 might not be the
> >> best value to go there, either, but it's been a while since I've
> >> looked deeply at hashes, so I might just be being prime-centric).
> 
> MS> True - making hash independant on the table size would also allow
> MS> storing the full (nondivided) hashes into the table for dynamic
> MS> resize/rehash.
> 
> 264 seems to be totally arbitrary and only chosen to offset long
> strings from the short ones. I don't see why 256 was not chosen, or
> why a prime value would make a difference. 

I was thinking it'd be less likely to end up with a poor initial value
in the small table case; however, it's hard to say what, exactly, a
poor initial value would be ;). Using an actual character from the
string still might be a more sensible choice than 264.

> Regarding the code, isn't it possible to simply cast a block of four
> characters as a long? I realize that there can be word alignment
> problems, but given a string pointer, that should not be an
> issue. This way, four bytes/characters could be processed in one
> read. 

Yep, but the ease of getting the characters probably isn't as much the
reason that it only uses a small number of characters than the desire
to keep the total number of other operations down.

-- 
Gregh

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]