This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Patricia Trie pseudo-test


On Tue, 2006-08-22 at 01:47 -0400, John Moser wrote:

[...]
> I'm still thinking this is probably a win
> since we're unlikely to branch 8 times from a common prefix and even if
> we do we're suddenly faced with a situation where we CAN assume we're
> winning something*; Drepper are you around, I could use your input.

So I had some math run.


You have N string elements each sharing a common prefix of X characters.

If you arrange these in a trie, you have N branches to follow after the
prefix.

Each direction tested in a branch is identical in operation to a
character comparison.  (I designed the storage format that way
specifically.)

Every string is searched for.  The linker eventually goes through all
symbols, taking every possible search path.

If the strings are in a linked list, you compare the common prefixes 1+2
+...+N times, making X comparisons each time, totaling (1+2+...+N)*X or
X*((N+1) * N)/2 character comparisons.

If the strings are in a trie, you compare the common prefixes N times
and compare each branch direction ((N+1)*N)/2 times, totaling N*X + ((N
+1)*N)/2 branches.

These simplify to (N*(1 + N)*X)/2 (linked list) and (N*(1 + N))/2 + N*X
(patricia trie).

A little help from #math (who helped me figure out the simplified
versions) and we get:

<+Cale> % Reduce[{x*((n + 1)*n)/2 > n*x + ((n + 1)*n)/2, x > 0, n > 0}]
<mbot> Cale: x > 1 && n > (1 + x)/(-1 + x)

Now here's some interesting thoughts:

 X = 2; N > 3/1 (3)
 X = 3; N > 4/2 (2)
 X = 4; N > 5/3 (1.6)

Once we've passed prefixes 4 characters long, any branch will be a gain
because N will be 2 or more.  Also the more directions we branch in, the
more time we save; for example, take a prefix 3 characters long (let's
say namespace std, which is probably like _Z22std anyway) and branch it
3, 5, and 10 ways.
    LL     PT
3:  18     15
5:  45     30
10: 165    85

Or let's try a branch between 2 with prefixes 3, 5, 10, 20, and 50 long:
    LL     PT
3:  9      9
5:  15     13
10: 30     23
20: 60     43
50: 150    103

In other words, this is harmless if prefixes are longer than 2
characters; and a gain if prefixes are longer than 3 characters OR the
number of symbols in the bucket sharing that number of prefixes is
greater than 2.  This is EXTREMELY likely.

Of course, this is still all in theory.  But the numbers are pretty.
> 
-- 
John Moser <john.r.moser@gmail.com>

Attachment: signature.asc
Description: This is a digitally signed message part


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]