This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug libc/4403] New: strfry() gives skewed distributions
- From: "aurelien at aurel32 dot net" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sources dot redhat dot com
- Date: 21 Apr 2007 21:36:14 -0000
- Subject: [Bug libc/4403] New: strfry() gives skewed distributions
- Reply-to: sourceware-bugzilla at sourceware dot org
strfry() tries to shuffle its string using random swaps, but it uses the
wrong strategy, and thus not all permutations are equally likely. The
code doing the shuffling itself looks like this:
len = strlen (string);
for (i = 0; i < len; ++i)
{
int32_t j;
char c;
__random_r (&rdata, &j);
j %= len;
c = string[i];
string[i] = string[j];
string[j] = c;
}
In other words, for the string 'abc' j will always be between 0 and 2
(inclusive), giving the following possibilities (all equally likely):
j0 j1 j2 result
0 0 0 cab
0 0 1 bca
0 0 2 bac
0 1 0 cba
0 1 1 acb
0 1 2 abc
0 2 0 bca
0 2 1 abc
0 2 2 acb
1 0 0 cba
1 0 1 acb
1 0 2 abc
1 1 0 cab
1 1 1 bca
1 1 2 bac
1 2 0 acb
1 2 1 bac
1 2 2 bca
2 0 0 acb
2 0 1 bac
2 0 2 bca
2 1 0 abc
2 1 1 cab
2 1 2 cba
2 2 0 bac
2 2 1 cba
2 2 2 cab
Sorting and counting gives us the following distribution:
4 abc
5 acb
5 bac
5 bca
4 cab
4 cba
In other words, this is clearly skewed; some strings will appear 25% more often
than others.
Steinar H. Gunderson <sgunderson@bigfoot.com> proposed the following patch:
--- string/strfry.c 2007-04-21 23:12:47.000000000 +0200
+++ string/strfry.c 2007-04-21 23:22:46.000000000 +0200
@@ -38,17 +38,17 @@
}
len = strlen (string);
- for (i = 0; i < len; ++i)
+ for (i = 0; i < len - 1; ++i)
{
int32_t j;
char c;
__random_r (&rdata, &j);
- j %= len;
+ j %= (len - i);
c = string[i];
- string[i] = string[j];
- string[j] = c;
+ string[i] = string[j + i];
+ string[j + i] = c;
}
return string;
It turns strfry() into a proper Fisher-Yates shuffle. This gives exactly
n! paths for a string with N characters, and since there are n! possible
permutations, this means a one-to-one mapping, and all possibilities are
equally likely.
--
Summary: strfry() gives skewed distributions
Product: glibc
Version: unspecified
Status: NEW
Severity: normal
Priority: P2
Component: libc
AssignedTo: drepper at redhat dot com
ReportedBy: aurelien at aurel32 dot net
CC: glibc-bugs at sources dot redhat dot com
GCC build triplet: x86_64-unknown-linux-gnu
GCC host triplet: x86_64-unknown-linux-gnu
GCC target triplet: x86_64-unknown-linux-gnu
http://sourceware.org/bugzilla/show_bug.cgi?id=4403
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.