1.7] BUG - GREP slows to a crawl with large number of matches on a single file

Fri Nov 6 13:58:00 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

According to Corinna Vinschen on 11/6/2009 6:51 AM:
>> The problem *is* with grep (and sed), however, because there is no
>> good reason that UTF-8 should give us a penalty of being 100 times
>> slower on most search operations; this is just poor programming of
>> grep and sed.
> 
> The penalty on Linux is much smaller, about 15-20%.  It looks like
> grep is calling malloc for every input line if MB_CUR_MAX is > 1.
> Then it evaluates for each byte in the line whether the byte is a
> single byte or the start of a multibyte sequence using mbrtowc on
> every character on the input line.  Then, for each potential match,
> it checks if it's the start byte of a multibyte sequence and ignores
> all other matches.  Eventually, it calls free, and the game starts
> over for the next line.

Adding bug-grep, since this slowdown caused by additional mallocs is
definitely the sign of a poor algorithm that could be improved by reusing
existing buffers.
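
To make the suggestion concrete, here is a rough sketch of the kind of
buffer reuse I have in mind.  It is not grep's actual code: the names
mb_scratch and mark_char_starts are made up for illustration, and it
assumes the per-line work is just marking which bytes start a character.
The scratch buffer only grows, so after the longest line has been seen
there are no further allocations, and it is freed once at the end
rather than once per line.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical scratch buffer kept alive across input lines. */
struct mb_scratch {
    unsigned char *starts;  /* starts[i] != 0 if byte i begins a character */
    size_t cap;             /* number of bytes allocated for starts */
};

/* Mark the character start bytes of line[0..len) in s->starts.
   Returns 0 on success, -1 if memory could not be obtained.
   Assumes the caller has already selected the locale with setlocale. */
static int mark_char_starts(struct mb_scratch *s, const char *line, size_t len)
{
    if (len == 0)
        return 0;

    /* Grow only when this line is longer than any line seen so far. */
    if (len > s->cap) {
        size_t ncap = s->cap ? s->cap : 128;
        while (ncap < len)
            ncap *= 2;          /* overflow check omitted in this sketch */
        unsigned char *p = realloc(s->starts, ncap);
        if (p == NULL)
            return -1;
        s->starts = p;
        s->cap = ncap;
    }
    memset(s->starts, 0, len);

    mbstate_t st;
    memset(&st, 0, sizeof st);

    size_t i = 0;
    while (i < len) {
        size_t n = mbrtowc(NULL, line + i, len - i, &st);
        if (n == (size_t)-1 || n == (size_t)-2 || n == 0) {
            /* Invalid or incomplete sequence, or a NUL byte:
               treat it as a single byte and reset the state. */
            n = 1;
            memset(&st, 0, sizeof st);
        }
        s->starts[i] = 1;
        i += n;
    }
    return 0;
}
```

The caller would declare one "struct mb_scratch s = {0, 0};" before the
read loop, pass &s for every line, and call free(s.starts) after the
last line.  This still runs mbrtowc over every line, so it only
addresses the malloc/free churn, not the per-byte scan itself.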

- --
Don't work too hard, make some time for fun as well!

Eric Blake             ebb9@byu.net
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkr0KxUACgkQ84KuGfSFAYCOCACgvjz2v65vK8DIcGg6zfnLQgcT
tfQAmwbpWbriBJSv0rjYobYgsh4KXOiZ
=B3nZ
-----END PGP SIGNATURE-----
