This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: [1.7] BUG - GREP slows to a crawl with large number of matches on a single file


On Nov  6 14:09, Thomas Wolff wrote:
> Christopher Faylor wrote:
> >On Thu, Nov 05, 2009 at 07:11:02PM -0800, Linda Walsh wrote:
> >>aputerguy wrote:
> >>>Running grep on a 20MB file with ~100,000 matches takes an incredible almost
> >>>8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5
> >>>(on a 2nd machine).
> >>I've seen nasty behavior with grep that isn't cygwin specific.  Try
> >>"pcregrep" and see if you have the same issue.
> >>
> >>I found it to be about ~100 times faster under _some_ searches though
> >>2-3x is more typical.  The gnu re-parser isn't real efficient under
> >>some circumstances.
> >>
> >>If you find a big difference, you might also want to report it to the
> >>bug-grep@gnu.org mailing list, but last time I did, they told me
> >>"that's the way it is" due to some posix conformance thing...
> >
> >The fact that it behaves differently between Cygwin 1.5 and 1.7 would
> >suggest that this isn't a grep problem.
> This is likely to be triggered by the transition to UTF-8 as a
> default charset. The same problem is observed on Linux, with grep as
> well as with sed.
> That's why I have changed most of my shell scripts to use something like
> LC_ALL=C grep or LC_ALL=C sed
> where possible. Please try this.

Or try LANG=C.ASCII since LANG=C will still return UTF-8 as charset
when calling nl_langinfo(CHARSET).
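
A minimal sketch for checking which charset a given locale setting
actually reports, written in Python for brevity.  Python's
locale.nl_langinfo(locale.CODESET) wraps the C call nl_langinfo(CODESET),
CODESET being the POSIX name of the item called CHARSET above.  The helper
name reported_charset is hypothetical, locale.nl_langinfo is not available
on every platform, and the result depends on the locale data installed on
the system, so no expected output is shown.

```python
import locale


def reported_charset(locale_name):
    """Return the charset nl_langinfo reports under locale_name,
    or None if this system does not provide that locale."""
    saved = locale.setlocale(locale.LC_CTYPE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        return locale.nl_langinfo(locale.CODESET)
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # always restore


if __name__ == "__main__":
    # Locale names taken from this thread, plus one UTF-8 locale for contrast.
    for name in ("C", "C.ASCII", "C.UTF-8"):
        print(name, "->", reported_charset(name))
```

If the value printed for a locale is a multibyte charset such as UTF-8,
tools run under that locale will take their multibyte code paths.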

> The problem *is* with grep (and sed), however, because there is no
> good reason that UTF-8 should give us a penalty of being 100 times
> slower on most search operations, this is just poor programming of
> grep and sed.

The penalty on Linux is much smaller, about 15-20%.  It looks like
grep is calling malloc for every input line if MB_CUR_MAX is > 1.
Then it evaluates for each byte in the line whether the byte is a
single byte or the start of a multibyte sequence using mbrtowc on
every character on the input line.  Then, for each potential match,
it checks if it's the start byte of a multibyte sequence and ignores
all other matches.  Eventually, it calls free, and the game starts
over for the next line.
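
The per-line work described above can be sketched roughly as follows.
This is an illustration in Python, not grep's actual code: the names
char_start_flags and find_matches are hypothetical, and a hard-coded
UTF-8 continuation-byte test stands in for the mbrtowc call, which in
the real program works for whatever charset the locale selects.

```python
def char_start_flags(line: bytes) -> bytearray:
    """Build a per-line table: 1 where a character starts, 0 on the
    continuation bytes of a multibyte sequence (UTF-8 assumed)."""
    flags = bytearray(len(line))  # fresh allocation for every input line
    for i, b in enumerate(line):
        # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
        flags[i] = 0 if (b & 0xC0) == 0x80 else 1
    return flags


def find_matches(line: bytes, needle: bytes) -> list:
    """Byte-level search that keeps only matches beginning at a
    character boundary and ignores all others."""
    flags = char_start_flags(line)
    hits = []
    pos = line.find(needle)
    while pos != -1:
        if flags[pos]:  # match starts on a character start byte
            hits.append(pos)
        pos = line.find(needle, pos + 1)
    return hits  # the flags table is discarded here, once per line


if __name__ == "__main__":
    line = "café".encode("utf-8")  # b'caf\xc3\xa9'
    print(find_matches(line, "é".encode("utf-8")))  # starts at a boundary
    print(find_matches(line, b"\xa9"))  # starts inside a character
```

The point of the sketch is the cost structure: one allocation and one
full classification pass per line, before any match is even considered.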

It appears that either our malloc is that slow, or the mbrtowc call.
But I can't really believe the latter.  The function should be quite
fast, as far as I can see...


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


