This is the mail archive of the cygwin mailing list for the Cygwin project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file


On Nov  8 11:30, Corinna Vinschen wrote:
> On Nov  7 15:26, aputerguy wrote:
> > 
> > Changing LC_ALL also solved the problem for me.
> > But it begs the question of how many other basic and take-for-granted
> > functions might be affected by this apparent UTF-8 slowdown. And again we,
> > are not talking about some minor overhead, we are talking about a slowdown
> > of 1500X or 150,000%!!!!
> 
> Yeah, that's really still strange to me.  In my testing, the multibyte
> to widechar conversion performed by grep in case of UTF-8 took only
> 1.5 up to 4 seconds for 10 times the number of input lines as in your
> case.  It still puzzles me where the time is wasted in grep.

Got it.  The problem is this.

Grep reads the file in chunks > pagesize.  Pagesize is 64K on Cygwin.
For each buffer read into memory, the grepbuf() function calls the
execute() function as long as it returns a match.

The execute() function calls check_multibyte_string() for the entire
buffer(!), then calls kwsexec() to find a match.  If a match has been
found, it free's the memory allocated by check_multibyte_string()
and returns to grepbuf.  Then grepbuf() will call execute again with
the pointers into the buffer moved to the next line.

Let's make an example.  Assume the buffer is 100K, which is not unusual
when running under Cygwin.  Assume further that the file consists of
100,000 lines with the text "The quick brown fox jumped over the lazy dog".
Each line is 45 bytes, so the buffer contains somwhat more than 2200
lines.  Now let's search for the expression "dog".

The first call to execute will call check_multibyte_string() for the
entire buffer of 100000 bytes.  Then it finds a match in the first line,
free's the check_multibyte_string() memory and returns to grepbuf.
grepbuf calls execute with the start pointer moved to the second line in
the buffer, so execute() calls check_multibyte_string() for the remainder
of the buffer, which is 99955 bytes.  It find a match in the first line,
free's the check_multibyte_string buffer, returns to grepbuf, which calls
execute, which calls check_multibyte_string() with a buffer of 99910
bytes, and so on...

Every invocation of check_multibyte_string() calls mbrtowc() in a loop
for the entire buffer given as argument.  In our example, that means
that mbrtowc() is called (hold your breath)

  111,161,115

times for each 100K of input file!  No wonder grep takes 3 or 4 minutes
to grep this very example on Cygwin.

I really think there's some room for optimization left in this algorithm.

Btw., the check for mmap in grep's configure file is broken.  It tries
to mmap to a fixed address formerly allocated via malloc().  This doesn't
work on Windows.  An autoconf run with a newer version of autoconf would
be nice.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]