This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: [1.7] BUG - GREP slows to a crawl with large number of matches on a single file


On Nov  6 06:56, Eric Blake wrote:
> According to Corinna Vinschen on 11/6/2009 6:51 AM:
> >> The problem *is* with grep (and sed), however, because there is no
> >> good reason that UTF-8 should give us a penalty of being 100 times
> >> slower on most search operations, this is just poor programming of
> >> grep and sed.
> > 
> > The penalty on Linux is much smaller, about 15-20%.  It looks like
> > grep is calling malloc for every input line if MB_CUR_MAX is > 1.
> > Then it evaluates for each byte in the line whether the byte is a
> > single byte or the start of a multibyte sequence using mbrtowc on
> > every character on the input line.  Then, for each potential match,
> > it checks if it's the start byte of a multibyte sequence and ignores
> > all other matches.  Eventually, it calls free, and the game starts
> > over for the next line.
> 
> Adding bug-grep, since this slowdown caused by additional mallocs is
> definitely the sign of a poor algorithm that could be improved by reusing
> existing buffers.
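
For illustration, the per-line work described above, combined with the
buffer reuse Eric suggests, might look roughly like the sketch below.
This is not grep's actual code: struct char_map, char_map_reserve and
mark_char_starts are hypothetical names made up for this example.  The
sketch records which bytes of a line begin a character and keeps one
buffer alive across lines instead of calling malloc and free for each.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical reusable buffer: grows when needed, never shrinks.  */
struct char_map
{
  unsigned char *starts;   /* starts[i] != 0 if byte i begins a character */
  size_t capacity;         /* number of entries allocated in starts */
};

/* Make sure the map can hold len entries.  Returns 0 on success and
   -1 on allocation failure (the old buffer stays valid then).  */
int
char_map_reserve (struct char_map *map, size_t len)
{
  if (len > map->capacity)
    {
      unsigned char *p = realloc (map->starts, len);
      if (p == NULL)
        return -1;
      map->starts = p;
      map->capacity = len;
    }
  return 0;
}

/* Mark the first byte of every character in line[0..len), decoding
   with mbrtowc in the current locale.  Returns 0 on success and -1
   on allocation failure.  */
int
mark_char_starts (struct char_map *map, const char *line, size_t len)
{
  mbstate_t mbs;
  size_t i = 0;

  if (len == 0)
    return 0;
  if (char_map_reserve (map, len) != 0)
    return -1;
  memset (map->starts, 0, len);
  memset (&mbs, 0, sizeof mbs);

  while (i < len)
    {
      size_t n = mbrtowc (NULL, line + i, len - i, &mbs);
      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid or truncated sequence: treat this byte as a
             character of its own and resynchronize.  */
          memset (&mbs, 0, sizeof mbs);
          n = 1;
        }
      else if (n == 0)
        n = 1;             /* an embedded NUL occupies one byte */
      map->starts[i] = 1;
      i += n;
    }
  return 0;
}
```

A caller would initialize one struct char_map to { NULL, 0 } before the
read loop, call mark_char_starts for each line, and free map.starts
once at the end.  The allocation cost is then paid only when a line is
longer than any line seen before.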

I created a simple testcase:

==== SNIP ===
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <string.h>

int main (int argc, char **argv)
{
  const char in[] = "The quick brown fox jumps over the lazy dog";
  int line;
  size_t i;
  mbstate_t mbs;
  size_t mbclen;
  size_t size = sizeof (in);    /* includes the trailing NUL */
  wchar_t wc;
  int lines = argc > 1 ? atoi (argv[1]) : 1000;
  int do_malloc = 1;
  int do_mbrtowc = 1;

  if (argc > 2)
    do_malloc = atoi (argv[2]);
  if (argc > 3)
    do_mbrtowc = atoi (argv[3]);

  printf ("with malloc: %d, with mbrtowc: %d\n", do_malloc, do_mbrtowc);

  memset (&mbs, 0, sizeof mbs);
  for (line = 0; line < lines; ++line)
    {
      char *x = NULL;
      /* Mimic grep: one allocation per input line.  */
      if (do_malloc) x = malloc (size);
      /* Mimic grep: decode the line one character at a time.  */
      if (do_mbrtowc)
        for (i = 0; i < size; i += mbclen)
          {
            mbclen = mbrtowc (&wc, in + i, size - i, &mbs);
            /* Stop at the NUL (0) or on an invalid or incomplete
               sequence ((size_t) -1 or (size_t) -2).  */
            if (mbclen == 0 || mbclen == (size_t) -1
                || mbclen == (size_t) -2)
              break;
          }
      if (do_malloc) free (x);
    }
  return 0;
}
==== SNAP ====

Under Cygwin (tcsh time output):

  $ setenv LANG en_US.UTF-8
  $ time ./mb 1000000 1 0
  with malloc: 1, with mbrtowc: 0
  0.328u 0.031s 0:00.34 102.9%    0+0k 0+0io 1834pf+0w
  $ time ./mb 1000000 0 1
  with malloc: 0, with mbrtowc: 1
  1.921u 0.092s 0:02.09 96.1%     0+0k 0+0io 1827pf+0w
  $ time ./mb 1000000 1 1
  with malloc: 1, with mbrtowc: 1
  2.062u 0.140s 0:02.15 102.3%    0+0k 0+0io 1839pf+0w

Running on the same CPU under Linux:

  $ setenv LANG en_US.UTF-8
  $ time ./mb 1000000 1 0
  with malloc: 1, with mbrtowc: 0
  0.088u 0.004s 0:00.09 88.8%     0+0k 0+0io 0pf+0w
  $ time ./mb 1000000 0 1
  with malloc: 0, with mbrtowc: 1
  1.836u 0.000s 0:01.85 98.9%     0+0k 0+0io 0pf+0w
  $ time ./mb 1000000 1 1
  with malloc: 1, with mbrtowc: 1
  1.888u 0.000s 0:01.93 97.4%     0+0k 0+0io 0pf+0w

So, while Linux is definitely faster, the numbers are still comparable
for 1 million iterations.  That still doesn't explain why grep is
many times slower when using UTF-8 as charset.
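
One thing that may be worth checking when repeating these measurements:
standard C starts every program in the "C" locale, and the LANG
setting only takes effect once the program calls setlocale.  The
testcase does not call it.  Whether that changes the numbers on a given
platform is not established by the runs above.  The sketch below is one
way to see which locale mbrtowc is actually working in; report_locale
and use_environment_locale are hypothetical helper names.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Print the active LC_CTYPE locale and MB_CUR_MAX under a label,
   and return MB_CUR_MAX.  */
int
report_locale (const char *label)
{
  const char *name = setlocale (LC_CTYPE, NULL);   /* query only */
  int max = (int) MB_CUR_MAX;

  printf ("%s: LC_CTYPE=%s, MB_CUR_MAX=%d\n",
          label, name ? name : "(unknown)", max);
  return max;
}

/* Adopt the locale named by the environment (LANG, LC_*).
   Returns nonzero on success, 0 if the locale is not available.  */
int
use_environment_locale (void)
{
  return setlocale (LC_ALL, "") != NULL;
}
```

Calling report_locale ("startup") at the top of main, then
use_environment_locale (), then report_locale ("after setlocale")
shows whether the multibyte code path is really being exercised.  If
MB_CUR_MAX stays at 1, mbrtowc is decoding single-byte characters only.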


Puzzled,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


