This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: [1.7] BUG - GREP slows to a crawl with large number of matches on a single file


On Nov  6 16:00, Thomas Wolff wrote:
> Corinna Vinschen wrote:
> >I created a simple testcase:
> >
> >==== SNIP ===
> >...
> >==== SNAP ====
> I extended your test program to demonstrate the inefficiency of the
> standard mbrtowc function. [...]
> >Under Cygwin (tcsh time output):
> >
> >  $ setenv LANG en_US.UTF-8
> >  $ time ./mb 1000000 1 0
> >  with malloc: 1, with mbrtowc: 0
> >  0.328u 0.031s 0:00.34 102.9%    0+0k 0+0io 1834pf+0w
> >  $ time ./mb 1000000 0 1
> >  with malloc: 0, with mbrtowc: 1
> >  1.921u 0.092s 0:02.09 96.1%     0+0k 0+0io 1827pf+0w
> >  $ time ./mb 1000000 1 1
> >  with malloc: 1, with mbrtowc: 1
> >  2.062u 0.140s 0:02.15 102.3%    0+0k 0+0io 1839pf+0w
> >
> >Running on the same CPU under Linux:
> >
> >  $ setenv LANG en_US.UTF-8
> >  $ time ./mb 1000000 1 0
> >  with malloc: 1, with mbrtowc: 0
> >  0.088u 0.004s 0:00.09 88.8%     0+0k 0+0io 0pf+0w
> >  $ time ./mb 1000000 0 1
> >  with malloc: 0, with mbrtowc: 1
> >  1.836u 0.000s 0:01.85 98.9%     0+0k 0+0io 0pf+0w
> >  $ time ./mb 1000000 1 1
> >  with malloc: 1, with mbrtowc: 1
> >  1.888u 0.000s 0:01.93 97.4%     0+0k 0+0io 0pf+0w
> >
> >So, while Linux is definitely faster, the numbers are still comparable
> >for 1 million iterations.  That still doesn't explain why grep is
> >many times slower when using UTF-8 as charset.
> Results of mbrtowc vs. utftouni on Linux:
> 
> thw[en_US.UTF-8]@scotty:~/tmp: locale charmap
> UTF-8
> thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 1 0
> with malloc: 0, with mbrtowc: 1, with utftouni: 0
> 
> real    0m2.897s
> user    0m2.836s
> sys     0m0.012s
> thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 0 1
> with malloc: 0, with mbrtowc: 0, with utftouni: 1
> 
> real    0m0.030s
> user    0m0.028s
> sys     0m0.000s
> thw[en_US.UTF-8]@scotty:~/tmp:
> [...]
> The conclusion is, as long as calling mbrtowc is this inefficient, a
> program caring about performance should not use it.

That's sort of an unfair test.  Your utftouni function doesn't handle
mbstate, errors, or surrogate pairs, all of which mbrtowc has to do.
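
To illustrate what that extra handling involves, here is a minimal sketch
of a UTF-8 decoder that checks for errors and produces surrogate pairs for
a 16-bit wchar_t, as on Cygwin.  The names utf8_decode and utf16_encode are
hypothetical; this is neither the utftouni function from the quoted mail
nor newlib's __utf8_mbtowc.  It also keeps no mbstate between calls and
only reports a truncated sequence, so it still does less than mbrtowc.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not newlib code.
   Decode one UTF-8 sequence from s[0..n).
   Returns the number of bytes consumed (1..4),
   0 if the input ends in the middle of a sequence with no error seen yet,
   or -1 on invalid input. */
static int utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    int len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                       /* ASCII fast path */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else
        return -1;                           /* stray continuation byte or 0xF8..0xFF */

    for (i = 1; i < len; i++) {
        if ((size_t)i >= n)
            return 0;                        /* truncated: caller must supply more bytes */
        if ((s[i] & 0xC0) != 0x80)
            return -1;                       /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_cp[len])
        return -1;                           /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF)
        return -1;                           /* surrogate code point encoded in UTF-8 */
    if (c > 0x10FFFF)
        return -1;                           /* beyond the Unicode range */
    *cp = c;
    return len;
}

/* Hypothetical helper: store cp as one or two UTF-16 units.
   Returns the number of units written.  cp must be a valid code point. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

A caller would loop over the buffer, advance by the returned length, and
decide for itself what to do on 0 (fetch more input) or -1 (skip or abort).
Each of these checks adds branches to the hot path, which is part of why a
decoder without them looks so much faster in a benchmark.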

Having said that, I just experimented further with mbrtowc, and I was
able to speed up mbrtowc and wcrtomb calls on Cygwin by almost 50 per
cent, just by reducing the function call depth in newlib.  That call
depth is the result of reentrancy and isolation efforts.

Speaking of your implementation, if you could come up with a faster
implementation of newlib's __utf8_wctomb/__utf8_mbtowc, it would
certainly be another welcome performance boost.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

