1.7] BUG - GREP slows to a crawl with large number of matches on a single file

Fri Nov 6 15:17:00 GMT 2009

Corinna Vinschen wrote:
> On Nov  6 06:56, Eric Blake wrote:
>   
>> According to Corinna Vinschen on 11/6/2009 6:51 AM:
>>     
>>>> The problem *is* with grep (and sed), however, because there is no
>>>> good reason that UTF-8 should give us a penalty of being 100times
>>>> slower on most search operations, this is just poor programming of
>>>> grep and sed.
>>>>         
>>> The penalty on Linux is much smaller, about 15-20%.  
On my Linux system, the penalty is a factor between 6 and 8.
>>> It looks like
>>> grep is calling malloc for every input line if MB_CUR_MAX is > 1.
>>> Then it evaluates for each byte in the line whether the byte is a
>>> single byte or the start of a multibyte sequence using mbrtowc on
>>> every charatcer on the input line.  Then, for each potential match,
>>> it checks if it's the start byte of a multibyte sequence and ignores
>>> all other matches.  Eventually, it calls free, and the game starts
>>> over for the next line.
>>>       
>> Adding bug-grep, since this slowdown caused by additional mallocs is
>> definitely the sign of a poor algorithm that could be improved by reusing
>> existing buffers.
>>     
>
> I created a simple testcase:
>
> ==== SNIP ===
> ...
> ==== SNAP ====
>   
I extended your test program to demonstrate the inefficiency of the 
standard mbrtowc function. Instead I use a function from my editor 
(mined) to extract a Unicode character from a UTF-8 sequence. This is 
the simple case only, not converting character sets other than UTF-8 but 
that's the same thing mbrtowc does in the sample invocation. Program 
attached. Results below.
> Under Cygwin (tcsh time output):
>
>   $ setenv LANG en_US.UTF-8
>   $ time ./mb 1000000 1 0
>   with malloc: 1, with mbrtowc: 0
>   0.328u 0.031s 0:00.34 102.9%    0+0k 0+0io 1834pf+0w
>   $ time ./mb 1000000 0 1
>   with malloc: 0, with mbrtowc: 1
>   1.921u 0.092s 0:02.09 96.1%     0+0k 0+0io 1827pf+0w
>   $ time ./mb 1000000 1 1
>   with malloc: 1, with mbrtowc: 1
>   2.062u 0.140s 0:02.15 102.3%    0+0k 0+0io 1839pf+0w
>
> Running on the same CPU under Linux:
>
>   $ setenv LANG en_US.UTF-8
>   $ time ./mb 1000000 1 0
>   with malloc: 1, with mbrtowc: 0
>   0.088u 0.004s 0:00.09 88.8%     0+0k 0+0io 0pf+0w
>   $ time ./mb 1000000 0 1
>   with malloc: 0, with mbrtowc: 1
>   1.836u 0.000s 0:01.85 98.9%     0+0k 0+0io 0pf+0w
>   $ time ./mb 1000000 1 1
>   with malloc: 1, with mbrtowc: 1
>   1.888u 0.000s 0:01.93 97.4%     0+0k 0+0io 0pf+0w
>
> So, while Linux is definitely faster, the number are still comparable
> for 1 million iterations.  That still doens't explain why grep is a
> multitude slower when using UTF-8 as charset.
>   
Results of mbrtowc vs. utftouni on Linux:

thw[en_US.UTF-8]@scotty:~/tmp: locale charmap
UTF-8
thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 1 0
with malloc: 0, with mbrtowc: 1, with utftouni: 0

real    0m2.897s
user    0m2.836s
sys     0m0.012s
thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 0 1
with malloc: 0, with mbrtowc: 0, with utftouni: 1

real    0m0.030s
user    0m0.028s
sys     0m0.000s
thw[en_US.UTF-8]@scotty:~/tmp:

Results of mbrtowc vs. utftouni on cygwin:

demsn702[C.UTF-8]@stbm8186:/H/tools: time ./uu.exe 1000000 0 1 0
with malloc: 0, with mbrtowc: 1, with utftouni: 0

real    0m3.034s
user    0m2.974s
sys     0m0.030s
demsn702[C.UTF-8]@stbm8186:/H/tools: time ./uu.exe 1000000 0 0 1
with malloc: 0, with mbrtowc: 0, with utftouni: 1

real    0m0.110s
user    0m0.070s
sys     0m0.030s
demsn702[C.UTF-8]@stbm8186:/H/tools:

The conclusion is, as long as calling mbrtowc is as inefficient, a 
program caring about performance should not use it.

Thomas
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: uu.c
URL: <http://cygwin.com/pipermail/cygwin/attachments/20091106/e2118aa1/attachment.c>
-------------- next part --------------
--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple