This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: [1.7] BUG - GREP slows to a crawl with large number of matches on a single file


Corinna Vinschen wrote:
> On Nov 6 16:00, Thomas Wolff wrote:
>> ...
>> I extended your test program to demonstrate the inefficiency of the
>> standard mbrtowc function. [...]

I later had to correct:
> Anyway, corrected results are still by a factor of 3 to 4 in favor of my algorithm.

Corinna wrote:
> That's sort of an unfair test. Your utftouni function doesn't care for
> mbstate, error, and surrogate pair handling.

This is a question of use cases:
* mbstate is needed, for example, if you feed the results of read(), which may arrive in arbitrary chunks, directly into mbtowc(). It is not needed if you only convert complete lines of text at once. The stdlib function is more generic, and therefore more complicated, than many applications require.
* Error handling is present in my function, in simplified form: for the test case, all incorrect sequences are mapped to 0, but they could just as well return an error indication without any performance impact.
* Surrogate pair handling is only needed if you pass the string to or from the Windows API. It is not needed for POSIX applications, provided wchar_t is sufficiently wide. So if wchar_t can be widened in the newlib API, it might be useful to have two implementations: one for applications (without surrogates) and one for Cygwin itself.
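
To illustrate the use case described in the list above, here is a minimal sketch of a less generic decoder that converts one complete line at a time, keeps no mbstate, produces no surrogate pairs, and reports errors through its return value. The names utf8_decode_one and utf8_decode_line are hypothetical; this is not the utftouni function from the test program and not newlib code, and it assumes the caller provides an output buffer with room for at least n code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): decode one UTF-8 sequence
   from s, which holds n bytes, into *out.
   Returns the number of bytes consumed (1..4), or 0 if the sequence
   is invalid or truncated. No conversion state is kept between calls. */
size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *out)
{
    size_t len, i;
    uint32_t cp, min;
    unsigned char c;

    if (n == 0)
        return 0;
    c = s[0];
    if (c < 0x80) {                     /* ASCII fast path */
        *out = c;
        return 1;
    }
    if ((c & 0xe0) == 0xc0) {           /* two-byte sequence */
        len = 2; cp = c & 0x1f; min = 0x80;
    } else if ((c & 0xf0) == 0xe0) {    /* three-byte sequence */
        len = 3; cp = c & 0x0f; min = 0x800;
    } else if ((c & 0xf8) == 0xf0) {    /* four-byte sequence */
        len = 4; cp = c & 0x07; min = 0x10000;
    } else {
        return 0;                       /* stray continuation or invalid lead byte */
    }
    if (n < len)
        return 0;                       /* truncated: the line was not complete */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;                   /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3f);
    }
    /* Reject overlong forms, values beyond U+10FFFF, and encoded surrogates. */
    if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
        return 0;
    *out = cp;
    return len;
}

/* Hypothetical helper: decode a complete line of n bytes into dst,
   which must have room for at least n elements.
   Returns the number of code points, or (size_t)-1 on the first error. */
size_t utf8_decode_line(const char *line, size_t n, uint32_t *dst)
{
    const unsigned char *s = (const unsigned char *)line;
    size_t count = 0;

    while (n > 0) {
        size_t used = utf8_decode_one(s, n, &dst[count]);
        if (used == 0)
            return (size_t)-1;
        s += used;
        n -= used;
        count++;
    }
    return count;
}
```

The output type is uint32_t rather than wchar_t so that the sketch does not depend on how wide wchar_t is on a given platform.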


> Having said that, I just experimented further with mbrtowc, and I was
> able to speed up mbrtowc and wcrtomb calls on Cygwin by almost
> 50 per cent, just by reducing the function call depth in newlib,
> which is the result of reentrancy and isolation efforts.

Great! That comes close to my corrected results :-[

> Talking about your implementation, if you could come up with a faster
> implementation of newlib's __utf8_wctomb/__utf8_mbtowc, it would
> certainly be another welcome performance boost.

A quick look at those functions doesn't reveal much potential, except for tiny optimizations like
- if (ch >= 0xe0 && ch <= 0xef) /* three-byte sequence */
+ if ((ch & 0xf0) == 0xe0) /* three-byte sequence */
But even that, given the way the compiler optimizes expressions, is probably not an improvement.
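
A small sketch to check the two forms of the three-byte lead test against each other. In C, == binds more tightly than &, so the mask comparison needs its own parentheses; without them the expression is parsed as ch & (0xf0 == 0xe0), which is always 0. The function names below are hypothetical and only serve the comparison; they are not taken from newlib.

```c
/* Range form of the three-byte lead-byte test. */
int is_lead3_range(unsigned char ch)
{
    return ch >= 0xe0 && ch <= 0xef;
}

/* Mask form, with the parentheses that C precedence requires. */
int is_lead3_mask(unsigned char ch)
{
    return (ch & 0xf0) == 0xe0;
}

/* How the compiler parses the mask form when the parentheses are
   left out: the comparison 0xf0 == 0xe0 is evaluated first and
   yields 0, so the whole expression is ch & 0. */
int is_lead3_unparenthesized(unsigned char ch)
{
    return ch & (0xf0 == 0xe0);
}

/* Returns 1 if the range form and the mask form agree for every
   possible byte value, 0 otherwise. */
int lead3_forms_agree(void)
{
    int b;

    for (b = 0; b < 256; b++) {
        unsigned char ch = (unsigned char)b;
        if (is_lead3_range(ch) != is_lead3_mask(ch))
            return 0;
    }
    return 1;
}
```

Whether the mask form is actually faster depends on the compiler; a range test with constant bounds is commonly reduced to a single subtraction and unsigned comparison, so any difference would have to be measured.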


Also, I remember that some recent trouble was fixed by your tweaking of the wide character functions, so it is probably better not to touch them again.

My main point was that, depending on the use case, some applications would be better off using less generic, optimized functions.
grep and sed would certainly be well advised to do that.


Thomas
