This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug regex/13637] New: incorrect match in multi-byte (non-UTF8) string


http://sourceware.org/bugzilla/show_bug.cgi?id=13637

             Bug #: 13637
           Summary: incorrect match in multi-byte (non-UTF8) string
           Product: glibc
           Version: 2.15
            Status: NEW
          Severity: normal
          Priority: P2
         Component: regex
        AssignedTo: drepper.fsp@gmail.com
        ReportedBy: leonardo@ngdn.org
    Classification: Unclassified


Created attachment 6186
  --> http://sourceware.org/bugzilla/attachment.cgi?id=6186
reg.sh: a script to reproduce the problem

When a particular string composed of single-byte and multi-byte characters is
passed to re_search(), the function seems to lose track of which bytes belong
to multi-byte characters and returns an incorrect match. This seems to happen
only in the ja_JP.eucjp locale.

The problem can be reproduced when the following string:

  aaa\xb7\xefa\xbf\xb7\xbd\xe8

... is matched against the pattern:

  \xb7\xbd

The two bytes in the pattern are, respectively, the last byte of the second
multi-byte char and the first byte of the third multi-byte char in the
original string. The pattern therefore only occurs across a character
boundary, and no match should be reported.
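
To make the character-boundary argument concrete, here is a minimal sketch in
C that does not involve glibc's regex code at all. It assumes the usual EUC-JP
layout (0x8e introduces a two-byte character, 0x8f a three-byte character, and
a lead byte in 0xa1..0xfe starts a two-byte character). The function names
eucjp_char_len, byte_search and char_search are made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Length in bytes of the EUC-JP character starting at s[i].
   Truncated or malformed sequences are treated as one stray byte. */
size_t eucjp_char_len(const unsigned char *s, size_t len, size_t i)
{
    unsigned char c = s[i];
    size_t n = 1, k;

    if (c == 0x8f)
        n = 3;                          /* SS3 prefix: three-byte character */
    else if (c == 0x8e || (c >= 0xa1 && c <= 0xfe))
        n = 2;                          /* SS2 prefix or two-byte character */

    if (i + n > len)
        return 1;
    for (k = 1; k < n; k++)             /* loose check of the trailing bytes */
        if (s[i + k] < 0xa1 || s[i + k] > 0xfe)
            return 1;
    return n;
}

/* Plain byte-wise search: offset of the first hit, or -1. */
long byte_search(const unsigned char *s, size_t len,
                 const unsigned char *p, size_t plen)
{
    size_t i;

    for (i = 0; i + plen <= len; i++)
        if (memcmp(s + i, p, plen) == 0)
            return (long)i;
    return -1;
}

/* Same search, but only offsets that start a character are tried. */
long char_search(const unsigned char *s, size_t len,
                 const unsigned char *p, size_t plen)
{
    size_t i;

    for (i = 0; i < len; i += eucjp_char_len(s, len, i))
        if (i + plen <= len && memcmp(s + i, p, plen) == 0)
            return (long)i;
    return -1;
}
```

For the 10-byte string above, byte_search() finds the pattern at byte offset
7, while char_search() returns -1 because offset 7 is in the middle of the
character \xbf\xb7. Note that in C source the string has to be split as
"aaa\xb7\xef" "a\xbf\xb7\xbd\xe8", otherwise "\xefa" is read as a single hex
escape.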

The number of "a"s at the start of the original string seems to make all the
difference here. I could only reproduce the problem with exactly 3 or 4
leading "a"s. For example, if you remove one "a" from the start of the
original string:

  aa\xb7\xefa\xbf\xb7\xbd\xe8

... the problem no longer happens.

I'm attaching a script that reproduces the problem. The 'sed' version I'm using
is compiled with "--without-included-regex", so it should use glibc's regex
functions. Unfortunately I can't yet confirm that the bug is not in sed, but
I'm trying to create a self-contained program to demonstrate the problem.
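
As a rough idea of what such a self-contained program could look like, here is
a sketch that calls glibc's GNU regex interface directly. It is not the
attached reg.sh and has not been checked against sed's own call sequence. The
helper name search_in_locale, the return codes -3 and -4, and the choice of
RE_SYNTAX_POSIX_BASIC are assumptions made for illustration. It also assumes
the requested locale is installed on the system.

```c
#define _GNU_SOURCE             /* re_search() and friends are GNU extensions */
#include <locale.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Compile pat and search str under the given locale.
   Returns re_search()'s result (match offset, -1 for no match, -2 for an
   internal error), -3 if the locale cannot be set (hypothetical code), or
   -4 if the pattern does not compile (hypothetical code). */
int search_in_locale(const char *locale_name,
                     const char *pat, size_t pat_len,
                     const char *str, size_t str_len)
{
    struct re_pattern_buffer buf;
    const char *err;
    int pos;

    if (setlocale(LC_ALL, locale_name) == NULL)
        return -3;

    /* A zeroed buffer lets re_compile_pattern() allocate what it needs. */
    memset(&buf, 0, sizeof buf);

    /* Assumption: basic POSIX syntax, roughly what sed uses by default. */
    re_set_syntax(RE_SYNTAX_POSIX_BASIC);

    err = re_compile_pattern(pat, pat_len, &buf);
    if (err != NULL)
        return -4;

    /* Try every start position from 0 up to str_len. */
    pos = re_search(&buf, str, (regoff_t)str_len, 0, (regoff_t)str_len, NULL);

    regfree(&buf);
    return pos;
}
```

Calling search_in_locale("ja_JP.eucjp", "\xb7\xbd", 2, "aaa\xb7\xef"
"a\xbf\xb7\xbd\xe8", 10) would exercise the case from this report. A result of
-1 is what a character-aware search should give, and a non-negative offset
would be consistent with the incorrect match described above. Running the same
call with three and with two leading "a"s should show whether the prefix
length matters outside of sed.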


