This is the mail archive of the gdb-patches@sourceware.org mailing list for the GDB project.



RE: [RFC] PR 15873 UTF-8 incomplete/invalid chars go unnoticed



> -----Original Message-----
> From: gdb-patches-owner@sourceware.org [mailto:gdb-patches-owner@sourceware.org] On Behalf Of Yao Qi
> Sent: Friday, September 20, 2013 05:07
> To: Pierre Muller
> Cc: gdb-patches@sourceware.org
> Subject: Re: [RFC] PR 15873 UTF-8 incomplete/invalid chars go unnoticed
> 
> On 08/21/2013 10:55 PM, Pierre Muller wrote:
> >    This test was failing a lot for mingw built GDB,
> > and while trying to understand why this was not the case on linux,
> > I noticed that the test was only completing successfully because
> > UTF-8 is the default target-charset on the linux system I tested.
> 
> Pierre,
> Yes, I saw these fails too.  I build GDB with and without libiconv,
> these fails are there.  I spent some time on it, but had no clue so far.
> Looks like it is about encoding conversion, but I don't have any
> knowledge on it.

  The problem is quite complex:
1) The test originally relied on the
"set sevenbit_strings on"
command, but the variable that this command sets
is currently ignored by the print_wchar function in valprint.c.

2) The test seems to complete without errors on Linux, but
this happens for at least two wrong reasons:
  - Linux usually uses UTF-8 as the default character set:
(gdb) show charset
The host character set is "auto; currently UTF-8".
The target character set is "auto; currently UTF-8".
The target wide character set is "auto; currently UTF-16".

and no byte value from 128 to 255 is a valid character on its own in this set.
This is why the values from 128 to 255 are displayed as
octal escapes (generating PASSes).
  But in fact these values are either the first part of a multi-byte character sequence or
invalid.
  The octal value '\340', for instance, is the start of a multi-byte character,
but current gdb gives you this output:
(gdb) p '\340'
$6 = -32 '\340'
(gdb) p "\340"
$7 = <incomplete sequence \340>
(gdb)
  This means that character printing is not consistent with string printing...

'\240' is never valid as a first byte of a multi-byte sequence,
(gdb) p '\240'
$10 = -96 '\240'
(gdb) p "\240"
$11 = "\240"
so here neither char nor string printing gives any information about the invalid
byte.
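
The distinction between '\340' and '\240' follows from the UTF-8 bit patterns. As a minimal sketch (the function name classify_utf8_byte and its return labels are made up for illustration and are not part of GDB), a single byte can be classified like this:

```python
def classify_utf8_byte(byte: int) -> str:
    """Classify one byte by the role it can play in a UTF-8 sequence.

    Hypothetical helper for illustration only; it is not GDB code.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in the range 0..255")
    if byte <= 0x7F:
        return "ascii"          # 0xxxxxxx: complete one-byte character
    if byte <= 0xBF:
        return "continuation"   # 10xxxxxx: never valid as a first byte
    if byte in (0xC0, 0xC1) or byte >= 0xF5:
        return "invalid"        # these values never appear in valid UTF-8
    if byte <= 0xDF:
        return "lead-2"         # 110xxxxx: starts a 2-byte sequence
    if byte <= 0xEF:
        return "lead-3"         # 1110xxxx: starts a 3-byte sequence
    return "lead-4"             # 11110xxx: starts a 4-byte sequence


print(classify_utf8_byte(0o340))  # the '\340' case
print(classify_utf8_byte(0o240))  # the '\240' case
```

With this classification, '\340' (0xE0, shown by gdb as -32 for a signed char) is a lead byte that is incomplete when it stands alone, while '\240' (0xA0, shown as -96) is a continuation byte and therefore invalid as a first byte.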

So another part of my patch is to:
  - report incomplete sequences for chars too;
  - report invalid sequences for both char and string printing.


To complicate things even more...
Even after my patch, the incomplete sequence marker seems to generate
a problem...
(gdb) p "ABC\240"
$1 = "ABC", <invalid sequence \240>
(gdb) p "ABC\340"
$2 = "AB", <incomplete sequence \340>
(gdb)
Note that the letter 'C' disappeared when followed by \340...
Using iconv on a similar sequence, I do see the letter 'C' before the error is reported,
but when I tried to debug this inside GDB, it seemed that
wchar_iterate_ok was only returned for the letters 'A' and 'B', followed by wchar_iterate_incomplete...
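
For comparison, here is a rough sketch of the behaviour one would expect from a converter, written with Python's standard incremental UTF-8 decoder. This is neither iconv nor GDB's wchar iterator, and split_valid_prefix is a hypothetical name; it only illustrates that the whole valid prefix "ABC" can be returned before the trailing byte is reported as incomplete or invalid.

```python
import codecs


def split_valid_prefix(data: bytes):
    """Return (decoded_prefix, status, offending_bytes) for UTF-8 input.

    status is "ok", "incomplete" (input ends inside a multi-byte
    sequence) or "invalid" (a byte that cannot appear at that position).
    Hypothetical helper for illustration only.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False: a truncated trailing sequence is buffered, not an error.
        text = decoder.decode(data, final=False)
    except UnicodeDecodeError as exc:
        # Everything before exc.start decoded cleanly.
        prefix = data[:exc.start].decode("utf-8")
        return prefix, "invalid", data[exc.start:exc.end]
    pending, _ = decoder.getstate()
    if pending:
        return text, "incomplete", pending
    return text, "ok", b""


print(split_valid_prefix(b"ABC\340"))
print(split_valid_prefix(b"ABC\240"))
```

In both calls the decoded prefix is the full "ABC"; only the status and the offending byte differ, which matches what I see with iconv and not what GDB currently prints for the \340 case.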


I hope this explains the problems a little better.

Pierre Muller



> >    The patch below adds a <invalid>/<incomplete> marker
> > to 1-byte chars that are not valid in UTF-8.
> >
> >    This means that this patch will create regressions
> > in testsuite runs, but I think that it's the
> > test that is wrong, not my patch.
> 
> What is the point of this patch?  We usually post patch to fix a certain
> bug.  If we are unable to fix them, we can open a PR (like what you
> did), and copy all the fails in the PR to track them.
> 
> --
> Yao (齐尧)

