This is the mail archive of the gdb@sourceware.org mailing list for the GDB project.

Re: printing wchar_t*

From: Vladimir Prus <ghost at cs dot msu dot su>
To: "Jim Blandy" <jimb at red-bean dot com>
Cc: gdb at sources dot redhat dot com
Date: Fri, 14 Apr 2006 11:57:14 +0400
Subject: Re: printing wchar_t*
References: <e1lsqg$aml$1@sea.gmane.org> <e1necb$gen$1@sea.gmane.org> <8f2776cb0604140029r44decd6atfa728aad53cb596d@mail.gmail.com>

On Friday 14 April 2006 11:29, Jim Blandy wrote:
> On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> > Jim Blandy wrote:
> > > On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> > >> I have a user-defined command that can produce the output I want, but
> > >> is defining a custom command the right approach?
> > >
> > > Well, you'd like wide strings to be printed properly when they appear
> > > in structures, as arguments to functions, and so on, right?  So a
> > > user-defined command isn't ideal.
> >
> > I think I'll still need to do some processing for wchar_t* on frontend
> > side. The problem is that I don't see any way how gdb can print wchar_t
> > in a way that does not require post-processing. It can print it as UTF8,
> > but then for printing char* gdb should use local 8 bit encoding, which is
> > likely to be *not* UTF8. Gdb can probably use some extra markers for
> > values: like:
> >
> >    "foo"  for string in local 8-bit encoding
> >    L"foo" for string in UTF8 encoding.
> >
> > It's also possible to use "\u" escapes.
> >
> > But then there's a problem:
> >
> >    - Do we assume that wchar_t is always UTF-16 or UTF-32?
> >    - If not:
> >      - how user can select this?
> >      - how user-specified encoding will be handled
>
> You can't hard-code assumptions about the character set into GDB.  Nor
> can you hard-code the assumption that the host and target character
> sets are the same.  GDB needs to do explicit conversions between the
> two as needed, and handle mismatches in some reasonable way.
>
> GDB already has the commands 'set host-charset' and 'set
> target-charset', so you can assume that you have accurate information
> about the character sets at hand.  They fall back to ASCII.

Good, but the target charset would need to be set separately for char* and
for wchar_t*. The first can be KOI8-R and the second can be UTF-32 in the
same program at the same time.

> > > The best approach would be to extend charset.[ch] to handle wide
> > > character sets as well, and then add code to the language-specific
> > > printing routines to use the charset functions.  (This is fortunately
> > > much simpler than adding support for multibyte characters.)
> >
> > So, for each wchar_t element, language-specific code will call
> > 'target_wchar_t_to_host', which will output a specific representation of
> > that wchar_t. Hmm, the interface there seems to assume there's a 1<->1
> > mapping between target and host characters.  This makes the L"UTF8" format
> > and the ASCII string with \u escapes format impossible, it seems.
>
> Not at all.  The current character and string printing code uses those
> routines, and it handles unprintable and invalid characters just fine.
> See, for example, host_char_print_literally, and
> c_target_char_has_backslash_escape.

Can this code produce output in UTF-8 encoding? Consider this code from c-lang.c:

  static void
  c_emit_char (int c, struct ui_file *stream, int quoter)
  {
    const char *escape;
    int host_char;

    c &= 0xFF;                  /* Avoid sign bit follies */

    escape = c_target_char_has_backslash_escape (c);
    if (escape)
      {
        if (quoter == '"' && strcmp (escape, "0") == 0)
          /* Print nulls embedded in double quoted strings as \000 to
             prevent ambiguity.  */
          fprintf_filtered (stream, "\\000");
        else
          fprintf_filtered (stream, "\\%s", escape);
      }
    else if (target_char_to_host (c, &host_char)
             && host_char_print_literally (host_char))
      {
        if (host_char == '\\' || host_char == quoter)
          fputs_filtered ("\\", stream);
        fprintf_filtered (stream, "%c", host_char);
      }
    else
      fprintf_filtered (stream, "\\%.3o", (unsigned int) c);
  }

With UTF8 host encoding, we'd want up to 6 host bytes to be output for a 
single wchar_t, without escaping. The size of 'host_char' here is 4 bytes, so 
there's no way for 'target_char_to_host' to produce 6 characters. 
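
To make the multi-byte point concrete, here is a minimal sketch of a UTF-8 encoder. The helper name utf8_encode is hypothetical and is not part of GDB. It assumes the RFC 3629 form of UTF-8, which caps a sequence at 4 bytes (the original UTF-8 definition allowed up to 6). Either way, one wide character can need more than the single byte that a "%c" of host_char emits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a GDB function: encode one Unicode code
   point CP as UTF-8 into BUF, which must hold at least 4 bytes.
   Returns the number of bytes written, or 0 if CP is a surrogate or
   lies beyond U+10FFFF (the RFC 3629 limit assumed here).  */
static size_t
utf8_encode (uint32_t cp, unsigned char *buf)
{
  if (cp < 0x80)
    {
      buf[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      buf[0] = (unsigned char) (0xC0 | (cp >> 6));
      buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* Surrogates are not valid code points.  */
      buf[0] = (unsigned char) (0xE0 | (cp >> 12));
      buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      buf[0] = (unsigned char) (0xF0 | (cp >> 18));
      buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}
```

For instance, utf8_encode (0x416, buf) writes the two bytes 0xD0 0x96, so an interface that hands back one host character per target character cannot express the result.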

> As far as 1-to-1 mappings are concerned, the only necessary property
> is that host_char_to_target and target_char_to_host be inverses, and
> return zero for characters that can't make a round trip.  The existing
> string-printing code will automatically use numeric escapes for
> characters that target_char_to_host won't translate.
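
The round-trip property described above can be illustrated with a toy pair of conversion routines. This is only a sketch under stated assumptions: the toy_ functions are hypothetical stand-ins that mirror the call shape seen in c_emit_char, the target is assumed to be ISO-8859-1 and the host plain ASCII, and a simple range check replaces host_char_print_literally. Output goes to a caller-supplied buffer rather than a ui_file.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in (not GDB code): target ISO-8859-1 to host ASCII.
   Returns non-zero and stores the host character on success; returns
   zero for characters that cannot make the round trip.  */
static int
toy_target_char_to_host (int target_char, int *host_char)
{
  if (target_char >= 0 && target_char < 0x80)
    {
      *host_char = target_char;
      return 1;
    }
  return 0;
}

/* Toy inverse of the routine above.  */
static int
toy_host_char_to_target (int host_char, int *target_char)
{
  if (host_char >= 0 && host_char < 0x80)
    {
      *target_char = host_char;
      return 1;
    }
  return 0;
}

/* Append a printable form of target character C to the NUL-terminated
   string in OUT (capacity OUTSIZE).  Characters that do not translate
   to a printable host character fall back to an octal escape, as the
   final branch of c_emit_char does.  Backslash and quote handling is
   left out to keep the sketch short.  */
static void
toy_emit_char (int c, char *out, size_t outsize)
{
  size_t len = strlen (out);
  int host_char;

  c &= 0xFF;
  if (toy_target_char_to_host (c, &host_char)
      && host_char >= 0x20 && host_char < 0x7F)
    snprintf (out + len, outsize - len, "%c", host_char);
  else
    snprintf (out + len, outsize - len, "\\%.3o", (unsigned int) c);
}
```

Emitting 'a' followed by the byte 0xE9 into an empty buffer yields the five characters a\351: the first character translates, and the second falls back to a numeric escape because the toy translation returns zero for it.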

So, assuming numeric escapes are fine with me, I'd need to:

  1. Add a way to specify encoding of wchar_t* values.
  2. Write a version of c_printstr that will handle wchar_t*. The current
     version just accesses the i-th element of the string, so it won't work
     with UTF-16.
  3. Make the new c_printstr call LA_PRINT_CHAR just like c_printstr does, and
     that will handle escapes automatically.
  4. Make sure new version of c_printstr is invoked for wchar_t* values.

Is that about right?
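
Steps 2 and 3 might look roughly like the sketch below. It is an illustration only, not GDB code: utf16_next and print_utf16_escaped are hypothetical names, the wchar_t encoding is assumed to be UTF-16 in host byte order, and the \u / \U escape spellings are one possible choice of numeric escape. The point is that the loop advances by code point rather than by array index, since a surrogate pair occupies two elements.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: decode the code point starting at S[*I] in a
   UTF-16 buffer of LEN units and advance *I past it.  An unpaired
   surrogate decodes to U+FFFD.  */
static uint32_t
utf16_next (const uint16_t *s, size_t len, size_t *i)
{
  uint32_t u = s[(*i)++];

  if (u >= 0xD800 && u <= 0xDBFF && *i < len
      && s[*i] >= 0xDC00 && s[*i] <= 0xDFFF)
    {
      uint32_t lo = s[(*i)++];
      return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
    }
  if (u >= 0xD800 && u <= 0xDFFF)
    return 0xFFFD;
  return u;
}

/* Hypothetical printer: write the UTF-16 string S (LEN units) into OUT
   (capacity OUTSIZE, at least 1) using printable ASCII where possible
   and numeric escapes otherwise.  Output stops at the last escape that
   fits completely.  */
static void
print_utf16_escaped (const uint16_t *s, size_t len,
                     char *out, size_t outsize)
{
  size_t i = 0, used = 0;

  out[0] = '\0';
  while (i < len)
    {
      uint32_t cp = utf16_next (s, len, &i);
      size_t room = outsize - used;
      int n;

      if (cp == '"' || cp == '\\')
        n = snprintf (out + used, room, "\\%c", (int) cp);
      else if (cp >= 0x20 && cp < 0x7F)
        n = snprintf (out + used, room, "%c", (int) cp);
      else if (cp < 0x10000)
        n = snprintf (out + used, room, "\\u%04X", (unsigned int) cp);
      else
        n = snprintf (out + used, room, "\\U%08X", (unsigned int) cp);

      if (n < 0 || (size_t) n >= room)
        {
          out[used] = '\0';     /* Drop the partially written escape.  */
          break;
        }
      used += (size_t) n;
    }
}
```

Given the units 'h', 'i', 0x0416, 0xD83D, 0xDE00, the sketch produces hi\u0416\U0001F600: five array elements, but only four characters.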

- Volodya
