This is the mail archive of the gdb@sourceware.org mailing list for the GDB project.
Re: printing wchar_t*
On Friday 14 April 2006 11:29, Jim Blandy wrote:
> On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> > Jim Blandy wrote:
> > > On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> > >> I have a user-defined command that can produce the output I want, but
> > >> is defining a custom command the right approach?
> > >
> > > Well, you'd like wide strings to be printed properly when they appear
> > > in structures, as arguments to functions, and so on, right? So a
> > > user-defined command isn't ideal.
> >
> > I think I'll still need to do some processing for wchar_t* on the
> > frontend side. The problem is that I don't see any way gdb can print
> > wchar_t that does not require post-processing. It can print it as UTF8,
> > but then for printing char* gdb should use the local 8-bit encoding,
> > which is likely to be *not* UTF8. Gdb can probably use some extra
> > markers for values, like:
> >
> > "foo" for string in local 8-bit encoding
> > L"foo" for string in UTF8 encoding.
> >
> > It's also possible to use "\u" escapes.
> >
> > But then there's a problem:
> >
> > - Do we assume that wchar_t is always UTF-16 or UTF-32?
> > - If not:
> >   - how can the user select this?
> >   - how will the user-specified encoding be handled?
>
> You can't hard-code assumptions about the character set into GDB. Nor
> can you hard-code the assumption that the host and target character
> sets are the same. GDB needs to do explicit conversions between the
> two as needed, and handle mismatches in some reasonable way.
>
> GDB already has the commands 'set host-charset' and 'set
> target-charset', so you can assume that you have accurate information
> about the character sets at hand. They fall back to ASCII.
Good, but you need to set target-charset separately for char* and for
wchar_t*. The first can be KOI8-R and the second can be UTF-32 in the same
program at the same time.
> > > The best approach would be to extend charset.[ch] to handle wide
> > > character sets as well, and then add code to the language-specific
> > > printing routines to use the charset functions. (This is fortunately
> > > much simpler than adding support for multibyte characters.)
> >
> > So, for each wchar_t element, language-specific code will call
> > 'target_wchar_t_to_host', which will output a specific representation
> > of that wchar_t. Hmm, the interface there seems to assume there's a
> > 1<->1 mapping between target and host characters. This makes the
> > L"UTF8" format and the ASCII string with \u escapes format impossible,
> > it seems.
>
> Not at all. The current character and string printing code uses those
> routines, and it handles unprintable and invalid characters just fine.
> See, for example, host_char_print_literally, and
> c_target_char_has_backslash_escape.
Can this code produce output in UTF-8 encoding? Consider this code from c-lang.c:
static void
c_emit_char (int c, struct ui_file *stream, int quoter)
{
  const char *escape;
  int host_char;

  c &= 0xFF;  /* Avoid sign bit follies */
  escape = c_target_char_has_backslash_escape (c);
  if (escape)
    {
      if (quoter == '"' && strcmp (escape, "0") == 0)
        /* Print nulls embedded in double quoted strings as \000 to
           prevent ambiguity.  */
        fprintf_filtered (stream, "\\000");
      else
        fprintf_filtered (stream, "\\%s", escape);
    }
  else if (target_char_to_host (c, &host_char)
           && host_char_print_literally (host_char))
    {
      if (host_char == '\\' || host_char == quoter)
        fputs_filtered ("\\", stream);
      fprintf_filtered (stream, "%c", host_char);
    }
  else
    fprintf_filtered (stream, "\\%.3o", (unsigned int) c);
}
With UTF8 host encoding, we'd want up to 6 host bytes to be output for a
single wchar_t, without escaping. The size of 'host_char' here is 4 bytes, so
there's no way for 'target_char_to_host' to produce 6 characters.
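To make the size concern concrete, here is a minimal standalone sketch (not
GDB code) of encoding one code point as UTF-8. The name utf8_encode is
hypothetical, and the sketch assumes the four-byte limit of RFC 3629 rather
than the older six-byte form. Either way, the result is a sequence of bytes,
which does not fit an interface that hands back one host character to be
printed with a single "%c".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of GDB: encode code point CP as UTF-8
   into OUT, which must have room for at least 4 bytes.  Returns the
   number of bytes written, or 0 if CP is a surrogate or lies beyond
   U+10FFFF.  */
static size_t
utf8_encode (uint32_t cp, unsigned char *out)
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;  /* Surrogates are not valid scalar values.  */
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}
```

A printing routine built on something like this would have to write a
variable number of bytes per target character, not one.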
> As far as 1-to-1 mappings are concerned, the only necessary property
> is that host_char_to_target and target_char_to_host be inverses, and
> return zero for characters that can't make a round trip. The existing
> string-printing code will automatically use numeric escapes for
> characters that target_char_to_host won't translate.
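The round-trip rule and the escape fallback described above can be sketched
in isolation. Everything named toy_* below is a stand-in invented for
illustration, not GDB's charset code. The sketch assumes an ISO-8859-1
target and an ASCII host, so only characters 0..127 translate, and it leaves
out the backslash and quote handling that c_emit_char does.

```c
#include <stdio.h>
#include <string.h>

/* Toy stand-ins, NOT GDB's functions.  Each returns nonzero on
   success and zero if the character cannot make a round trip.  */
static int
toy_target_char_to_host (int target_char, int *host_char)
{
  if (target_char < 0 || target_char > 127)
    return 0;
  *host_char = target_char;
  return 1;
}

static int
toy_host_char_to_target (int host_char, int *target_char)
{
  if (host_char < 0 || host_char > 127)
    return 0;
  *target_char = host_char;
  return 1;
}

/* Append the printed form of target character C to the NUL-terminated
   string in BUF (capacity SIZE): the host character if it translates
   and is printable ASCII, otherwise a three-digit octal escape.  */
static void
toy_emit_char (int c, char *buf, size_t size)
{
  int host_char;
  size_t len = strlen (buf);

  c &= 0xFF;
  if (toy_target_char_to_host (c, &host_char)
      && host_char >= 0x20 && host_char < 0x7F)
    snprintf (buf + len, size - len, "%c", host_char);
  else
    snprintf (buf + len, size - len, "\\%.3o", (unsigned int) c);
}
```

With these toy mappings, the target byte 0xE9 has no host equivalent, so it
comes out as the numeric escape \351 rather than being dropped.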
So, assuming numeric escapes are fine with me, I'd need to:
1. Add a way to specify the encoding of wchar_t* values.
2. Write a version of c_printstr that will handle wchar_t*. The current
   version just accesses the i-th element of the string, so it won't work
   with UTF-16.
3. Make the new c_printstr call LA_PRINT_CHAR just as the existing
   c_printstr does, which will handle escapes automatically.
4. Make sure the new version of c_printstr is invoked for wchar_t* values.
Is that about right?
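
To illustrate why step 2 cannot simply index the i-th element, here is a
minimal sketch of walking a UTF-16 string one code point at a time. The
function utf16_next is hypothetical, not an existing GDB routine, and it
assumes the code units have already been read from the target into host
byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not GDB code: decode one code point from the
   UTF-16 code units S[*I .. LEN-1] and advance *I past it.  The caller
   must ensure *I < LEN.  An unpaired surrogate is returned unchanged,
   so a caller could still print it as a numeric escape.  */
static uint32_t
utf16_next (const uint16_t *s, size_t len, size_t *i)
{
  uint32_t unit = s[(*i)++];

  if (unit >= 0xD800 && unit <= 0xDBFF && *i < len)
    {
      uint32_t low = s[*i];

      if (low >= 0xDC00 && low <= 0xDFFF)
        {
          /* A high/low surrogate pair encodes one code point
             above U+FFFF.  */
          (*i)++;
          return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        }
    }
  return unit;
}
```

The index advances by one or two code units per character, so the number of
characters printed and the number of array elements consumed can differ.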
- Volodya