This is the mail archive of the gdb@sourceware.org mailing list for the GDB project.

Re: printing wchar_t*

From: "Jim Blandy" <jimb at red-bean dot com>
To: "Eli Zaretskii" <eliz at gnu dot org>
Cc: ghost at cs dot msu dot su, gdb at sources dot redhat dot com
Date: Fri, 14 Apr 2006 12:16:36 -0700
Subject: Re: printing wchar_t*
References: <e1lsqg$aml$1@sea.gmane.org> <200604141257.41690.ghost@cs.msu.su> <uu08w1cnf.fsf@gnu.org> <200604141837.26618.ghost@cs.msu.su> <uirpc19u8.fsf@gnu.org> <8f2776cb0604141053v73e512e3o2d1c9086312316bd@mail.gmail.com> <ubqv4108c.fsf@gnu.org>

On 4/14/06, Eli Zaretskii <eliz@gnu.org> wrote:
> > Suppose we have a wide string where wchar_t values are Unicode code
> > points.  Suppose our host character set is plain ASCII.  Suppose the
> > user's program has a string containing the digits '123', followed by
> > some funky Tibetan characters U+0F04 U+0FCC, followed by the letters
> > 'xyz'.  When asked to print that string, GDB should print the
> > following twenty-one ASCII characters:
> >
> > L"123\x0f04\x0fccxyz"
>
> This will work, if we accept your assumptions (which are by no means
> universally correct, e.g. parts of our discussion were around whether
> the string contains U+XXXX Unicode codepoints or their UTF-16
> encodings).  But all you did is invent an encoding (and a
> variable-size encoding at that).  Something in the GUI FE still has to
> interpret that encoding, i.e. convert it back to binary representation
> of the characters, because your encoding cannot be displayed by any
> known GUI API.

The command line and MI already use the ISO C syntax for conveying
values to the user/consumer.  I'm just saying we should expand our use
of the syntax we already use.

I posited that the target character set was Unicode, but the same
mechanism will work no matter what character set and encoding the
target uses.  No matter what string appears on the target, there is
always a source-language representation for that string.  According to
ISO C, the \x escapes specify char or wchar_t values in the target
character set.  So you can always write whatever you've got.
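
As a rough illustration of that mechanism, here is a minimal sketch of how a list of target wchar_t values could be rendered as an ISO C wide string literal when the host character set is plain ASCII. It is not GDB's code: the function name `format_wide_literal` is hypothetical, and the sketch assumes that target values in the printable ASCII range mean the same characters on the host. One detail it has to handle is that a \x escape absorbs every hex digit that follows it, so the literal is split (adjacent literals are concatenated in C) when a hex digit comes right after an escape.

```python
# Hypothetical sketch: render target wchar_t values as an ISO C wide
# string literal for a plain-ASCII host. Not taken from GDB's sources.

HEX_DIGITS = set("0123456789abcdefABCDEF")

# Simple escapes that ISO C defines for control and quoting characters.
SIMPLE_ESCAPES = {
    0x07: "\\a", 0x08: "\\b", 0x0C: "\\f", 0x0A: "\\n",
    0x0D: "\\r", 0x09: "\\t", 0x0B: "\\v",
    ord("\\"): "\\\\", ord('"'): '\\"',
}


def format_wide_literal(values):
    """Return an ASCII-only wide string literal for the given values."""
    out = []
    after_hex = False  # True when the previous item was a \x escape
    for value in values:
        if value in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[value])
            after_hex = False
        elif 0x20 <= value <= 0x7E:
            ch = chr(value)
            if after_hex and ch in HEX_DIGITS:
                # Close and reopen the literal so the preceding \x
                # escape does not swallow this character.
                out.append('" L"')
            out.append(ch)
            after_hex = False
        else:
            # Anything the ASCII host cannot show becomes a hex escape
            # carrying the raw target value.
            out.append("\\x%04x" % value)
            after_hex = True
    return 'L"' + "".join(out) + '"'


example = [0x31, 0x32, 0x33, 0x0F04, 0x0FCC, 0x78, 0x79, 0x7A]
print(format_wide_literal(example))
```

For the string discussed in this thread, the sketch produces the same twenty-one characters quoted above.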

> Compare this with the facility that we already have today:
>
>  (gdb) print *warray@8
>   {0x0031, 0x0032, 0x0033, 0x0F04, 0x0FCC, 0x0078, 0x0079, 0x007A}
>
> Except for using up 60-odd characters where you used 21, this is IMHO
> better, since it doesn't require any code on the FE side: just convert
> the strings to integers, and you've got Unicode, ready to be used for
> whatever purposes.

If you're printing an expression that evaluates to a string, sure.
But what if you're printing a value of type struct { wchar_t *key;
wchar_t *value; }?  What if you're using -stack-list-arguments to show
values in a stack frame?

My point is, MI consumers are already parsing ISO C strings.  They
just need to parse more of them.

> > Since this is a valid way to write that string in a source program, a
> > user at the GDB command line should understand it.  Since consumers of
> > MI information must contain parsers for C values already, they can
> > reliably find the contents of the string.
>
> I only partly agree with the first sentence, and not at all with the
> second.
>
> For the interactive user, understanding non-ASCII strings in the
> suggested ASCII encoding might not be easy at all.  For example, for
> all my knowledge of Hebrew, if someone shows me \x05D2, I will have
> a hard time recognizing the letter Gimel.

If the host character set includes Gimel, then GDB won't print it with
a hex escape.

> As for the second sentence, ``reliably find the contents of the
> string'' there obviously doesn't consider the complexities of handling
> wide characters.  In my experience, for any non-trivial string
> processing, working with variable-size encoding is much harder than
> with fixed-size wchar_t arrays, because you need to interpret the
> bytes as you go, even if all you need is to find the n-th character.
> Even the simple task of computing the number of characters in the
> string becomes complicated.

I don't understand what you mean.  The rules for parsing ISO C string
literals into arrays of chars and wide string literals into arrays of
wide characters are straightforward.
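
To make the consumer side concrete, here is a minimal sketch of how a front end might turn one wide string literal back into an array of integer values. The function name `parse_wide_literal` is hypothetical, and the sketch makes simplifying assumptions: the input is a single literal (no adjacent-literal concatenation), and any unescaped character is taken at its host code value, which only matches the target value when the two character sets agree on that character.

```python
# Hypothetical sketch of a front-end parser for ISO C wide string
# literals. It returns one integer per wide character.

SIMPLE = {"a": 7, "b": 8, "f": 12, "n": 10, "r": 13, "t": 9, "v": 11,
          "\\": 92, '"': 34, "'": 39, "?": 63}
HEX = "0123456789abcdefABCDEF"
OCT = "01234567"


def parse_wide_literal(text):
    """Decode a literal such as L"ab\\x0f04" into a list of ints."""
    if len(text) < 3 or not text.startswith('L"') or not text.endswith('"'):
        raise ValueError("not a wide string literal: %r" % text)
    body = text[2:-1]
    values = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            # Assumption: host and target agree on this character.
            values.append(ord(ch))
            i += 1
            continue
        i += 1
        if i >= len(body):
            raise ValueError("dangling backslash")
        esc = body[i]
        if esc == "x":
            # A hex escape runs for as many hex digits as follow it.
            j = i + 1
            while j < len(body) and body[j] in HEX:
                j += 1
            if j == i + 1:
                raise ValueError("\\x with no hex digits")
            values.append(int(body[i + 1:j], 16))
            i = j
        elif esc in OCT:
            # An octal escape is one to three octal digits.
            j = i
            while j < len(body) and j < i + 3 and body[j] in OCT:
                j += 1
            values.append(int(body[i:j], 8))
            i = j
        elif esc in SIMPLE:
            values.append(SIMPLE[esc])
            i += 1
        else:
            raise ValueError("unknown escape: \\%s" % esc)
    return values


print(parse_wide_literal('L"123\\x0f04\\x0fccxyz"'))
```

Once the literal is decoded, the result is a fixed-size array, so finding the n-th character or counting characters is ordinary indexing and `len`.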

> What you are suggesting is simple for GDB, but IMHO leaves too much
> complexity to the FE.  I think GDB could do better.  In particular, if
> I'm sitting at a UTF-8 enabled xterm, I'd be grateful if GDB would
> show me Unicode characters in their normal glyphs, which would require
> GDB to output the characters in their UTF-8 encoding (which the
> terminal will then display in human-readable form).  Your suggestion
> doesn't allow such a feature, AFAICS, at least not for CLI users.

When the host character set contains a character, there's no need for
GDB to use an escape to show it.
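
A minimal sketch of that per-character decision, assuming (as posited earlier in this message) that target values are Unicode code points: the character is emitted directly when the host encoding can represent it, and as a hex escape otherwise. The function name `show_char` is hypothetical, and the host encodings are just Python codec names used for illustration; quoting of backslashes and double quotes is left out to keep the sketch short.

```python
# Hypothetical sketch: choose between a literal character and a hex
# escape depending on the host character set.

def show_char(value, host_encoding):
    """Return the text to print for one non-negative target value."""
    escape = "\\x%04x" % value
    try:
        ch = chr(value)
        ch.encode(host_encoding)  # raises if the host cannot represent it
    except (ValueError, OverflowError, UnicodeEncodeError):
        return escape
    # Control characters are representable but not usefully displayable.
    return ch if ch.isprintable() else escape


# Hebrew letter Gimel on three different hosts.
for enc in ("ascii", "iso8859-8", "utf-8"):
    print(enc, show_char(0x05D2, enc))
```

On an ASCII host Gimel comes out as an escape, while on an ISO 8859-8 or UTF-8 host it is emitted as the character itself, encoded by whatever writes the output to the terminal.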

> If wchar_t uses fixed-size characters, not their variable-size
> encodings, then specifying the CCS will do.

There is no provision in ISO C for variable-size wchar_t encodings. 
The portion of the standard I referred to says that wchar_t "...is an
integer type whose range of values can represent distinct codes for
all members of the largest extended character set specified among the
supported locales".
