This is the mail archive of the gdb@sourceware.org mailing list for the GDB project.



Re: printing wchar_t*


> From: Vladimir Prus <ghost@cs.msu.su>
> Date: Fri, 14 Apr 2006 18:37:25 +0400
> Cc: gdb@sources.redhat.com
> 
> > Now, the same letter ``small a'' can be encoded in several other ways:
> > for example, its ISO-2022-7bit encoding is 0x1B 0x24 0x2C 0x31 0x28
> > 0x50, its KOI8-r encoding is 0xC1, its ISO-8859-5 encoding is 0xD0,
> > etc.  It should be obvious that, of all the encodings, only the
> > fixed-length ones can be used in a wchar_t array (because wchar_t
> > arrays are stateless, 
> 
> I don't think this statement is backed up by anything.
> 
> > This is why I said that wchar_t is not used for an encoding (such as
> > ISO-8859-5 or UTF-8 or UTF-16), but for characters' codepoints.  It is
> > nowadays almost universally accepted that wchar_t is a Unicode
> > codepoint, 
> 
> Again, can you provide any specific pointers to support that view?

I think Robert and I have already explained that in later messages.
Feel free to ask specific questions if something is still unclear.

> I believe that on Windows:
> 
> - wchar_t is 16-bit
> - wchar_t* values are supposed to be in UTF-16 encoding
> (see
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_9i79.asp)
> 
> Do you disagree with any of the above statements?

wchar_t is just an integer type.  You can stuff _anything_ into an
integer array, but if you put UTF-16 there, each element is no longer
a character; it is one of the 16-bit integers (one or two of them)
that together encode a character.  In other words, it's a variant of
multibyte strings, except that each element is 16 bits wide.

Now, I know that Windows holds UTF-16 code units in wchar_t arrays,
but those are not the same thing as L"foo" strings of wide characters.
In the L"foo" notation, each of the 3 string characters _always_
occupies exactly one wchar_t element, and L"foo"[1] is _always_ the
second character of the string.  This is not true for UTF-16, as I
hope is clear from this discussion.  In UTF-16, array[1] is the second
16-bit value of the encoded string, and the character it belongs to
could need more than one 16-bit value.
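
To make the indexing point concrete, here is a minimal sketch in C.
It is an illustration only, not GDB code: the function names are
hypothetical, and it uses uint16_t rather than wchar_t so that the
element width does not depend on the platform.  It finds the array
index at which the k-th character of a UTF-16 sequence starts.

```c
#include <stddef.h>
#include <stdint.h>

/* True if u is the first half of a UTF-16 surrogate pair. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* True if u is the second half of a UTF-16 surrogate pair. */
static int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Return the index of the code unit at which the k-th character
   (0-based) of the n-unit UTF-16 sequence s starts, or n if the
   sequence holds fewer than k+1 characters.  An unpaired surrogate
   is counted as one character of its own (an assumption made for
   this sketch).  */
static size_t utf16_char_start(const uint16_t *s, size_t n, size_t k)
{
    size_t i = 0;

    while (i < n && k > 0) {
        if (is_high_surrogate(s[i]) && i + 1 < n && is_low_surrogate(s[i + 1]))
            i += 2;   /* a surrogate pair: two units, one character */
        else
            i += 1;   /* a single unit that is a whole character */
        k--;
    }
    return i;
}
```

For the three-unit sequence 0xD834 0xDD1E 0x0061 (U+1D11E followed by
the letter a), the second character starts at index 2, and the element
at index 1 is a low surrogate, not a character.  For the units of
"foo", the second character starts at index 1, as it does for L"foo".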

> If not, then it directly 
> follows that a given wchar_t is not a Unicode code point, but a code unit in 
> specific representation (UTF-16), and a given code point takes either one or 
> two code units, that is either one or two wchar_t. This is contrary to your 
> statement that wchar_t is a single code point.

My statement was based on the assumption that you are coding for a
system where wchar_t is used for complete characters, not for UTF-16
strings.  Only in that case can you talk about ``wide characters''
and about wchar_t being a character.  In UTF-16, an arbitrary element
of the array might not be a complete character.
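
As a rough sketch of what printing such an array would involve,
the C function below decodes the character that starts at a given
element.  The name utf16_decode_at and its return convention are
assumptions made for illustration; this is not an existing GDB or
libc interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the character starting at s[i] in an n-unit UTF-16 sequence.
   On success, store the code point in *cp and return the number of
   code units consumed (1 or 2).  Return 0 if i is out of range or if
   s[i] does not start a complete character (a lone low surrogate, or
   a high surrogate with no low surrogate after it).  */
static size_t utf16_decode_at(const uint16_t *s, size_t n, size_t i,
                              uint32_t *cp)
{
    if (i >= n)
        return 0;

    uint16_t u = s[i];

    /* Outside the surrogate range, one unit is one code point. */
    if (u < 0xD800 || u > 0xDFFF) {
        *cp = u;
        return 1;
    }

    /* A high surrogate must be followed by a low surrogate. */
    if (u <= 0xDBFF && i + 1 < n
        && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
        *cp = 0x10000
              + (((uint32_t)(u - 0xD800) << 10)
                 | (uint32_t)(s[i + 1] - 0xDC00));
        return 2;
    }

    return 0;
}
```

With the sequence 0xD834 0xDD1E 0x0061, decoding at index 0 consumes
two units and yields U+1D11E, decoding at index 1 fails because that
element is only half of a character, and decoding at index 2 consumes
one unit and yields U+0061.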

> Anyway, this is quickly getting off-topic for gdb list, so maybe we should 
> bring this somewhere else.

It _is_ on topic, IMHO, as long as we discuss features to be added to
GDB.

