This is the mail archive of the cygwin-apps@cygwin.com mailing list for the Cygwin project.


RE: libgetopt++ and setup and libstdc++

From: "Robert Collins" <robert dot collins at itdomain dot com dot au>
To: "Gary R. Van Sickle" <g dot r dot vansickle at worldnet dot att dot net>,"Cygwin-Apps" <cygwin-apps at cygwin dot com>
Date: Wed, 1 May 2002 18:22:03 +1000
Subject: RE: libgetopt++ and setup and libstdc++



> -----Original Message-----
> From: Gary R. Van Sickle [mailto:g.r.vansickle@worldnet.att.net] 
> Sent: Monday, April 29, 2002 5:39 AM

> > Except that widechar != unicode. WCHAR is still a 0-terminated 
> > string, but Unicode strings are not 0-terminated.
> 
> Sure they are.  A Unicode '\0' == 0x0000 (regardless of your 
> byte order ;-)). 

Read http://www.unicode.org/unicode/uni2book/ch05.pdf section 5.2.
Also read http://www.unicode.org/unicode/uni2book/ch02.pdf which does
note that NUL (U+0000) can be used as a string terminator.

Then http://www.unicode.org/unicode/reports/tr17/
"C and C++ char* APIs use serialized bytes, which could represent a
variety of different character maps, including ISO Latin 1, UTF-8,
Windows 1252, as well as compound character maps such as Shift-JIS or
2022-JP. A byte API could also handle UTF-16BE or UTF-16LE, which are
serialized forms of Unicode. However, these APIs must be allow for the
existence of any byte value, and typically use memcpy plus length
instead of strcpy for manipulating strings." (which is possibly
referring to a non-wchar_t aware strcpy, not sure here).
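
As a rough illustration of the quoted point (a sketch only; the buffer
and the function names below are made up for this example and are not
from setup or libgetopt++): when UTF-16LE is handed to a char* API as
serialized bytes, zero bytes show up inside ordinary characters, so
strlen-style code stops early, while a copy with an explicit length
keeps everything.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// "Hi" serialized as UTF-16LE: 'H' = 48 00, 'i' = 69 00,
// followed by a two-byte terminator (00 00).
static const char utf16le_hi[] = { 0x48, 0x00, 0x69, 0x00, 0x00, 0x00 };
static const std::size_t utf16le_hi_len = 4;  // payload bytes, terminator excluded

// Length as a NUL-terminated char* API sees it: it stops at the
// zero byte that is the second half of 'H'.
std::size_t length_seen_by_strlen() {
    return std::strlen(utf16le_hi);
}

// Copy with an explicit length (the "memcpy plus length" approach),
// which preserves the embedded zero bytes.
std::string copy_with_length() {
    return std::string(utf16le_hi, utf16le_hi_len);
}

// Small self-check showing the difference.
void self_check_utf16le_bytes() {
    assert(length_seen_by_strlen() == 1);
    assert(copy_with_length().size() == 4);
}
```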

Anyway, things like UTF-8 can confuse the heck out of c-libraries
because of their multi-byte nature, where
a) a NUL may be part way through a character, not terminating, and
b) a NUL may be illegal at a given point, and the previous partial
character is invalid.

Finally, note that Unicode requires 21 bits of storage, so a 16-bit WCHAR
will still need multi-unit sequences for some characters.
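
To make the 21-bit point concrete, here is a minimal sketch of how a
code point above U+FFFF ends up as two 16-bit units (a surrogate pair)
in UTF-16. The function name encode_utf16 is hypothetical, and the
sketch assumes the caller passes a valid code point (no check for
values above U+10FFFF or for lone surrogates).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Assumption: cp is a valid scalar value (not validated here).
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit unit.
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Subtract 0x10000, leaving 20 bits split across two units.
        std::uint32_t v = cp - 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}

// Small self-check: 'A' needs one unit, U+1D11E needs two.
void self_check_utf16() {
    assert(encode_utf16(0x41).size() == 1);
    std::vector<std::uint16_t> clef = encode_utf16(0x1D11E);
    assert(clef.size() == 2);
    assert(clef[0] == 0xD834 && clef[1] == 0xDD1E);
}
```

So code that treats one WCHAR as one character is wrong for anything
outside the first 65536 code points.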

Do newlib, libgcc and libstdc++'s string<WCHAR> correctly understand
Unicode, and what representation do they use? Is it the same one that
Win32 WCHAR uses?

> > (See the NT kernel defines for
> > UNICODE_STRING to see how unicode strings are represented.).
> >
> 
> Right, but we're not in Kernel space here (thank the three 
> men I admire most, the Father, Son, and the Holy Ghost!).  
> The UNICODE Win32 API takes 
> null-terminated UNICODE strings as parameters.

UNICODE isn't a storage format. 8-bit strings can represent Unicode as
well.... (see above).
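
For instance, UTF-8 stores Unicode in plain 8-bit chars. A minimal
sketch of an encoder follows; the name encode_utf8 is made up for this
example, and it assumes a valid code point (no range or surrogate
checks).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence in a std::string.
// Assumption: cp is a valid scalar value (not validated here).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Small self-check: U+00E9 takes two bytes, U+20AC takes three.
void self_check_utf8() {
    assert(encode_utf8(0xE9) == "\xC3\xA9");
    assert(encode_utf8(0x20AC) == "\xE2\x82\xAC");
}
```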
 

> Oh, I completely agree.  I'm just saying that we could 
> probably base it on basic_string< TCHAR > and be one step 
> ahead of the future.

Well, the encapsulation makes such future changes (relatively) painless,
and I'm not convinced (yet) that wchar_t or <TCHAR> are appropriate.

Rob

Follow-Ups:
- Re[2]: libgetopt++ and setup and libstdc++
  - From: Pavel Tsekov
