This is the mail archive of the
cygwin-apps@cygwin.com
mailing list for the Cygwin project.
Re[2]: libgetopt++ and setup and libstdc++
- From: Pavel Tsekov <ptsekov at syntrex dot com>
- To: "Robert Collins" <robert dot collins at itdomain dot com dot au>
- Cc: "Gary R. Van Sickle" <g dot r dot vansickle at worldnet dot att dot net>, "Cygwin-Apps" <cygwin-apps at cygwin dot com>
- Date: Wed, 1 May 2002 10:53:05 +0200
- Subject: Re[2]: libgetopt++ and setup and libstdc++
- Organization: Syntrex, Inc.
- References: <FC169E059D1A0442A04C40F86D9BA7600C5F64@itdomain003.itdomain.net.au>
- Reply-to: Pavel Tsekov <ptsekov at syntrex dot com>
Hello Robert,
Wednesday, May 01, 2002, 10:22:03 AM, you wrote:
>> -----Original Message-----
>> From: Gary R. Van Sickle [mailto:g.r.vansickle@worldnet.att.net]
>> Sent: Monday, April 29, 2002 5:39 AM
>> > Except that widechar != unicode. WCHAR is still a 0-terminated
>> > string, but Unicode strings are not 0-terminated.
>>
>> Sure they are. A Unicode '\0' == 0x0000 (regardless of your
>> byte order ;-)).
>>
Zero-terminated strings (C-style strings) have nothing to do with the
basic_string template class. A basic_string can contain any character,
including \0. It's much the same as the STL vector. The WCHAR here
only specifies the storage size of a single character...
I.e. you can have typedef basic_string<struct SomeStrangeChar>
SomeStrangeCharString;
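
A minimal sketch of the point about embedded zeros. The helper names
(make_embedded_nul, container_length, c_style_length) are made up for
this illustration and are not part of setup or libgetopt++. It uses
std::wstring, i.e. basic_string<wchar_t>, rather than a custom character
type, since a custom type would also need its own char_traits.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Build a wide string that contains an embedded L'\0'.
// The (pointer, count) constructor copies exactly `count` characters
// and does not stop at the first zero.
inline std::wstring make_embedded_nul() {
    const wchar_t raw[] = L"ab\0cd";   // 5 characters plus the implicit terminator
    return std::wstring(raw, 5);
}

// Length as the container tracks it (stored explicitly, no scanning).
inline std::size_t container_length(const std::wstring& s) {
    return s.size();
}

// Length as a C-style API sees it: scanning stops at the first L'\0'.
inline std::size_t c_style_length(const std::wstring& s) {
    return std::wcslen(s.c_str());
}
```

The container keeps all five characters, while anything that treats
c_str() as a C string only sees the part before the first zero.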
RC> Read http://www.unicode.org/unicode/uni2book/ch05.pdf section 5.2.
RC> Also read http://www.unicode.org/unicode/uni2book/ch02.pdf which does
RC> note that nul(U+0000) can be used as a string terminator.
RC> Then http://www.unicode.org/unicode/reports/tr17/
RC> "C and C++ char* APIs use serialized bytes, which could represent a
RC> variety of different character maps, including ISO Latin 1, UTF-8,
RC> Windows 1252, as well as compound character maps such as Shift-JIS or
RC> 2022-JP. A byte API could also handle UTF-16BE or UTF-16LE, which are
RC> serialized forms of Unicode. However, these APIs must allow for the
RC> existence of any byte value, and typically use memcpy plus length
RC> instead of strcpy for manipulating strings." (which is possibly
RC> referring to a non-wchar_t aware strcpy, not sure here).
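
A rough illustration of why the quoted passage recommends memcpy plus a
length for byte APIs: UTF-16LE text in the ASCII range contains zero
bytes, so a NUL-scanning routine stops early. The names kHiUtf16le,
scanned_length and copy_with_length are hypothetical, chosen only for
this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// "Hi" serialized as UTF-16LE: two bytes per code unit, low byte first.
// For ASCII-range characters the high byte is 0x00.
static const char kHiUtf16le[] = { 'H', 0x00, 'i', 0x00 };
static const std::size_t kHiUtf16leSize = sizeof kHiUtf16le;   // 4 bytes

// What a NUL-scanning routine reports: it stops at the first zero byte,
// which here is the high byte of the first code unit.
inline std::size_t scanned_length() {
    return std::strlen(kHiUtf16le);
}

// Copy with an explicit byte count, so zero bytes are preserved.
inline std::vector<char> copy_with_length() {
    std::vector<char> out(kHiUtf16leSize);
    std::memcpy(out.data(), kHiUtf16le, kHiUtf16leSize);
    return out;
}
```

The scan sees one byte out of four; the length-based copy keeps all of
them.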
RC> Anyway, things like UTF-8 can confuse the heck out of c-libraries
RC> because of their multi-byte nature, where
RC> a) a NULL may be part way through a character, not terminating, and
RC> b) a NULL may be illegal at a given point, and the previous partial
RC> character is invalid.
RC> Finally, note that Unicode requires 21 bits of storage, so a 16 bit WCHAR
RC> will still involve multi-unit sequences.
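
To make the 21-bit remark concrete, here is a small sketch of how a code
point above U+FFFF is split into two 16-bit units (a UTF-16 surrogate
pair). The function name encode_utf16 is an assumption for this example;
it is not an existing newlib, libstdc++ or Win32 call.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Returns one unit for the Basic Multilingual Plane, two units (a
// surrogate pair) for U+10000..U+10FFFF, and an empty vector for
// values that are not valid code points for encoding.
inline std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    std::vector<std::uint16_t> out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;                       // out of range or a lone surrogate
    }
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));
        return out;
    }
    const std::uint32_t v = cp - 0x10000; // 20 significant bits remain
    out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
    out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    return out;
}
```

For example, U+0041 fits in a single unit, while U+1F600 becomes the
pair 0xD83D 0xDE00, so code that assumes one 16-bit unit per character
miscounts it.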
Quote from "The C++ Programming Language":
"A wide character - that is, an object of type wchar_t (§4.3) - is
like a char, except that it takes up two or more bytes."
RC> Does the newlib && lib-gcc and libstdc++ string <WCHAR> correctly
RC> understand unicode (and what representation does it use?). Does it use
RC> the same as Win32 WCHAR does?
>> > (See the NT kernel defines for
>> > UNICODE_STRING to see how unicode strings are represented.).
Btw, I read somewhere else that Windows does not support the full
Japanese character set, only the most commonly used characters.