This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.



Re: codeset problems in wprintf and wcsftime


On Feb 23 16:14, Jeff Johnston wrote:
> On 20/02/10 10:59 AM, Corinna Vinschen wrote:
> >AFAICS, there are two possible approaches to fix this problem:
> >
> >- Store the charset not only for LC_CTYPE, but for each localization
> >   category, and provide a function to request the charset.
> >   This also requires storing the associated multibyte-to-widechar
> >   conversion functions, obviously, and calling the correct functions
> >   from wprintf and wcsftime.
> >
> >- Redefine the locale data structs so that they contain multibyte and
> >   widechar representations of all strings.  Use the multibyte strings
> >   in the multibyte functions, the widechar strings in the widechar
> >   functions.
> >
> 
> This assumes that widechar representations from separate mbtowc
> converters can be concatenated and be decoded by a single wctomb
> converter.  Without this ability, the resulting concatenated widechar
> string is of no use to anybody unless they know where the charset
> changes occur.
> 
> IMO, this is "undefined behaviour".

I don't understand.  The wide char representation is Unicode.  Why
should it be a problem to use Unicode strings together, just because
they are from different sources?  Even if wchar_t is UTF-16, as on
Cygwin, the strings are complete.  There's no such thing as just one
half of a surrogate.

The advantage of having the strings available in wchar_t representation
would be that the wcsftime and wprintf functions don't have to worry
about charsets at all.  In contrast, the current solution requires a
conversion from multibyte, which means you have to *know* which source
charset was used when creating these strings.  Right
now they only have information about one charset, which is the LC_CTYPE
charset.

In Glibc, as well as on Windows, the localization strings are stored in
Unicode on disk, and Glibc keeps them internally in both multibyte and
wchar_t representation.  When Cygwin fetches the strings from Windows,
it has to convert them to multibyte since there is no wchar_t slot for
the data, and, following POSIX, it has to store them in the charset
given for the locale category, LC_TIME, LC_MESSAGES, etc.

> I think one could optionally flag an error either in the setlocale
> routine or the wprintf routines themselves.

Well, if the conversion doesn't work, vfwprintf just falls back to the
defaults for the C locale and switches off grouping.  That's probably
the sanest thing to do.
If wcsftime fails to convert the format string it returns 0, which is
the defined error behaviour.  In case of the new era and alt_digits
strings (http://sourceware.org/ml/newlib/2010/msg00153.html), it will
fail to store the era and alt_digits information and fall back to the
default behaviour:  %EC -> %C, %EY -> %Y, %OH -> %H, etc.

That's probably ok, given the POSIX.1-2008 quote from Andy in
http://sourceware.org/ml/newlib/2010/msg00146.html.
I just hoped we could do better.


Corinna

-- 
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat

