This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.


Re: codeset problems in wprintf and wcsftime

From: Jeff Johnston <jjohnstn at redhat dot com>
To: newlib at sourceware dot org
Date: Wed, 24 Feb 2010 17:10:00 -0500
Subject: Re: codeset problems in wprintf and wcsftime
References: <20100220155935.GK5683@calimero.vinschen.de> <4B84452D.9040508@redhat.com> <20100224091734.GA24997@calimero.vinschen.de> <4B859772.6070201@redhat.com>

On 24/02/10 04:17 PM, Jeff Johnston wrote:

On 24/02/10 04:17 AM, Corinna Vinschen wrote:

On Feb 23 16:14, Jeff Johnston wrote:

On 20/02/10 10:59 AM, Corinna Vinschen wrote:

AFAICS, there are two possible approaches to fix this problem:

- Store the charset not only for LC_CTYPE, but for each localization
category, and provide a function to request the charset.
This also requires storing the associated multibyte to widechar
conversion functions, obviously, and calling the correct functions
from wprintf and wcsftime.

- Redefine the locale data structs so that they contain multibyte and
widechar representations of all strings. Use the multibyte strings
in the multibyte functions, the widechar strings in the widechar
functions.
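
As a rough illustration of the second approach, the sketch below keeps each
localized string in both forms. The names (dual_string, time_strings_sketch,
am_pm_mb, am_pm_wc) are hypothetical and are not newlib's actual locale
structures; only two LC_TIME-style fields are shown.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical layout, not newlib's real locale struct: every localized
   string is stored twice, once per representation. */
struct dual_string {
    const char    *mb;  /* multibyte form, in the category's charset */
    const wchar_t *wc;  /* wide-character form, independent of charset */
};

struct time_strings_sketch {
    struct dual_string am_pm[2];
    struct dual_string date_fmt;
};

/* Example data for the "C" locale, where both forms are plain ASCII. */
static const struct time_strings_sketch c_time = {
    { { "AM", L"AM" }, { "PM", L"PM" } },
    { "%m/%d/%y", L"%m/%d/%y" }
};

/* A multibyte caller (strftime-like) reads only the .mb member. */
static const char *am_pm_mb(const struct time_strings_sketch *t, int hour)
{
    assert(t != NULL);
    return t->am_pm[hour >= 12].mb;
}

/* A wide caller (wcsftime-like) reads only the .wc member, so it never
   needs to know which charset the multibyte form was stored in. */
static const wchar_t *am_pm_wc(const struct time_strings_sketch *t, int hour)
{
    assert(t != NULL);
    return t->am_pm[hour >= 12].wc;
}
```

The cost of this layout is roughly doubled string storage per category; the
benefit is that the wide functions need no conversion step at all.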


This assumes that widechar representations from separate mbtowc
converters can be concatenated and be decoded by a single wctomb
converter. Without this ability, the concatenated widechar string
derived is of no use to anybody unless they know where the charset
changes occur.

IMO, this is "undefined behaviour".


I don't understand. The wide char representation is Unicode. Why
should it be a problem to use Unicode strings together, just because
they are from different sources? Even if wchar_t is UTF-16, as on
Cygwin, the strings are complete. There's no such thing as just one
half of a surrogate.
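
The point about complete strings can be sketched as follows. This is a
standalone illustration, not newlib code: it uses uint16_t units rather than
wchar_t so that it behaves the same regardless of wchar_t width, and the
helper names (utf16_well_formed, utf16_concat) are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Returns 1 if every surrogate in s[0..n) is part of a high/low pair. */
static int utf16_well_formed(const uint16_t *s, size_t n)
{
    size_t i;
    assert(s != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (is_high_surrogate(s[i])) {
            if (i + 1 >= n || !is_low_surrogate(s[i + 1]))
                return 0;           /* high half without its low half */
            i++;                    /* skip the low half of the pair */
        } else if (is_low_surrogate(s[i])) {
            return 0;               /* low half without a preceding high half */
        }
    }
    return 1;
}

/* Appends b after a into dst (which must hold na + nb units);
   returns the total number of units written. */
static size_t utf16_concat(uint16_t *dst, const uint16_t *a, size_t na,
                           const uint16_t *b, size_t nb)
{
    size_t i;
    assert(dst != NULL);
    for (i = 0; i < na; i++) dst[i] = a[i];
    for (i = 0; i < nb; i++) dst[na + i] = b[i];
    return na + nb;
}
```

If each input ends on a code point boundary, no pair can straddle the join,
so the concatenation of two well-formed strings is itself well-formed,
whichever multibyte charset each one was originally converted from.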


So, you are saying if I use the mbtowc for EUC-JP in current newlib and
concatenate that to UTF-16 widechar output and add mbtowc output for
SJIS, a user can simply call wctomb() in newlib and have it pull it all
apart again? This obviously won't work for the old eucjp and sjis
versions of mbtowc/wctomb that Cygwin doesn't currently use, but even
so, I still see 3 versions of wctomb (utf8, iso, and cp) that apply to
Cygwin inside wctomb_r. Am I missing something? How can one of these
functions handle all types of wchar input?

If one cannot take the concatenated string and pass it to a single
internal version of the wctomb() function (i.e. the user has to call 3
versions of wctomb for different charsets), then the user has to know
where each section begins in the full string which makes the end-result
of little use and thus not worth supporting.

Never mind. Let me retract that. I get it now.

The advantage of having the strings available in wchar_t representation
would be that the wcsftime and wprintf functions don't have to worry
about charsets at all. In contrast, the current solution requires a
conversion from multibyte, which means you have to *know* which source
charset was being used when creating these strings. Right now they only
have information about one charset, which is the LC_CTYPE charset.

In Glibc, as well as on Windows, the localization strings are originally
stored in Unicode on disk, and Glibc stores the strings internally in
multibyte and wchar_t representation. When Cygwin fetches the strings
from Windows it has to convert them to multibyte since there is no
wchar_t slot for the data, and following POSIX, it has to store them in
the charset given for the locale category, LC_TIME, LC_MESSAGES, etc.

Under those circumstances, it seems a reasonable strategy for Cygwin regardless of the multiple charset support.

I think one could optionally flag an error either in the setlocale
routine or the wprintf routines themselves.


Well, if the conversion doesn't work, vfwprintf just falls back to the
defaults for the C locale and switches off grouping. That's probably
the sanest thing to do.
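
A minimal sketch of that kind of fallback, assuming a helper that converts
the multibyte decimal point and thousands separator with the current
LC_CTYPE converter. This is not the actual vfwprintf code; the struct and
function names are invented for illustration.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical holder for the wide numeric formatting characters. */
struct wnum_fmt_sketch {
    wchar_t decimal_point;
    wchar_t thousands_sep;   /* L'\0' means grouping is switched off */
};

/* Converts the first multibyte character of s with the current LC_CTYPE
   converter. Returns 1 on success, 0 if s is empty or not convertible. */
static int first_wchar(const char *s, wchar_t *out)
{
    mbstate_t st;
    size_t r;

    assert(out != NULL);
    if (s == NULL || *s == '\0')
        return 0;
    memset(&st, 0, sizeof st);
    r = mbrtowc(out, s, strlen(s), &st);
    return r != (size_t)-1 && r != (size_t)-2 && r != 0;
}

/* If any conversion fails, return the C-locale defaults: '.' as the
   decimal point and no grouping. */
static struct wnum_fmt_sketch load_wnum_fmt(const char *dp_mb,
                                            const char *ts_mb)
{
    struct wnum_fmt_sketch f = { L'.', L'\0' };  /* C-locale defaults */
    wchar_t dp, ts;

    if (!first_wchar(dp_mb, &dp))
        return f;
    if (ts_mb != NULL && *ts_mb != '\0') {
        if (!first_wchar(ts_mb, &ts))
            return f;                /* fall back entirely, no grouping */
        f.thousands_sep = ts;
    }
    f.decimal_point = dp;
    return f;
}
```

The sketch treats a failed conversion of either string as a reason to drop
both, so the output never mixes a localized separator with a default one.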
If wcsftime fails to convert the format string it returns 0, which is
the defined error behaviour. In case of the new era and alt_digits
strings (http://sourceware.org/ml/newlib/2010/msg00153.html), it will
fail to store the era and alt_digits information and fall back to the
default behaviour: %EC -> %C, %EY -> %Y, %OH -> %H, etc.
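
The modifier fallback can be illustrated with a small helper that rewrites a
wide format string by dropping the E and O modifiers. This only demonstrates
the %EC -> %C style mapping; it is an assumption-laden sketch, not how
wcsftime is implemented, and strip_alt_modifiers is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Copies fmt into dst, dropping an 'E' or 'O' modifier that follows '%'
   and precedes a conversion character. "%%" is copied as a unit so the
   character after it is never mistaken for a modifier. Returns the number
   of wide characters written (excluding the terminator), or (size_t)-1 if
   dst is too small. */
static size_t strip_alt_modifiers(wchar_t *dst, size_t dstlen,
                                  const wchar_t *fmt)
{
    size_t n = 0;

    assert(dst != NULL && fmt != NULL);
    if (dstlen == 0)
        return (size_t)-1;

    while (*fmt != L'\0') {
        if (fmt[0] == L'%' && fmt[1] != L'\0') {
            const wchar_t *conv = fmt + 1;
            if ((*conv == L'E' || *conv == L'O') && conv[1] != L'\0')
                conv++;                    /* skip the modifier */
            if (n + 2 >= dstlen)
                return (size_t)-1;
            dst[n++] = L'%';
            dst[n++] = *conv;
            fmt = conv + 1;
        } else {
            if (n + 1 >= dstlen)
                return (size_t)-1;
            dst[n++] = *fmt++;
        }
    }
    dst[n] = L'\0';
    return n;
}
```

For example, passing L"%EC %EY %OH" yields L"%C %Y %H", which is the format
the default behaviour described above effectively applies.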

That's probably ok, given the POSIX.1-2008 quote given by Andy in
http://sourceware.org/ml/newlib/2010/msg00146.html
I just hoped we could do better.

Corinna

Follow-Ups:
- Re: codeset problems in wprintf and wcsftime
  - From: Corinna Vinschen

References:
- codeset problems in wprintf and wcsftime
  - From: Corinna Vinschen
- Re: codeset problems in wprintf and wcsftime
  - From: Jeff Johnston
- Re: codeset problems in wprintf and wcsftime
  - From: Corinna Vinschen
- Re: codeset problems in wprintf and wcsftime
  - From: Jeff Johnston
