This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.



Re: codeset problems in wprintf and wcsftime


On 24/02/10 04:17 AM, Corinna Vinschen wrote:
On Feb 23 16:14, Jeff Johnston wrote:
On 20/02/10 10:59 AM, Corinna Vinschen wrote:
AFAICS, there are two possible approaches to fix this problem:

- Store the charset not only for LC_CTYPE, but for each localization
   category, and provide a function to request the charset.
   This also requires storing the associated multibyte to widechar
   conversion functions, obviously, and calling the correct functions
   from wprintf and wcsftime.

- Redefine the locale data structs so that they contain multibyte and
   widechar representations of all strings.  Use the multibyte strings
   in the multibyte functions, the widechar strings in the widechar
   functions.
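To make the second approach concrete, here is a minimal sketch of locale data that carries each string in both representations. The names (dual_str, time_names_sketch, c_time_names, am_pm_mb, am_pm_wc) are hypothetical and chosen for illustration only; they are not newlib's actual locale structs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical: one locale string kept in both representations. */
struct dual_str {
    const char    *mb;  /* multibyte form, in the category's charset */
    const wchar_t *wc;  /* wide-character form */
};

/* Hypothetical, heavily reduced subset of LC_TIME data. */
struct time_names_sketch {
    struct dual_str am_pm[2];
    struct dual_str d_fmt;
};

/* "C" locale values, where both forms are plain ASCII. */
static const struct time_names_sketch c_time_names = {
    { { "AM", L"AM" }, { "PM", L"PM" } },
    { "%m/%d/%y", L"%m/%d/%y" }
};

/* A multibyte function such as strftime would read the mb member ... */
const char *am_pm_mb(const struct time_names_sketch *t, int hour)
{
    assert(t != NULL && hour >= 0 && hour < 24);
    return t->am_pm[hour >= 12].mb;
}

/* ... and a wide function such as wcsftime would read the wc member,
   without needing to know any charset. */
const wchar_t *am_pm_wc(const struct time_names_sketch *t, int hour)
{
    assert(t != NULL && hour >= 0 && hour < 24);
    return t->am_pm[hour >= 12].wc;
}
```

With this layout the conversion from multibyte to wide characters happens once, when the locale data is loaded, which is the only point where the source charset is known for certain.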


This assumes that widechar representations from separate mbtowc converters can be concatenated and then decoded by a single wctomb converter. Without this ability, the concatenated widechar string is of no use to anybody unless they know where the charset changes occur.

IMO, this is "undefined behaviour".

I don't understand. The wide char representation is Unicode. Why should it be a problem to use Unicode strings together, just because they are from different sources? Even if wchar_t is UTF-16, as on Cygwin, the strings are complete. There's no such thing as just one half of a surrogate.


So, you are saying if I use the mbtowc for EUC-JP in current newlib and concatenate that to UTF-16 widechar output and add mbtowc output for SJIS, a user can simply call wctomb() in newlib and have it pull it all apart again? This obviously won't work for the old eucjp and sjis versions of mbtowc/wctomb that Cygwin doesn't currently use, but even so, I still see 3 versions of wctomb (utf8, iso, and cp) that apply to Cygwin inside wctomb_r. Am I missing something? How can one of these functions handle all types of wchar input?


If one cannot take the concatenated string and pass it to a single internal version of the wctomb() function (i.e. the user has to call 3 versions of wctomb for different charsets), then the user has to know where each section begins in the full string which makes the end-result of little use and thus not worth supporting.


The advantage of having the strings available in wchar_t representation
would be that the wcsftime and wprintf functions don't have to worry
about charsets at all.  The current solution, in contrast, requires a
conversion from multibyte, which means you have to *know* which source
charset was used when these strings were created.  Right now the
functions only have information about one charset, which is the LC_CTYPE
charset.
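As an illustration of why the source charset has to be known, the following sketch decodes the same two bytes under two different assumptions. Both decoders are hand-written for this example (they are not newlib's mbtowc implementations), and the UTF-8 one deliberately handles only 1- and 2-byte sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* ISO-8859-1: every byte maps to the code point of the same value. */
size_t latin1_to_wcs(const unsigned char *s, size_t len, wchar_t *out)
{
    assert(s != NULL && out != NULL);
    for (size_t i = 0; i < len; i++)
        out[i] = (wchar_t)s[i];
    return len;
}

/* Minimal UTF-8 decoder: 1- and 2-byte sequences only.
   Returns (size_t)-1 on anything else. */
size_t utf8_to_wcs(const unsigned char *s, size_t len, wchar_t *out)
{
    size_t n = 0;

    assert(s != NULL && out != NULL);
    for (size_t i = 0; i < len; ) {
        if (s[i] < 0x80) {
            out[n++] = (wchar_t)s[i++];
        } else if (s[i] >= 0xC2 && s[i] <= 0xDF &&
                   i + 1 < len && (s[i + 1] & 0xC0) == 0x80) {
            out[n++] = (wchar_t)(((s[i] & 0x1F) << 6) | (s[i + 1] & 0x3F));
            i += 2;
        } else {
            return (size_t)-1;
        }
    }
    return n;
}
```

Given the bytes 0xC3 0xA9, the ISO-8859-1 decoder produces two wide characters (U+00C3, U+00A9), while the UTF-8 decoder produces the single character U+00E9. A function that only knows the LC_CTYPE charset will pick the wrong decoder whenever LC_TIME or LC_NUMERIC was set to a locale with a different charset.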

In Glibc, as well as on Windows, the localization strings are originally
stored in Unicode on disk, and Glibc stores the strings internally in multibyte
and wchar_t representation.  When Cygwin fetches the strings from Windows
it has to convert them to multibyte since there is no wchar_t slot for
the data, and following POSIX, it has to store them in the charset given
for the locale category, LC_TIME, LC_MESSAGES, etc.

I think one could optionally flag an error either in the setlocale
routine or the wprintf routines themselves.

Well, if the conversion doesn't work, vfwprintf just falls back to the defaults for the C locale and switches off grouping. That's probably the sanest thing to do. If wcsftime fails to convert the format string it returns 0, which is the defined error behaviour. In case of the new era and alt_digits strings (http://sourceware.org/ml/newlib/2010/msg00153.html), it will fail to store the era and alt_digits information and fall back to the default behaviour: %EC -> %C, %EY -> %Y, %OH -> %H, etc.
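The effect of that fallback on a format string can be sketched as follows. This is only an illustration of the %EC -> %C mapping: the helper name drop_eo_modifiers is hypothetical, and the real code presumably makes this decision per conversion specifier inside wcsftime rather than by rewriting the format.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Copy fmt to out, dropping the E and O modifiers so that
   %EC becomes %C, %OH becomes %H, and so on.
   Returns the number of wide characters written (excluding the
   terminator), or 0 if out is too small, mirroring the wcsftime
   convention. */
size_t drop_eo_modifiers(const wchar_t *fmt, wchar_t *out, size_t outsz)
{
    size_t n = 0;

    assert(fmt != NULL && out != NULL);
    if (outsz == 0)
        return 0;

    while (*fmt != L'\0') {
        if (n + 1 >= outsz)
            return 0;
        out[n++] = *fmt;
        if (*fmt == L'%') {
            fmt++;
            /* Skip a modifier only if a specifier follows it. */
            if ((*fmt == L'E' || *fmt == L'O') && fmt[1] != L'\0')
                fmt++;
            if (*fmt == L'\0')
                break;
            if (n + 1 >= outsz)
                return 0;
            out[n++] = *fmt;   /* the conversion specifier (or a literal %) */
        }
        fmt++;
    }
    out[n] = L'\0';
    return n;
}
```

Note that "%%" is copied through unchanged, so a literal percent sign followed by an E or O is not mistaken for a modifier.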

That's probably ok, given the POSIX.1-2008 text quoted by Andy in
http://sourceware.org/ml/newlib/2010/msg00146.html
I just hoped we could do better.


Corinna



