On 24/02/10 04:17 AM, Corinna Vinschen wrote:
On Feb 23 16:14, Jeff Johnston wrote:
On 20/02/10 10:59 AM, Corinna Vinschen wrote:
AFAICS, there are two possible approaches to fix this problem:
- Store the charset not only for LC_CTYPE, but for each localization
category, and provide a function to request the charset.
This also requires storing the associated multibyte-to-widechar
conversion functions, obviously, and calling the correct functions
from wprintf and wcsftime.
- Redefine the locale data structs so that they contain multibyte and
widechar representations of all strings. Use the multibyte strings
in the multibyte functions, the widechar strings in the widechar
functions.
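As a rough illustration of the two data layouts listed above, here is a
minimal sketch. Every name in it (lc_cat, set_category_charset,
get_category_charset, time_strings, month_mb, month_wc) is hypothetical
and chosen for illustration only; none of it is newlib's actual locale
code, and the month table is cut down to two entries.

```c
#include <stddef.h>
#include <wchar.h>

/* Approach 1 (hypothetical): remember one charset per category. */
enum lc_cat { CAT_CTYPE, CAT_TIME, CAT_MESSAGES, CAT_COUNT };

static const char *cat_charset[CAT_COUNT] = { "UTF-8", "UTF-8", "UTF-8" };

/* Hypothetical setter, called when a category's locale is changed. */
void set_category_charset(enum lc_cat cat, const char *name)
{
    cat_charset[cat] = name;
}

/* Hypothetical "request the charset" function for wide-char callers. */
const char *get_category_charset(enum lc_cat cat)
{
    return cat_charset[cat];
}

/* Approach 2 (hypothetical): keep both representations side by side. */
struct time_strings {
    const char    *mon[2];   /* multibyte, in the LC_TIME charset */
    const wchar_t *wmon[2];  /* the same strings as wchar_t */
};

const struct time_strings c_time = {
    { "January", "February" },
    { L"January", L"February" }
};

/* Multibyte functions read the multibyte strings... */
const char *month_mb(const struct time_strings *t, int i)
{
    return t->mon[i];
}

/* ...and widechar functions read the widechar strings, no conversion. */
const wchar_t *month_wc(const struct time_strings *t, int i)
{
    return t->wmon[i];
}
```

With the first layout, a widechar caller still has to pick a converter
based on get_category_charset(); with the second, it only reads wmon.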
This assumes that widechar representations from separate mbtowc
converters can be concatenated and decoded by a single wctomb
converter. Without this ability, the resulting concatenated widechar
string is of no use to anybody unless they know where the charset
changes occur.
IMO, this is "undefined behaviour".
I don't understand. The wide char representation is Unicode. Why
should it be a problem to use Unicode strings together, just because
they are from different sources? Even if wchar_t is UTF-16, as on
Cygwin, the strings are complete. There's no such thing as just one
half of a surrogate.
So, you are saying if I use the mbtowc for EUC-JP in current newlib and
concatenate that to UTF-16 widechar output and add mbtowc output for
SJIS, a user can simply call wctomb() in newlib and have it pull it all
apart again? This obviously won't work for the old eucjp and sjis
versions of mbtowc/wctomb that Cygwin doesn't currently use, but even
so, I still see 3 versions of wctomb (utf8, iso, and cp) that apply to
Cygwin inside wctomb_r. Am I missing something? How can one of these
functions handle all types of wchar input?
If one cannot take the concatenated string and pass it to a single
internal version of the wctomb() function (i.e. the user has to call 3
versions of wctomb for different charsets), then the user has to know
where each section begins in the full string which makes the end-result
of little use and thus not worth supporting.
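The question can be made concrete with a small, self-contained sketch.
It assumes that every decoder maps its charset to Unicode code points,
which is exactly the point under discussion. The converters below are
hand-rolled stand-ins with hypothetical names, not newlib's mbtowc or
wctomb implementations, and the ISO-8859-15 decoder only handles the
euro sign.

```c
#include <stddef.h>
#include <wchar.h>

/* ISO-8859-1: every byte equals its Unicode code point. */
wchar_t latin1_to_wc(unsigned char c)
{
    return (wchar_t)c;
}

/* ISO-8859-15: this sketch only remaps 0xA4 to the euro sign U+20AC. */
wchar_t latin9_to_wc(unsigned char c)
{
    return c == 0xA4 ? (wchar_t)0x20AC : (wchar_t)c;
}

/* Decode a single-byte string with the given per-charset converter. */
size_t decode(const char *src, wchar_t (*conv)(unsigned char), wchar_t *dst)
{
    size_t n = 0;
    while (*src)
        dst[n++] = conv((unsigned char)*src++);
    dst[n] = L'\0';
    return n;
}

/* One UTF-8 encoder for the whole string (BMP code points only). */
size_t wcs_to_utf8(const wchar_t *src, char *dst)
{
    size_t n = 0;
    for (; *src; src++) {
        unsigned long cp = (unsigned long)*src;
        if (cp < 0x80) {
            dst[n++] = (char)cp;
        } else if (cp < 0x800) {
            dst[n++] = (char)(0xC0 | (cp >> 6));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        } else {
            dst[n++] = (char)(0xE0 | (cp >> 12));
            dst[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    dst[n] = '\0';
    return n;
}

/* A Latin-1 encoder cannot represent everything in the joined string. */
int wc_to_latin1(wchar_t wc, char *out)
{
    if ((unsigned long)wc > 0xFF)
        return -1;
    *out = (char)wc;
    return 1;
}

/* Decode two strings from different charsets, join them, encode once. */
size_t demo_concat(char *utf8_out)
{
    wchar_t a[16], b[16], joined[32];
    decode("caf\xE9 ", latin1_to_wc, a);   /* ISO-8859-1 input  */
    decode("5\xA4", latin9_to_wc, b);      /* ISO-8859-15 input */
    wcscpy(joined, a);
    wcscat(joined, b);
    return wcs_to_utf8(joined, utf8_out);
}
```

In this sketch the UTF-8 encoder handles the joined string without
knowing where the source charset changed, while wc_to_latin1() rejects
the euro sign. Whether the real converters all meet the Unicode
assumption is the open question.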
The advantage of having the strings available in wchar_t representation
would be that the wcsftime and wprintf functions don't have to worry
about charsets at all. The current solution, in contrast, requires a
conversion from multibyte, which means you have to *know* which source
charset was used when these strings were created. Right now they only
have information about one charset, which is the LC_CTYPE charset.
In Glibc, as well as on Windows, the localization strings are originally
stored in Unicode on disk, and Glibc stores the strings internally in
multibyte and wchar_t representation. When Cygwin fetches the strings
from Windows, it has to convert them to multibyte since there is no
wchar_t slot for the data, and following POSIX, it has to store them in
the charset given for the locale category, LC_TIME, LC_MESSAGES, etc.