This is the mail archive of the
libc-alpha@sources.redhat.com
mailing list for the glibc project.
Re: BUG: %lc in printf fails with transliteration
- To: libc-alpha at sources dot redhat dot com
- Subject: Re: BUG: %lc in printf fails with transliteration
- From: Markus Kuhn <Markus dot Kuhn at cl dot cam dot ac dot uk>
- Date: Mon, 25 Sep 2000 17:20:55 +0100
Ulrich Drepper wrote on 2000-09-25 15:23 UTC:
> Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk> writes:
>
> > It is disappointing to hear that the wc*tomb* cannot handle
> > transliteration.
>
> It is impossible with the wc*tomb* interface.
Surely not.
> wcrtomb and wcsrtombs
> don't get the size of the input buffer passed and therefore the user's
> allocation of a MB_CUR_MAX bytes sized buffer must be enough. But,
> for instance for ASCII, MB_CUR_MAX is 1 and therefore we cannot
> transliterate ö to oe.
The standard (§7.20) only says that
"MB_CUR_MAX [..]
expands to a positive integer expression with type
size_t that is the maximum number of bytes in a multibyte
character for the extended character set specified by the
current locale (category LC_CTYPE), which is never greater
than MB_LEN_MAX."
If a locale contains a transliteration L"ü" -> "ue", then for this
locale the implementation will have to make sure that MB_CUR_MAX >=
strlen("ue").
There are various ways to implement this:
a) The localedef command can determine the maximum
transliteration length for each locale and use this
value for MB_CUR_MAX.
b) Alternatively, you can impose a fixed limit on the maximally
allowed strlen of a transliteration substitution sequence. The number 4
sounds very reasonable to me, e.g. L"½" -> " 1/2" is usually the worst
case in all the UCS -> ASCII transliteration tables that I have seen.
Then simply make sure that MB_CUR_MAX >= 4 if transliteration is used
and that localedef enforces the maximum strlen of a locale substitution.
Transliteration is in every way just a multi-byte encoding in the sense
of ISO C. Its only notable difference from UTF-8 is that transliteration
is not a round-trip compatible mapping from wide strings to byte
strings. But §5.2.1.2 does not require a multi-byte encoding to be
bijective, therefore a transliteration is a fully valid multi-byte
encoding that can be appropriately handled by wcrtomb and wcsrtombs,
just like UTF-8 and Shift-JIS.
> For well defined input. It is completely undefined what happens if
> you pass invalid input. And ö is invalid in the C locale.
The C standard does *not* specify what character set the "C" locale
supports. It only requires that the basic execution character set shall
at least contain all the characters listed in §5.2.1. ASCII, ISO 8859-1
and ISO 10646 are just three examples of character sets that fulfill
this requirement. It is an unfortunate but widespread misconception that
the "C" locale has to be restricted to ASCII in some way. In fact, the
careful wording of the <ctype.h> function semantics in the standard
makes it quite clear that the authors had many different possible
character sets and repertoires for the "C" locale in mind.
It is highly desirable that the "C" locale treats L"ü" as a valid normal
character; otherwise there would be little point in wchar_t always being
UCS coded. Glibc defines __STDC_ISO_10646__, therefore the extended
character set of the C locale had better be ISO 10646, and L"ü" had
better not be "invalid".
A locale such as C can definitely have various multi-byte coding
variants, just as de_DE can.
We can have multiple C locales:
C.UTF-8 (MB_CUR_MAX == 6)
C.ISO-8859-1 (should preferably use some transliteration, MB_CUR_MAX == 4)
C.ASCII (should preferably use even more transliteration,
e.g. L"ß" -> "ss", MB_CUR_MAX == 4)
It is a matter of personal taste which of the above locales the more
generic "C" should be an alias for (just as with all the
language/region named locales). The current glibc uses "C" == "C.ASCII", which is
OK but somewhat arbitrary. I personally would prefer for the moment "C"
== "C.ISO-8859-1" and in a few years a switch-over to "C" == "C.UTF-8",
but there is nothing that I could say against your apparent current
personal preference for "C" == "C.ASCII".
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>