This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



Re: BUG: %lc in printf fails with transliteration


Ulrich Drepper wrote on 2000-09-25 15:23 UTC:
> Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk> writes:
> 
> > It is disappointing to hear that the wc*tomb* cannot handle
> > transliteration.
> 
> It is impossible with the wc*tomb* interface.

Surely not.

> wcrtomb and wcsrtombs
> don't get the size of the input buffer passed and therefore the user's
> allocation of a MB_CUR_MAX bytes sized buffer must be enough.  But,
> for instance for ASCII, MB_CUR_MAX is 1 and therefore we cannot
> transliterate ö to oe.

The standard (§7.20) only says that

       "MB_CUR_MAX [..]
       expands to a positive integer expression with type
       size_t that is the maximum number of bytes in a multibyte
       character for the extended character set specified by the
       current locale (category LC_CTYPE), which is never greater
       than MB_LEN_MAX."

If a locale contains a transliteration L"ü" -> "ue", then for this
locale the implementation will have to make sure that MB_CUR_MAX >=
strlen("ue").
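The calling convention under discussion can be sketched as follows. This
is only an illustration, assuming a hosted C99 implementation; the helper
name encode_wchar is hypothetical. It shows that the caller's only size
guarantee is MB_CUR_MAX, which is why that value would have to cover the
longest transliteration.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert one wide character into `out`
   (capacity `outsize` bytes). Returns the number of bytes stored,
   or (size_t)-1 if the character cannot be converted or does not fit. */
static size_t encode_wchar(wchar_t wc, char *out, size_t outsize)
{
    char buf[MB_LEN_MAX];   /* MB_CUR_MAX never exceeds MB_LEN_MAX */
    mbstate_t state;
    size_t n;

    assert(out != NULL);
    memset(&state, 0, sizeof state);    /* initial conversion state */

    /* wcrtomb is not told how large buf is; it relies on the caller
       having provided at least MB_CUR_MAX bytes. */
    n = wcrtomb(buf, wc, &state);
    if (n == (size_t)-1 || n > MB_CUR_MAX || n > outsize)
        return (size_t)-1;

    memcpy(out, buf, n);
    return n;
}
```

If a locale transliterated L"ü" to "ue" while MB_CUR_MAX was still 1, the
n > MB_CUR_MAX check above would reject the result.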

There are various ways to implement this:

  a) The localedef command can determine the maximum
     transliteration length for each locale and use this
     value for MB_CUR_MAX.

  b) Alternatively, you can impose a fixed limit on the maximum
     length of a transliteration substitution sequence. The number 4
     sounds very reasonable to me, e.g. L"½" -> " 1/2" is the worst
     case in all the UCS -> ASCII transliteration tables that I have seen.
     Then simply make sure that MB_CUR_MAX >= 4 if transliteration is used
     and that localedef enforces this limit on every substitution in a locale.
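Both options boil down to a scan over the substitution table. The sketch
below is not localedef code; struct translit_rule, demo_rules and both
function names are made up for illustration, and the three rules are just
the examples mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical in-memory form of one transliteration rule. */
struct translit_rule {
    wchar_t from;       /* source character (UCS code point assumed) */
    const char *to;     /* replacement byte sequence */
};

/* Illustrative rules only, not taken from any real locale source. */
static const struct translit_rule demo_rules[] = {
    { 0x00FC, "ue"   },   /* u with diaeresis */
    { 0x00DF, "ss"   },   /* sharp s */
    { 0x00BD, " 1/2" },   /* vulgar fraction one half */
};

/* Option a): length of the longest replacement, which is a lower
   bound for the MB_CUR_MAX of the locale. */
static size_t max_translit_len(const struct translit_rule *rules,
                               size_t count)
{
    size_t i, longest = 0;

    assert(rules != NULL || count == 0);
    for (i = 0; i < count; i++) {
        size_t len = strlen(rules[i].to);
        if (len > longest)
            longest = len;
    }
    return longest;
}

/* Option b): nonzero if every replacement fits in `limit` bytes. */
static int translit_within_limit(const struct translit_rule *rules,
                                 size_t count, size_t limit)
{
    return max_translit_len(rules, count) <= limit;
}
```

With the three demo rules, option a) yields 4, and a fixed limit of 4
under option b) accepts the table while a limit of 3 rejects it.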

Transliteration is in every way just a multi-byte encoding in the sense
of ISO C. Its only notable difference to UTF-8 is that transliteration
is not a round-trip compatible mapping from wide strings to byte
strings. But §5.2.1.2 does not require a multi-byte encoding to be
bijective, therefore a transliteration is a fully valid multi-byte
encoding that can be appropriately handled by wcrtomb and wcsrtombs,
just like UTF-8 and Shift-JIS.
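To make the non-bijective point concrete, here is a toy converter. It is
a sketch only: toy_translit is a hypothetical name, the two rules are
hard-coded, and it assumes that wchar_t holds UCS code points and that
the execution character set is ASCII compatible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical converter: wide string -> ASCII with a toy
   transliteration. Returns the number of bytes written (excluding the
   terminator), or (size_t)-1 if the output does not fit or a character
   has no mapping. */
static size_t toy_translit(const wchar_t *in, char *out, size_t outsize)
{
    size_t used = 0;

    assert(in != NULL && out != NULL);
    if (outsize == 0)
        return (size_t)-1;

    for (; *in != L'\0'; in++) {
        char one[2] = { 0, 0 };
        const char *rep;
        size_t len;

        if (*in == 0x00FC)                    /* u with diaeresis */
            rep = "ue";
        else if (*in == 0x00DF)               /* sharp s */
            rep = "ss";
        else if (*in > 0 && *in < 0x80) {     /* plain ASCII */
            one[0] = (char)*in;
            rep = one;
        } else
            return (size_t)-1;

        len = strlen(rep);
        if (used + len >= outsize)            /* keep room for '\0' */
            return (size_t)-1;
        memcpy(out + used, rep, len);
        used += len;
    }
    out[used] = '\0';
    return used;
}
```

Both L"Tür" and L"Tuer" come out as the same four bytes "Tuer", so the
mapping cannot be inverted, yet each conversion is well defined.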

> For well defined input.  It is completely undefined what happens if
> you pass invalid input.  And ö is invalid in the C locale.

The C standard does *not* specify what character set the "C" locale
supports. It only requires that the basic execution character set shall
at least contain all the characters listed in §5.2.1. ASCII, ISO 8859-1
and ISO 10646 are just three examples of character sets that fulfill
this requirement. It is an unfortunate but widespread misconception that
the "C" locale has to be restricted to ASCII in some way. In fact, the
careful wording of the <ctype.h> function semantics in the standard
makes it quite clear that the authors had many different possible
character sets and repertoires for the "C" locale in mind.

It is highly desirable that the "C" locale treats L"ü" as a valid normal
character, otherwise wchar_t would not always be UCS coded.
Glibc defines __STDC_ISO_10646__, therefore the extended character set
of the C locale had better be ISO 10646 and L"ü" had better not be "invalid".
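A program can test for that promise at compile time. The following is a
minimal sketch; wchar_is_ucs and ucs_code_point are hypothetical helper
names, and only the macro __STDC_ISO_10646__ comes from the standard.

```c
#include <assert.h>
#include <wchar.h>

/* Nonzero if the implementation promises that wchar_t values are
   ISO 10646 (UCS) code points. */
static int wchar_is_ucs(void)
{
#ifdef __STDC_ISO_10646__
    assert(L'A' == 0x41);   /* wide literals then carry UCS values */
    return 1;
#else
    return 0;
#endif
}

/* Return the UCS code point of wc, or -1 if no such promise is made. */
static long ucs_code_point(wchar_t wc)
{
    if (!wchar_is_ucs())
        return -1;
    return (long)wc;
}
```

Where the macro is defined, the value 0x00FC stored in a wchar_t means
U+00FC (ü) in every locale, including "C".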

A locale such as C can definitely have several multi-byte coding
variants, just as de_DE can.

We can have multiple C locales:

  C.UTF-8      (MB_CUR_MAX == 6)
  C.ISO-8859-1 (should preferably use some transliteration, MB_CUR_MAX == 4)
  C.ASCII      (should preferably use even more transliteration,
                e.g. L"ß" -> "ss", MB_CUR_MAX == 4)
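What a given installation reports for locales like these can be checked
at run time. This is a sketch under assumptions: mb_cur_max_for is a
hypothetical helper, and whether names such as "C.UTF-8" are installed
depends on the system.

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return MB_CUR_MAX for the named locale, or 0 if
   the locale cannot be selected. The previous LC_CTYPE setting is
   restored before returning. */
static size_t mb_cur_max_for(const char *name)
{
    char saved[256];
    const char *cur;
    size_t result = 0;

    assert(name != NULL);

    /* Copy the current name: later setlocale calls may overwrite it. */
    cur = setlocale(LC_CTYPE, NULL);
    if (cur == NULL || strlen(cur) >= sizeof saved)
        return 0;
    strcpy(saved, cur);

    if (setlocale(LC_CTYPE, name) != NULL) {
        result = MB_CUR_MAX;            /* evaluated in the new locale */
        setlocale(LC_CTYPE, saved);
    }
    return result;
}
```

Calling mb_cur_max_for("C") and mb_cur_max_for("C.UTF-8") and comparing
the results shows how much room wcrtomb callers get in each variant.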

It is a matter of personal taste which of the above locales the
more generic "C" should be an alias for (just as with all the language/
region named locales). The current glibc uses "C" == "C.ASCII", which is
OK but somewhat arbitrary. I personally would prefer for the moment "C"
== "C.ISO-8859-1" and in a few years a switch-over to "C" == "C.UTF-8",
but there is nothing that I could say against your apparent current
personal preference for "C" == "C.ASCII".

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

