This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.


bugs in CP1258 converter (1)

To: libc-alpha at sources dot redhat dot com
Subject: bugs in CP1258 converter (1)
From: Bruno Haible <haible at ilog dot fr>
Date: Tue, 15 May 2001 15:24:51 +0200 (CEST)

Hi,

CP1258 is an encoding designed for Vietnamese. However, the glibc iconv
converter for this encoding cannot convert even the simplest Vietnamese
text, such as the greeting taken from the Emacs HELLO file:

$ printf 'Vietnamese (Ti\xe1\xba\xbfng Vi\xe1\xbb\x87t)\tCh\xc3\xa0o b\xe1\xba\xa1n\n' > viet-greeting.txt
$ iconv -f UTF-8 -t CP1258 < viet-greeting.txt > /dev/null
iconv: illegal input sequence at position 14

CP1258 is unusual among the 8-bit encodings in that it contains five
combining characters:
U+0300 COMBINING GRAVE ACCENT
U+0301 COMBINING ACUTE ACCENT
U+0303 COMBINING TILDE
U+0309 COMBINING HOOK ABOVE
U+0323 COMBINING DOT BELOW

In the CP1258 encoding, the combining characters follow the base character,
as in Unicode. But CP1258 also contains some precomposed characters. For
example, the "e" in "Viet" above is

U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW

which in CP1258 is represented as

U+00EA LATIN SMALL LETTER E WITH CIRCUMFLEX
U+0323 COMBINING DOT BELOW

But note that CP1258 does *not* have the combining CIRCUMFLEX.

For this reason, the Unicode to CP1258 converter must partially decompose
some precomposed characters.
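
The partial decomposition described above can be sketched in Python. This
is only an illustration, not the glibc code: it assumes Python's built-in
"cp1258" codec as the mapping table, assumes NFC input, and the helper
names (encodable, partially_decompose, to_cp1258) are hypothetical.

```python
import unicodedata

# The five combining characters of CP1258, as listed above.
TONE_MARKS = ("\u0300", "\u0301", "\u0303", "\u0309", "\u0323")


def encodable(ch: str) -> bool:
    """True if Python's cp1258 codec can encode ch directly."""
    try:
        ch.encode("cp1258")
        return True
    except UnicodeEncodeError:
        return False


def partially_decompose(ch: str) -> str:
    """Split off one tone mark so that the rest fits into CP1258."""
    if encodable(ch):
        return ch
    nfd = unicodedata.normalize("NFD", ch)
    target = unicodedata.normalize("NFC", ch)
    for tone in TONE_MARKS:
        if tone not in nfd[1:]:
            continue
        # Drop the tone mark and recompose what is left (e.g. e + circumflex).
        rest = nfd[0] + nfd[1:].replace(tone, "", 1)
        base = unicodedata.normalize("NFC", rest)
        candidate = base + tone
        # Accept only if the result is canonically equivalent to the input.
        if (len(base) == 1 and encodable(base)
                and unicodedata.normalize("NFC", candidate) == target):
            return candidate
    raise UnicodeEncodeError("cp1258", ch, 0, 1, "no CP1258 representation")


def to_cp1258(text: str) -> bytes:
    """Encode text, partially decomposing characters where needed."""
    return "".join(partially_decompose(c) for c in text).encode("cp1258")
```

With these assumptions, U+1EC7 is emitted as the two bytes 0xEA 0xF2, that
is, U+00EA followed by U+0323.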

And the CP1258 to Unicode converter must produce composed, not decomposed
output, because Normalisation Form C is the standard for text interchange.
So this conversion direction must combine U+00EA U+0323 to a single Unicode
character. This also achieves round-trip compatibility for
Unicode -> CP1258 -> Unicode conversion.
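
As a rough illustration of the composing direction, the following Python
sketch decodes a complete buffer and then applies Normalisation Form C. It
assumes Python's built-in "cp1258" codec, which by itself yields the
decomposed sequence; the function name from_cp1258 is hypothetical. A real
converter may compose only specific pairs rather than run full NFC.

```python
import unicodedata


def from_cp1258(data: bytes) -> str:
    """Decode CP1258 bytes and compose base + combining character pairs."""
    decomposed = data.decode("cp1258")   # 0xEA 0xF2 -> U+00EA U+0323
    return unicodedata.normalize("NFC", decomposed)
```

Under these assumptions, the bytes 0xEA 0xF2 come out as the single
character U+1EC7, while a lone 0xEA stays U+00EA.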

The only way to implement this CP1258 to Unicode converter is as a stateful
converter. (A stateless converter could not map 0xEA 0xF2 to U+1EC7 while
also mapping a lone 0xEA to U+00EA, because after reading 0xEA it cannot yet
know which of the two applies.)

But it turns out that the converter structure in iconv/skeleton.c is not
prepared for stateful encodings whose EMIT_SHIFT_TO_INIT macro outputs
something in the FROM_DIRECTION, only for those where output occurs in the
TO_DIRECTION.
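
To make the need for state concrete, here is a minimal Python sketch of a
decoder that holds back the last character until it knows whether a
combining character follows. It is not the glibc implementation: the class
BufferingCP1258Decoder and its methods are hypothetical, and Python's
"cp1258" codec stands in for the byte-to-character table. The flush()
method plays the role of the end-of-input step that has to emit output in
the CP1258 to Unicode direction.

```python
import unicodedata

# The five combining characters of CP1258, as listed above.
COMBINING = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}


class BufferingCP1258Decoder:
    """Hypothetical stateful decoder: the state is one pending character."""

    def __init__(self) -> None:
        self.pending = ""   # last character read but not yet emitted

    def feed(self, data: bytes) -> str:
        out = []
        for ch in data.decode("cp1258"):
            if self.pending and ch in COMBINING:
                composed = unicodedata.normalize("NFC", self.pending + ch)
                if len(composed) == 1:
                    # Base + combining character form one precomposed character.
                    out.append(composed)
                    self.pending = ""
                    continue
            # No composition: emit the held character, hold the new one.
            out.append(self.pending)
            self.pending = ch
        return "".join(out)

    def flush(self) -> str:
        """Emit whatever is still held back at the end of the input."""
        out, self.pending = self.pending, ""
        return out
```

Feeding b"Vi\xea" returns only "Vi"; the "ê" stays in the state until
either 0xF2 arrives (giving U+1EC7) or flush() is called (giving U+00EA).
Without the flush step, a trailing base character would be lost.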

So here are two patches:
(2) A fix to iconv/skeleton.c to cope with such encodings.
(3) The CP1258 converter and associated tests.

A similar patch for CP1255 will follow later.

Bruno
