This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



Re: ISIRI-3342 converter broken


Kaixo!

Sorry for the delay.

On Mon, Jul 31, 2000 at 05:22:08PM +0200, Bruno Haible wrote:
> 
> The iconv converter module for ISIRI-3342 is broken: In the char -> Unicode
> direction, it does not convert 0x80 to U+0000, as described in the charmap.

That point isn't really very important (I doubt it would even be used).

> And the Unicode -> char conversion is completely broken, because the script
> to generate the .h file from the charmap cannot cope with the non-injective
> mapping: the from_ucs4 table contains extraneous elements.

In fact the problem arises because ISIRI-3342 has all its chars strongly typed
(concerning directionality).
The idea was to replicate all non-letters from ASCII into the upper half
of the byte, so you have two spaces, two tabs, two sets of parentheses, etc.:
one strongly typed left to right, the other strongly typed right to left.

Unicode doesn't have that (it only strongly types the letters: Latin letters
left to right, Arabic letters right to left; almost everything else depends
on the context).

In other words: a correct ISIRI-3342 <-> Unicode conversion needs to know
about the context.
I don't know if iconv can handle that.
If it is possible to know the active direction at any point inside a Unicode
text string, then it would be no problem to convert correctly to ISIRI-3342;
so if glibc can provide a way to know that, it would be the perfect solution.
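
To make the idea concrete, here is a minimal sketch in C of tracking the
"active direction" through a UCS-4 string. It is only a crude approximation
under stated assumptions, not the Unicode bidirectional algorithm: ASCII
letters count as strong left-to-right, the Arabic letter ranges
U+0621..U+064A and U+0671..U+06D3 count as strong right-to-left, and every
other char simply inherits the direction of the last strong char seen. The
names strong_dir and active_dirs are hypothetical, not glibc interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum dir { DIR_LTR, DIR_RTL };

/* Return DIR_LTR or DIR_RTL for a strongly typed char, -1 for a neutral one.
   Assumption: only ASCII letters and two Arabic letter ranges are strong. */
static int strong_dir(uint32_t ch)
{
  if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'))
    return DIR_LTR;
  if ((ch >= 0x0621 && ch <= 0x064a) || (ch >= 0x0671 && ch <= 0x06d3))
    return DIR_RTL;
  return -1;
}

/* Fill out[i] with the direction in effect at text[i]: the direction of the
   most recent strong char, or `base` if none has been seen yet. */
static void active_dirs(const uint32_t *text, size_t n, enum dir base,
                        enum dir *out)
{
  assert(n == 0 || (text != NULL && out != NULL));
  enum dir cur = base;
  for (size_t i = 0; i < n; ++i)
    {
      int d = strong_dir(text[i]);
      if (d >= 0)
        cur = (enum dir) d;
      out[i] = cur;
    }
}
```

A converter could then pick the lower-half or the upper-half variant of a
space or a parenthesis according to out[i]; which byte that is would come
from the charmap and is deliberately left out of this sketch.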

In the other direction it is more complex, as a truly correct conversion may
involve reordering the flow of chars.
On the other hand, an almost blind conversion (using the values in the charset
description file) would be good enough in most cases.

> Here is a fix.

> *** glibc-20000729/iconvdata/isiri-3342.h.bak	Sun Jul 30 18:30:44 2000
> --- glibc-20000729/iconvdata/isiri-3342.h	Sun Jul 30 18:25:03 2000

...

> + static const struct gap from_idx[] = {
> +   { start: 0x0000, end: 0x007f, idx:     0 },
> +   { start: 0x00a4, end: 0x00a4, idx:   -36 },
> +   { start: 0x00ab, end: 0x00ab, idx:   -42 },

Can you briefly explain the meaning of this? What is the idx: value?

> + static const char from_ucs4[] = {

> +   /* 0x06a9..0x06af */
> +   '\xda', '\x00', '\x00', '\x00', '\x00', '\x00', '\xdb',

Why not

  /* 0x06a9..0x06a9 */
  '\xda',
  /* 0x06af..0x06af */
  '\xdb',

instead ?
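
(For reference, a minimal sketch in C of how a gap table like the one quoted
above could be consulted. It assumes that idx is an offset added to the code
point to get an index into from_ucs4, which is at least consistent with the
quoted entries: 0x00a4 - 36 = 128, the first slot after the 128 ASCII
entries. The struct layout, the linear search, and the names gap_lookup,
idx_a/tab_a and idx_b/tab_b are illustrative assumptions, not the actual
glibc code.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of one gap entry: a code point range plus an offset. */
struct gap { uint16_t start; uint16_t end; int32_t idx; };

/* Layout A: one range 0x06a9..0x06af, holes filled with 0 bytes. */
static const struct gap idx_a[] = {
  { .start = 0x06a9, .end = 0x06af, .idx = -0x06a9 },
};
static const unsigned char tab_a[] = { 0xda, 0, 0, 0, 0, 0, 0xdb };

/* Layout B: two single-char ranges, no filler bytes. */
static const struct gap idx_b[] = {
  { .start = 0x06a9, .end = 0x06a9, .idx = -0x06a9 },
  { .start = 0x06af, .end = 0x06af, .idx = -0x06ae },
};
static const unsigned char tab_b[] = { 0xda, 0xdb };

/* Return the byte for `ch`, or -1 if it has no mapping.  The gaps must be
   sorted by start; a 0 byte marks a hole (except for ch == 0 itself). */
static int gap_lookup(const struct gap *gaps, size_t ngaps,
                      const unsigned char *table, uint32_t ch)
{
  assert(gaps != NULL && table != NULL);
  for (size_t i = 0; i < ngaps; ++i)
    {
      if (ch < gaps[i].start)
        break;                      /* ch lies between two ranges */
      if (ch <= gaps[i].end)
        {
          unsigned char b = table[(int32_t) ch + gaps[i].idx];
          return (b == 0 && ch != 0) ? -1 : b;
        }
    }
  return -1;
}
```

Both layouts give the same answers for 0x06a9 and 0x06af. Layout A spends
five filler bytes; layout B spends one more gap entry instead. If an entry
takes 8 bytes, as in this sketch, the single range is the smaller of the two.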



Another thing: there are some chars that are close in shape and are sometimes
interchanged; for example, I've seen web pages in Farsi written in UTF-8
that use the Unicode char ALEF MAKSURA in place of FARSI YEH (the shapes
are almost identical). A strict conversion of such a page to an Iranian
encoding such as ISIRI-3342 will have plenty of holes.
Does iconv have any provision for "loose" conversions? While they are not
"correct" from a strict table-value point of view, that kind of conversion
is much more useful for a human reader: the strict one is useless, whereas
the loose one provides readable text.
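
As an illustration of what such a loose conversion could do, here is a
minimal sketch in C that folds a look-alike char onto the one the target
charset has, before the strict table lookup. It only knows the single pair
mentioned above, ARABIC LETTER ALEF MAKSURA (U+0649) to ARABIC LETTER FARSI
YEH (U+06CC); the names fold_lookalike and fold_text are hypothetical, and
this is not an existing iconv feature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map a look-alike code point to its preferred equivalent; any other
   code point is returned unchanged. */
static uint32_t fold_lookalike(uint32_t ch)
{
  switch (ch)
    {
    case 0x0649:                /* ARABIC LETTER ALEF MAKSURA */
      return 0x06cc;            /* ARABIC LETTER FARSI YEH */
    default:
      return ch;
    }
}

/* Apply the folding in place to a UCS-4 buffer of n code points. */
static void fold_text(uint32_t *text, size_t n)
{
  assert(n == 0 || text != NULL);
  for (size_t i = 0; i < n; ++i)
    text[i] = fold_lookalike(text[i]);
}
```

Running fold_text over the input first would leave the strict converter
unchanged, so the loose behaviour could stay optional.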

A similar problem may happen with the Arabic digits; the 0-9 digits exist
in two shapes: one used in northern Africa (0x0660-0x0669), the other
used in Pakistan, Iran, etc. (0x06f0-0x06f9). However, the shapes of
several of those digits are the same, and it is possible that some
Unicode editors put in the wrong values.
Anyway, it may be desirable, from a user perspective, to convert both sets
of digits to the digits 0xb0-0xb9 in ISIRI-3342.
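
A minimal sketch in C of that digit folding, assuming that the bytes
0xb0-0xb9 hold the digits 0 to 9 in ascending order (the function name
isiri_digit_byte is hypothetical):

```c
#include <stdint.h>

/* Return the ISIRI-3342 byte for a digit from either Unicode digit set,
   or -1 if `ch` is not one of those digits.
   Assumption: 0xb0..0xb9 are the digits 0..9 in order. */
static int isiri_digit_byte(uint32_t ch)
{
  if (ch >= 0x0660 && ch <= 0x0669)
    return 0xb0 + (int) (ch - 0x0660);
  if (ch >= 0x06f0 && ch <= 0x06f9)
    return 0xb0 + (int) (ch - 0x06f0);
  return -1;
}
```

With this, U+0663 and U+06F3 both end up as the same byte, so a text that
mixes the two sets still converts without holes.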

Thank you very much.

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/		PGP Key available, key ID: 0x8F0E4975
