This is the mail archive of the
cygwin-developers
mailing list for the Cygwin project.
representing charsets
- From: Andy Koppe <andy dot koppe at gmail dot com>
- To: cygwin-developers at cygwin dot com
- Date: Tue, 30 Mar 2010 12:49:31 +0100
- Subject: representing charsets
Corinna Vinschen:
> On Mar 29 19:19, Andy Koppe wrote:
>> Corinna Vinschen:
>> > Anyway, feel free to send a patch to change the charset name parameter
>> > to an array index parameter.
>>
>> Attached is a newlib patch for that. The core of it is the removal of
>> the calls to __cp_index and __iso_8859_index from the singlebyte
>> charset conversion functions and adding a __charset_index global
>> variable, but quite a lot of function definitions and calls needed to
>> be changed accordingly to take an index argument instead of a string.
>
> Two problems:
> - The usage of the __charset_index variable should be changed to a call
> to a function __locale_charset_idx (), analogous to the __locale_charset ()
> function. The reason is that, in the long run, we will implement
> the _l family of functions plus the newlocale/uselocale stuff from
> POSIX.1-2008. The global information will be replaced by locale_t
> structures, basically. For that we need access wrapper functions which
> allow using the right locale_t for the current thread.
Ah, I had wondered why those wrappers were there. Will do.
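For illustration, here is a rough sketch of why an accessor helps once per-thread locales arrive. Everything in it is an assumption made for the example: struct locale_info, use_thread_locale() and locale_charset_idx() are hypothetical stand-ins, not newlib's locale_t, uselocale() or __locale_charset_idx(), and _Thread_local needs a C11 compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for per-locale data; not newlib's locale_t. */
struct locale_info {
    int charset_idx;
};

/* Process-wide default, used when a thread has not selected its own locale. */
static struct locale_info global_locale = { 0 };

/* Per-thread override; NULL means "use the global locale". */
static _Thread_local struct locale_info *thread_locale = NULL;

/* Hypothetical analogue of uselocale(): NULL reverts to the global locale. */
void use_thread_locale(struct locale_info *loc)
{
    thread_locale = loc;
}

/* Accessor: callers never touch a global variable directly, so the
   lookup rule can change later without changing any call site. */
int locale_charset_idx(void)
{
    const struct locale_info *cur = thread_locale ? thread_locale : &global_locale;
    assert(cur != NULL);
    return cur->charset_idx;
}
```

With a plain global variable, every reader would have to be touched again when the per-thread lookup is introduced; with the accessor, only the function body changes.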
> - The __monetary_load_locale function and friends, as well as the
>   subsequently called Cygwin functions should still get the charset
>   name.  In a later incarnation(*) they will store the charset names
>   in the locale information.
I see. How shall I tackle these?
1) Pass the charset string as well as the index into these functions.
2) Go back to passing the string only, and introduce an 'int
__charset_index(const char *charset)' function that converts it to an
index where needed.
Of those two, I prefer the second for its cleaner API. It would be
slightly slower, but only in setlocale(), which isn't critical.
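To make option 2 concrete, here is a minimal sketch of such a lookup. The name table is made up for the example and is far shorter than the real one, and the function is called charset_index() rather than __charset_index() to stay out of the reserved namespace; returning -1 for an unknown name is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name table; the real newlib tables are not reproduced here. */
static const char *const charset_names[] = {
    "UTF-8", "ISO-8859-1", "ISO-8859-2", "CP437", "SJIS"
};

/* Returns the position of `charset` in charset_names, or -1 if unknown. */
int charset_index(const char *charset)
{
    assert(charset != NULL);
    size_t n = sizeof charset_names / sizeof charset_names[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(charset, charset_names[i]) == 0)
            return (int)i;
    }
    return -1;
}
```

The linear scan with strcmp() is the "slightly slower" part: it costs one string comparison per table entry, but it would only run when the locale changes, not on every character conversion.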
But actually what I'd really like to do is this:
3) Represent charsets as enum constants (or #defines) rather than
strings throughout, with the singlebyte charsets ordered in such a way
that they correspond to their order in the conversion tables, along
these lines:
enum {
  CS_UTF8 = 0,
  /* ISO singlebyte codepages */
  CS_ISO8859_1 = 1,
  CS_ISO8859_2 = 2,
  ...
  CS_ISO8859_11 = 11,
  /* ISO-8859-12 doesn't exist */
  CS_ISO8859_13 = 12,
  ...
  CS_ISO8859_16 = 15,
  /* Windows singlebyte codepages */
  CS_CP437 = 100,
  CS_CP720 = 101,
  CS_CP737 = 102,
  ...
  /* Multibyte codepages */
  CS_SJIS = 200,
  CS_GBK = 201,
  ...
};
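A rough sketch of what that ordering buys: with the constants laid out this way, a table row falls out of a subtraction instead of a name lookup. The enum below contains only the constants spelled out above, with the elided ones left out. The helper names (cs_iso_row, cs_cp_row, cs_iso_part) and the assumption that each family's table starts at the first constant of its group are mine, not newlib's.

```c
#include <assert.h>

/* Only the constants given explicitly above; elided entries are omitted. */
enum charset {
    CS_UTF8 = 0,
    CS_ISO8859_1 = 1,
    CS_ISO8859_2 = 2,
    CS_ISO8859_11 = 11,
    CS_ISO8859_13 = 12,   /* ISO-8859-12 doesn't exist */
    CS_ISO8859_16 = 15,
    CS_CP437 = 100,
    CS_CP720 = 101,
    CS_CP737 = 102,
    CS_SJIS = 200,
    CS_GBK = 201
};

/* Hypothetical family bases: the first constant of each group. */
enum { ISO_BASE = CS_ISO8859_1, CP_BASE = CS_CP437, MB_BASE = CS_SJIS };

/* Row in a (hypothetical) ISO-8859 table, or -1 if cs is not ISO-8859. */
int cs_iso_row(int cs)
{
    return (cs >= ISO_BASE && cs <= CS_ISO8859_16) ? cs - ISO_BASE : -1;
}

/* Row in a (hypothetical) Windows codepage table, or -1 otherwise. */
int cs_cp_row(int cs)
{
    return (cs >= CP_BASE && cs < MB_BASE) ? cs - CP_BASE : -1;
}

/* Maps an ISO-8859 constant back to its part number, skipping the
   nonexistent part 12. */
int cs_iso_part(int cs)
{
    int row = cs_iso_row(cs);
    assert(row >= 0);
    return row < 11 ? row + 1 : row + 2;
}
```

For example, cs_iso_row(CS_ISO8859_13) yields 11 and cs_iso_part(CS_ISO8859_13) yields 13, so the gap at part 12 is absorbed by the numbering rather than by a hole in the table.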
Obviously, this would require quite a bit of additional work, but I do
think it would be cleaner and a bit more efficient than the current
model. Do you think this is worth pursuing (on the newlib list)?
>> I was concerned I might forget to change a prototype or call
>> somewhere, but actually Cygwin does use all the functions in question,
>> I think.
>
> Yes, except for __jis_mbtowc/__jis_wctomb.
Good point, I'll pay extra attention to those.
Andy