This is the mail archive of the cygwin-developers mailing list for the Cygwin project.

representing charsets

From: Andy Koppe <andy dot koppe at gmail dot com>
To: cygwin-developers at cygwin dot com
Date: Tue, 30 Mar 2010 12:49:31 +0100
Subject: representing charsets

Corinna Vinschen:
> On Mar 29 19:19, Andy Koppe wrote:
>> Corinna Vinschen:
>> > Anyway, feel free to send a patch to change the charset name parameter
>> > to an array index parameter.
>>
>> Attached is a newlib patch for that. The core of it is the removal of
>> the calls to __cp_index and __iso_8859_index from the singlebyte
>> charset conversion functions and adding a __charset_index global
>> variable, but quite a lot of function definitions and calls needed to
>> be changed accordingly to take an index argument instead of a string.
>
> Two problems:
> - The usage of the __charset_index variable should be changed to a call
>  to a function __locale_charset_idx (), analog to the __locale_charset ()
>  function.  The reason is that, in the long run, we will implement
>  the _l family of functions plus the newlocale/uselocale stuff from
>  POSIX-1.2008.  The global information will be replaced by locale_t
>  structures, basically.  For that we need access wrapper functions which
>  allow to use the right locale_t for the current thread.

Ah, I had wondered why those wrappers were there. Will do.
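
For illustration only, the accessor idea could look roughly like the sketch below. Every name in it (locale_sketch, locale_charset_idx_sketch, use_locale_sketch) is a hypothetical stand-in, not newlib code, and it assumes a C11 compiler for _Thread_local. The point is merely that callers go through a function, so a per-thread locale can later take precedence over the global one without touching the callers.

```c
#include <stddef.h>

/* Hypothetical, much reduced stand-in for a future locale_t structure. */
struct locale_sketch
{
  int charset_idx;
};

/* Global locale information, as used today (assumed to start at index 0). */
static struct locale_sketch global_locale_sketch = { 0 };

/* Per-thread override; NULL means "use the global locale". */
static _Thread_local struct locale_sketch *thread_locale_sketch = NULL;

/* Hypothetical counterpart of uselocale (): select a locale for this thread. */
void
use_locale_sketch (struct locale_sketch *loc)
{
  thread_locale_sketch = loc;
}

/* Accessor in the spirit of the suggested __locale_charset_idx ():
   callers never touch the global variable directly. */
int
locale_charset_idx_sketch (void)
{
  const struct locale_sketch *loc =
    thread_locale_sketch ? thread_locale_sketch : &global_locale_sketch;
  return loc->charset_idx;
}
```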


> - The __monetary_load_locale function and friends, as well as the
>  subsequently called Cygwin functions should still get the charset
>  name.  In a later incarnation(*) they will store the charset names
>  in the locale information.

I see. How shall I tackle these?

1) Pass the charset string as well as the index into these functions.
2) Go back to passing the string only, and introduce an 'int
__charset_index(const char *charset)' function that converts it to an
index where needed.

Of those two, I prefer the second for its cleaner API. It would be
slightly slower, but only in setlocale(), which isn't critical.
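
To make option 2 concrete, here is a minimal sketch of such a lookup. The name table is an illustrative subset in an assumed order, and charset_index_sketch is a hypothetical stand-in for the proposed __charset_index; the real function would have to follow the order of the actual conversion tables.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative subset only, in an assumed order; the real conversion
   tables cover many more charsets. */
static const char *const charset_names_sketch[] = {
  "ISO-8859-1",
  "ISO-8859-2",
  "CP437",
  "CP720",
};

/* Hypothetical stand-in for the proposed __charset_index ():
   returns the table index of CHARSET, or -1 if it is not listed. */
int
charset_index_sketch (const char *charset)
{
  size_t i;
  size_t n = sizeof charset_names_sketch / sizeof charset_names_sketch[0];

  for (i = 0; i < n; i++)
    if (strcmp (charset, charset_names_sketch[i]) == 0)
      return (int) i;
  return -1;
}
```

The linear search is where the "slightly slower" comes from, but it would only run when setlocale() changes the charset.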

But actually what I'd really like to do is this:

3) Represent charsets as enum constants (or #defines) rather than
strings throughout, with the singlebyte charsets ordered in such a way
that they correspond to their order in the conversion tables, along
these lines:

enum {
  CS_UTF8 = 0,

  /* ISO singlebyte codepages */
  CS_ISO8859_1 = 1,
  CS_ISO8859_2 = 2,
  /* ... */
  CS_ISO8859_11 = 11,
  /* ISO-8859-12 doesn't exist */
  CS_ISO8859_13 = 12,
  /* ... */
  CS_ISO8859_16 = 15,

  /* Windows singlebyte codepages */
  CS_CP437 = 100,
  CS_CP720 = 101,
  CS_CP737 = 102,
  /* ... */

  /* Multibyte codepages */
  CS_SJIS = 200,
  CS_GBK = 201,
  /* ... */
};

Obviously, this would require quite a bit of additional work, but I do
think it would be cleaner and a bit more efficient than the current
model. Do you think this is worth pursuing (on the newlib list)?
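
As a rough sketch of the efficiency argument: with constants numbered like this, finding a charset's row in a conversion table becomes a range check and a subtraction instead of a string comparison. The helper names below are hypothetical, the enum repeats only a few of the constants, and the range bounds (ISO rows end at CS_ISO8859_16, Windows codepages stay below CS_SJIS) are assumptions read off the numbering above.

```c
/* Reduced copy of the proposed constants; the elided entries are left out. */
enum charset_sketch
{
  CS_UTF8 = 0,
  CS_ISO8859_1 = 1,
  CS_ISO8859_2 = 2,
  CS_ISO8859_16 = 15,
  CS_CP437 = 100,
  CS_CP720 = 101,
  CS_CP737 = 102,
  CS_SJIS = 200,
  CS_GBK = 201
};

/* Hypothetical helper: 0-based row in the ISO-8859 conversion table,
   or -1 if CS is not an ISO singlebyte charset. */
int
iso_table_row_sketch (int cs)
{
  if (cs >= CS_ISO8859_1 && cs <= CS_ISO8859_16)
    return cs - CS_ISO8859_1;
  return -1;
}

/* Hypothetical helper: 0-based row in the Windows codepage conversion
   table, or -1 if CS is not a Windows singlebyte codepage.  Assumes all
   such codepages are numbered below CS_SJIS. */
int
cp_table_row_sketch (int cs)
{
  if (cs >= CS_CP437 && cs < CS_SJIS)
    return cs - CS_CP437;
  return -1;
}
```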


>> I was concerned I might forget to change a prototype or call
>> somewhere, but actually Cygwin does use all the functions in question,
>> I think.
>
> Yes, except for __jis_mbtowc/__jis_wctomb.

Good point, I'll pay extra attention to those.

Andy

Follow-Ups:
- Re: representing charsets
  - From: Corinna Vinschen
