representing charsets

Tue Mar 30 14:47:00 GMT 2010

On Mar 30 12:49, Andy Koppe wrote:
> Corinna Vinschen:
> > Two problems:
> > - The usage of the __charset_index variable should be changed to a call
> >  to a function __locale_charset_idx (), analogous to the __locale_charset ()
> >  function.  The reason is that, in the long run, we will implement
> >  the _l family of functions plus the newlocale/uselocale stuff from
> >  POSIX-1.2008.  The global information will be replaced by locale_t
> >  structures, basically.  For that we need access wrapper functions which
> >  allow using the right locale_t for the current thread.
> 
> Ah, I had wondered why those wrappers were there. Will do.

If that wasn't entirely clear, the locale_t structures will also contain
wctomb/mbtowc and ctype function pointers, charset names, the array
indices we're talking about, etc.
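
To make the idea of access wrappers concrete, here is a minimal sketch.
Everything named sketch_* is hypothetical and only for illustration; it
is not the newlib layout, and the real locale_t would carry much more
(the wctomb/mbtowc and ctype pointers, for a start).  It assumes a C11
compiler for _Thread_local.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-locale record; field names are illustrative only.  */
struct sketch_locale
{
  char charset[32];   /* charset name, kept around for printing */
  int charset_idx;    /* index into the iso or cp tables */
};

/* Global locale, as setlocale would maintain it.  */
static struct sketch_locale sketch_global_locale = { "ASCII", 0 };

/* Per-thread override, as uselocale would install it.  NULL means
   "fall back to the global locale".  */
static _Thread_local struct sketch_locale *sketch_thread_locale = NULL;

/* Fill in a locale record.  */
void
sketch_locale_init (struct sketch_locale *loc, const char *charset, int idx)
{
  assert (loc != NULL && charset != NULL);
  strncpy (loc->charset, charset, sizeof loc->charset - 1);
  loc->charset[sizeof loc->charset - 1] = '\0';
  loc->charset_idx = idx;
}

/* Stand-in for uselocale: install LOC for the calling thread only.  */
void
sketch_use_locale (struct sketch_locale *loc)
{
  sketch_thread_locale = loc;
}

/* Pick the locale which is valid for the calling thread.  */
static struct sketch_locale *
sketch_current_locale (void)
{
  return sketch_thread_locale ? sketch_thread_locale : &sketch_global_locale;
}

/* Access wrappers.  Callers use these instead of global variables, so
   they keep working once per-thread locales exist.  */
const char *
sketch_locale_charset (void)
{
  return sketch_current_locale ()->charset;
}

int
sketch_locale_charset_idx (void)
{
  return sketch_current_locale ()->charset_idx;
}
```

Code which calls sketch_locale_charset_idx () doesn't have to care
whether the value comes from the global locale or from a thread's own
one, which is the whole point of not exporting the variable itself.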

> > - The __monetary_load_locale function and friends, as well as the
> >  subsequently called Cygwin functions should still get the charset
> >  name.  In a later incarnation(*) they will store the charset names
> >  in the locale information.
> 
> I see. How shall I tackle these?
> 
> 1) Pass the charset string as well as the index into these functions.

This one.  Typically the charset name is used for informational
purposes.  The index is only used as additional information for the
iso and cp functions.  The important pieces of information are the
__mbtowc and __wctomb pointers and, later, the pointers to the ctype
arrays as part of a locale_t.

As for the charset name, it's just stored away to print it.  Sure,
we could fetch this information from the locale string, but the work
has already been performed in loadlocale, so I don't see a reason to
repeat it.  As for performance, I think we can neglect that in this
case, given that setlocale is typically only called once.

> But actually what I'd really like to do is this:
> 
> 3) Represent charsets as enum constants (or #defines) rather than
> strings throughout, with the singlebyte charsets ordered in such a way
> that they correspond to their order in the conversion tables, along
> these lines:
> 
> enum {
>   CS_UTF8 = 0,
> 
>   /* ISO singlebyte codepages */
>   CS_ISO8859_1 = 1,
>   CS_ISO8859_2 = 2,
>   ...
>   CS_ISO8859_11 = 11,
>   /* ISO-8859-12 doesn't exist */
>   CS_ISO8859_13 = 12,
>   ...
>   CS_ISO8859_16 = 15,
> 
>   /* Windows singlebyte codepages */
>   CS_CP437 = 100,
>   CS_CP720 = 101,
>   CS_CP737 = 102,
>   ...
> 
>   /* Multibyte codepages */
>   CS_SJIS = 200,
>   CS_GBK = 201,
>   ...
> }

But what is that good for?  What advantage does it give you?  If you
only keep the number, where do you get the charset name from?

Btw., while I was writing the above, it occurred to me that we
don't really need the index into the iso or cp array.  What we
really need is a pointer to the array member, which can be used
immediately.  For instance, instead of this code in __iso_wctomb:

_DEFUN (__iso_wctomb, (r, s, wchar, charset, state)
  [...]
  int iso_idx = __iso_8859_index (charset + 9);
  [...]
  if (__iso_8859_conv[iso_idx][mb] == wchar)
  [...]

we would just have

_DEFUN (__iso_wctomb, (r, s, wchar, charset_conv_ptr, state)
  [...]
  if (charset_conv_ptr[mb] == wchar)
  [...]

The charset_conv_ptr would be NULL for all charsets which don't use
charset tables.
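
Spelled out as a self-contained sketch, the pointer variant could look
roughly like this.  The sketch_* names, the four-entry table and the
simplified signature (no reent, no state) are made up for the example;
the real __iso_8859_conv rows are longer, and the real function has
more cases to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Toy stand-in for one row of a conversion table: entry i holds the
   wide character for byte 0xa0 + i.  These are the first four
   ISO-8859-2 values; the length is cut down for the example.  */
#define SKETCH_TABLE_LEN 4

static const wchar_t sketch_iso_8859_2[SKETCH_TABLE_LEN] =
  { 0xa0, 0x104, 0x2d8, 0x141 };

/* Convert WCHAR to a single byte in *S.  CHARSET_CONV_PTR points
   directly at the table row to use, or is NULL for charsets without
   a table.  Returns 1 on success, -1 if WCHAR can't be represented.  */
int
sketch_sb_wctomb (char *s, wchar_t wchar, const wchar_t *charset_conv_ptr)
{
  unsigned long wc = (unsigned long) wchar;
  size_t mb;

  assert (s != NULL);

  /* Everything below 0xa0 maps to itself.  */
  if (wc < 0xa0)
    {
      *s = (char) wc;
      return 1;
    }

  /* No index lookup by charset name; the row is used immediately.  */
  if (charset_conv_ptr != NULL)
    for (mb = 0; mb < SKETCH_TABLE_LEN; ++mb)
      if (charset_conv_ptr[mb] == wchar)
        {
          *s = (char) (0xa0 + mb);
          return 1;
        }

  /* Not in the table, or no table at all.  */
  return -1;
}
```

With the toy table, sketch_sb_wctomb (&c, 0x141, sketch_iso_8859_2)
stores byte 0xa3, while the same call with a NULL pointer fails, since
there's no table to search.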

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat