representing charsets

Wed Mar 31 08:35:00 GMT 2010

On Mar 31 06:53, Andy Koppe wrote:
> Corinna Vinschen:
> > Andy Koppe:
> >> 3) Represent charsets as enum constants (or #defines) rather than
> >> strings throughout, with the singlebyte charsets ordered in such a way
> >> that they correspond to their order in the conversion tables, along
> >> these lines:
> >>
> >> enum {
> >>   CS_UTF8 = 0,
> >>
> >>   /* ISO singlebyte codepages */
> >>   CS_ISO8859_1 = 1,
> >>   CS_ISO8859_2 = 2,
> >>   ...
> >>   CS_ISO8859_11 = 11,
> >>   /* ISO-8859-12 doesn't exist */
> >>   CS_ISO8859_13 = 12,
> >>   ...
> >>   CS_ISO8859_16 = 15,
> >>
> >>   /* Windows singlebyte codepages */
> >>   CS_CP437 = 100,
> >>   CS_CP720 = 101,
> >>   CS_CP737 = 102,
> >>   ...
> >>
> >>   /* Multibyte codepages */
> >>   CS_SJIS = 200,
> >>   CS_GBK = 201,
> >>   ...
> >> Â  ...
> >> }
> >
> > But what is that good for?  Which advantage do you have?
> 
> - No need to pass around both charset name and the charset table index.
> - The __cp_index and __iso8859_index functions can be junked.
> __cp_mbtowc/wctomb obtain the index with (cs_id - CS_CP437). Similar
> for ISO.
> - Only one list of valid codepages (since the one in __cp_index can go).
> - Get rid of the hack where the likes of KOI8-R or PT154 are
> internally represented as "CPxxx" names, some of which don't actually
> correspond to Windows codepages.
> - All those strcpy() calls in setlocale become simple assignments,
> e.g. charset_id = CS_EUCJP instead of strcpy(charset, "EUCJP"). Not
> relevant performance-wise, but in terms of space (for embedded
> targets).
> - Similarly, charset comparisons become simple integer comparisons
> instead of strcmps.
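
The index arithmetic described in the list above can be sketched roughly
as follows.  This is only an illustration under assumptions: the enum is
cut down to a handful of entries, and CS_CP_LAST, CS_ISO_LAST,
cp_table_index, iso_table_index and charset_same are hypothetical names
that appear in neither patch.

```c
/* Cut-down, hypothetical version of the proposed enum. */
enum charset_id {
  CS_UTF8 = 0,
  /* ISO singlebyte codepages */
  CS_ISO8859_1 = 1,
  CS_ISO8859_2 = 2,
  /* Windows singlebyte codepages */
  CS_CP437 = 100,
  CS_CP720 = 101,
  CS_CP737 = 102,
  /* Multibyte codepages */
  CS_SJIS = 200,
  CS_GBK = 201
};

/* Hypothetical range limits, valid for this cut-down enum only. */
#define CS_ISO_LAST CS_ISO8859_2
#define CS_CP_LAST  CS_CP737

/* Conversion-table index for a Windows singlebyte charset,
   or -1 if cs_id is outside that range. */
int
cp_table_index (int cs_id)
{
  if (cs_id < CS_CP437 || cs_id > CS_CP_LAST)
    return -1;
  return cs_id - CS_CP437;
}

/* Same idea for the ISO-8859 tables. */
int
iso_table_index (int cs_id)
{
  if (cs_id < CS_ISO8859_1 || cs_id > CS_ISO_LAST)
    return -1;
  return cs_id - CS_ISO8859_1;
}

/* A charset comparison becomes an integer comparison. */
int
charset_same (int a, int b)
{
  return a == b;
}
```

The subtraction only works if the enum order matches the order of the
conversion tables exactly, so the two would have to be maintained
together.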

Hmm, ok.

> > If you
> > only keep the number, where do you get the charset name from?
> 
> A new function, e.g. 'void __get_charset_name(int cs_id, char *buf)',
> where a buffer of size ENCODING_LEN+1 needs to be passed in.
> nl_langinfo(CODESET) would simply call that instead of doing its own
> strcmp-heavy parsing of internal names to turn them back into official
> names.
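
A minimal sketch of what such a name lookup might look like, assuming
the numbering from the enum quoted earlier.  The function name
get_charset_name, the cut-down tables, and the value given to
ENCODING_LEN here are assumptions for illustration and are not taken
from either patch.

```c
#include <stdio.h>
#include <string.h>

#define ENCODING_LEN 31   /* assumed value for this sketch */

/* Cut-down, hypothetical subset of the proposed charset IDs. */
enum {
  CS_UTF8 = 0,
  CS_ISO8859_1 = 1,
  CS_ISO8859_11 = 11,
  CS_ISO8859_13 = 12,   /* ISO-8859-12 doesn't exist */
  CS_ISO8859_16 = 15,
  CS_CP437 = 100,
  CS_CP720 = 101,
  CS_CP737 = 102,
  CS_SJIS = 200,
  CS_GBK = 201
};

/* Codepage numbers in table order; only three entries in this sketch. */
static const int cp_numbers[] = { 437, 720, 737 };
#define CP_COUNT ((int) (sizeof cp_numbers / sizeof cp_numbers[0]))

/* Write the charset name for cs_id into buf, which must hold at least
   ENCODING_LEN + 1 bytes.  Unknown IDs yield an empty string. */
void
get_charset_name (int cs_id, char *buf)
{
  buf[0] = '\0';
  if (cs_id == CS_UTF8)
    strcpy (buf, "UTF-8");
  else if (cs_id >= CS_ISO8859_1 && cs_id <= CS_ISO8859_16)
    {
      /* IDs above 11 are shifted by one because part 12 is skipped. */
      int part = cs_id <= CS_ISO8859_11 ? cs_id : cs_id + 1;
      snprintf (buf, ENCODING_LEN + 1, "ISO-8859-%d", part);
    }
  else if (cs_id >= CS_CP437 && cs_id < CS_CP437 + CP_COUNT)
    snprintf (buf, ENCODING_LEN + 1, "CP%d", cp_numbers[cs_id - CS_CP437]);
  else if (cs_id == CS_SJIS)
    strcpy (buf, "SJIS");
  else if (cs_id == CS_GBK)
    strcpy (buf, "GBK");
}
```

A caller would pass a buffer declared as char name[ENCODING_LEN + 1].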

Actually the codesets for all LC_FOO categories are supposed to be stored
in the LC_FOO data structure soon.  So the call to __get_charset_name
should be performed in the __FOO_load_locale functions.

Before you start I'd like to apply my patch from
http://sourceware.org/ml/newlib/2010/msg00221.html first.  This
already contains a change to nl_langinfo, which just fetches the
charset from the locale info.  At least at this point your patch and
mine would clash.  With my patch, you only have to change the
__FOO_load_locale functions, but not nl_langinfo anymore.

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat