This is the mail archive of the
newlib@sourceware.org
mailing list for the newlib project.
RE: [PATCH/RFA] Internationalize ctype functionality
- From: "Howland Craig D (Craig)" <howland at LGSInnovations dot com>
- To: <newlib at sourceware dot org>
- Date: Thu, 26 Mar 2009 21:55:17 -0400
- Subject: RE: [PATCH/RFA] Internationalize ctype functionality
- References: <20090326210123.GS12738@calimero.vinschen.de>
1) Wouldn't it be cleaner, especially in files in which it happens more
than once, to replace things like:
#ifdef __CYGWIN__
char __declspec(dllexport) *__ctype_ptr__ = _ctype_b + 127;
#else
char *__ctype_ptr__ = _ctype_b + 127;
#endif
with:
#ifdef __CYGWIN__
#define DLLEXPORT __declspec(dllexport)
#else
#define DLLEXPORT
#endif
char DLLEXPORT *__ctype_ptr__ = _ctype_b + 127;
(given that the only difference between the lines is the dll attribute)?
This would not only make ctype_.c more readable, but more maintainable.
2) I don't entirely understand the following, possibly due to my lack
of knowledge on the topic:
>- The toupper and tolower functions are now charset independent. If the
> character is > 0x7f, it will be converted to wide char and then
> towupper/towlower is called on it.
> This is only a temporary solution. It works, but it's a bit sedated
> for native characters. In the long run we should rather add
> upper/lower-case transformation tables, similar to the new ctype
> character class tables.
toupper and tolower operate on regular characters, which have a defined
range of unsigned-char-allowed-values and EOF. How can it work to
change it to a wide character except in the degenerate case when wide
characters are the same width as regular characters? That is, should
it be gated by a check that MB_CUR_MAX == 1? It seems dangerous to
try otherwise. What if lowercase ran from 0xE6-0xFF and uppercase
were 0x100-0x119? Or even worse, if lc was 0xE5-0xFE and uc was
0xFF-0x118? So you could convert 'a' but not 'b' through 'z'? (Where
it is unlikely that a and z are actually the first and last letters,
but I have to use something for the sake of the example.) Or is there
a priori knowledge of all the character sets being applied that says
this would be OK?
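To make the concern concrete, here is a rough sketch of a check that could be run under each supported charset. The helper name count_unrepresentable_upper is hypothetical and not part of the patch; it assumes a hosted C library providing mbtowc, wctomb and towupper. It counts the bytes in 0x80-0xFF that convert to a wide character but whose uppercase form has no one-byte encoding in the current locale. (As far as I know, ISO-8859-1 has such a case: 0xFF is y-with-diaeresis, whose Unicode uppercase is U+0178, outside the charset.)

```c
#include <limits.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, not from the patch: count the bytes in 0x80..0xFF
   that are complete one-byte characters in the current locale but whose
   uppercase form cannot be encoded back into a single byte. */
int
count_unrepresentable_upper (void)
{
  char in[2];
  char out[MB_LEN_MAX + 1];
  wchar_t wc;
  int b;
  int count = 0;

  for (b = 0x80; b <= 0xff; b++)
    {
      /* Reset any shift state left over from a failed conversion. */
      (void) mbtowc (NULL, NULL, 0);
      (void) wctomb (NULL, 0);

      in[0] = (char) b;
      in[1] = '\0';
      if (mbtowc (&wc, in, 1) != 1)
        continue;               /* not a one-byte character here */
      if (wctomb (out, (wchar_t) towupper ((wint_t) wc)) != 1)
        count++;                /* uppercase form needs more than one byte */
    }
  return count;
}
```

A non-zero result in some locale would mean there are characters that toupper silently leaves unconverted there, which is the partial-conversion situation described above.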
Does it make sense to try at all unless MB_LEN_MAX == 1? (The user
should be using wide characters, not normal, if MB_CUR_MAX can be > 1,
shouldn't they?) In this case, "#ifdef _MB_CAPABLE" becomes
"#if defined(_MB_CAPABLE) && MB_LEN_MAX == 1".
If this feature is kept, I suggest that
char s[8] = { c, '\0' };
be changed to:
char s[MB_LEN_MAX+1] = { c, '\0' };
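As a minimal sketch of the two suggestions together (a run-time gate on MB_CUR_MAX and a buffer sized from MB_LEN_MAX), something along these lines might do. Both helper names are hypothetical and not from the patch, and this is only one way it could be arranged.

```c
#include <limits.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: non-zero when every character in the current
   locale's charset occupies exactly one byte. */
int
charset_is_single_byte (void)
{
  return MB_CUR_MAX == 1;
}

/* Hypothetical helper: return the one-byte encoding of wc, or fallback
   if the charset is multibyte or wc has no one-byte encoding. */
int
wc_to_single_byte (wchar_t wc, int fallback)
{
  /* wctomb may store up to MB_CUR_MAX bytes, and MB_CUR_MAX never
     exceeds MB_LEN_MAX, so this size is always sufficient. */
  char s[MB_LEN_MAX + 1];
  int len;

  if (!charset_is_single_byte ())
    return fallback;
  len = wctomb (s, wc);
  return len == 1 ? (unsigned char) s[0] : fallback;
}
```

With helpers like these, toupper could pass the original character as the fallback, so that anything that cannot be converted safely is returned unchanged.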
3) (both toupper.c and tolower.c do this)
+#ifdef _MB_CAPABLE
+ if ((unsigned char) c <= 0x7f)
+ return isupper (c) ? c - 'A' + 'a' : c;
+
+ char s[8] = { c, '\0' };
+ wchar_t wc;
+ if (mbtowc (&wc, s, 1) >= 0
+ && wctomb (s, (wchar_t) towlower ((wint_t) wc)) == 1)
+ c = s[0];
+ return c;
+#else
+ return isupper(c) ? (c) - 'A' + 'a' : c;
+#endif
The char s[8] and wchar_t lines are declarations that follow a statement
within the block, so they will not compile unless the compiler accepts
C99-style mixed declarations and code. Does Newlib assume
(require) C99 compilers? (I hope so, but don't think so.)
(Interestingly enough, I tried this with gcc 3.4.4 in Cygwin with
-std=c89, and it actually allowed it. But I know that it will fail
with some gcc flavors even without -std=c89, as I just had it happen
yesterday on a cross compiler.)
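For illustration, a C89-friendly arrangement could look like the sketch below. The function name tolower_c89 is hypothetical, and the conversion logic is copied unchanged from the patch's _MB_CAPABLE branch; only the declarations move to the top of the block. The array is filled by assignment rather than by an initializer, because strict C89 also requires the expressions in an aggregate initializer list to be constants.

```c
#include <ctype.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical name; same logic as the patch, with C89-style layout. */
int
tolower_c89 (int c)
{
  /* All declarations come before the first statement. */
  char s[8];
  wchar_t wc;

  if ((unsigned char) c <= 0x7f)
    return isupper (c) ? c - 'A' + 'a' : c;

  /* Assigned here because { c, '\0' } is not a constant initializer. */
  s[0] = (char) c;
  s[1] = '\0';
  if (mbtowc (&wc, s, 1) >= 0
      && wctomb (s, (wchar_t) towlower ((wint_t) wc)) == 1)
    c = s[0];
  return c;
}
```

This should build with -std=c89 -pedantic on compilers that are strict about declaration placement.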
Craig