This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.



RE: [PATCH/RFA] Distinguish between EOF and character with value 0xff


Summary:  Yes, we should do this, but there is still a possible problem
if programmers are not careful, and the library can do nothing to avoid
that potential problem; this really only trades one problem for another.
 
Details:
It is impossible for the character functions to differentiate between
EOF==-1 and 0xFF when the character comes from a signed char, because
0xFF really is -1 as a signed char and gets promoted to int -1 when
passed.  That is, there is a vicious catch-22 here: either
(signed char)255 is misidentified, or EOF==-1 is, because the functions
receive -1 in either case.  The only way to completely fix the problem
would be to disallow signed char--which is to say that the problem
cannot be absolutely fixed by the library.  (I suppose the other way it
theoretically could have been fixed is for the value 0xFF to be
disallowed in character sets, but since you said that it is used, that
opportunity is gone.)  Put another way, this patch trades the possible
misidentification of EOF for the possible misidentification of 0xFF
when signed char is used.

The problem also applies to the macros as they currently are defined.
For example:

#define isalpha(c)      ((__ctype_ptr__)[(unsigned)((c)+1)]&(_U|_L))

If c is a signed char holding 0xFF, its value is really -1, so c+1
becomes 0 and is then converted to unsigned--the result looks exactly
the same as for EOF (the only difference being the starting point:
int c=-1 vs. signed char c=-1).
 
An example showing how the problem could surface:

#include <stdio.h>
#include <ctype.h>
...
int i, len;
signed char buf[80];	// or plain "char" when char defaults to signed
FILE *fp = fopen(...);
[change to new locale with 8-bit characters in which 0xFF is legal and
 considered a control character]
len = fread(buf, sizeof(buf[0]), sizeof(buf), fp);
for (i = 0; i < len; i++)  if (iscntrl(buf[i])) something(i);

The iscntrl() call will fail to call something() when it gets 0xFF.
If the program used "unsigned char buf[80]", there would be no problem.
(Yes, the macro is also OK for unsigned char c, because c gets promoted
to int for the addition, so 0xFF+1 becomes 0x100--0xFF+1 does not wrap
to 0.)
 
It is impossible to avoid the problem for the actual functions.
We could perhaps avoid it for the macros, but it would be very messy
even if possible.  The question is whether we attempt to do anything
with the macros or leave them as they are (having the macros behave
differently from the functions is not a good thing).  This potential
problem has always existed, even before multiple locales were worked
on.  Programmers should know to use unsigned char for strings if
there's any chance of non-ASCII values.  Should any attempt be made to
make it cleaner for those who don't?  At best that can only be a
partial fix, because the functions cannot be fixed.  But if we count on
programmers to use unsigned char only, then there is no real reason to
allow the -128 to -2 values at all.  (Allowing them could even be
considered a disservice, as it makes the problem less likely to happen
and therefore harder to find.  No, I'm not suggesting that we back that
part out; I'm simply pointing out that it is of questionable overall
value.)

Craig

-----Original Message-----
From: newlib-owner@sourceware.org [mailto:newlib-owner@sourceware.org]
On Behalf Of Corinna Vinschen
Sent: Tuesday, April 21, 2009 2:08 PM
To: newlib@sourceware.org
Subject: [PATCH/RFA] Distinguish between EOF and character with value
0xff

Hi,


There's a bug in the new character class tables for Windows and ISO
charsets.

To support signed chars, the table entries for the negative values
-128..-1 are identical to those for the positive values 128..255.  Many
of these character sets have a valid character at position 255, so some
functions return a non-0 value not only for the unsigned char value
255, but also for the equivalent signed char value -1.  Unfortunately,
this potentially breaks applications which use the EOF value as
argument to the ctype functions.  They expect that the functions always
return 0, but in the current implementation they don't.

The below patch fixes that.  It splits off the value for char 255 from
the rest of the definition, so that the actual character class tables
can return a different value for the unsigned char value 255 than
for -1.

For instance, the former definition for the ISO-8859-1 table looked
like this:

    { _CTYPE_ISO_8859_1_128_256,
      _CTYPE_DATA_0_127,
      _CTYPE_ISO_8859_1_128_256
    },

The new definition now looks like this:

    { _CTYPE_ISO_8859_1_128_254,
      0,
      _CTYPE_DATA_0_127,
      _CTYPE_ISO_8859_1_128_254,
      _CTYPE_ISO_8859_1_255
    },

While I was at it I also took the liberty of renaming _CTYPE_DATA_128_256
to _CTYPE_DATA_128_255, which is more correct since the definition
contains the character values 128..255, not 128..256.

Ok to apply?


Thanks,
Corinna

