This is the mail archive of the cygwin-developers mailing list for the Cygwin project.


Re: "C" character set (again)

From: Thomas Wolff <towo at towo dot net>
To: cygwin-developers at cygwin dot com
Date: Fri, 08 Jan 2010 16:53:40 +0100
Subject: Re: "C" character set (again)
References: <416096c60912282254r7230cbaeiad6b3432f7c15257@mail.gmail.com> <4B3A0379.1090508@byu.net> <416096c60912290611y1a946525r4af6b2727b3725a7@mail.gmail.com> <20100107100611.GA23972@calimero.vinschen.de> <4B461D6B.10503@gmail.com> <416096c61001071246u38e53bb6neae901ab1d8928b1@mail.gmail.com> <4B471307.3000704@towo.net> <20100108114140.GA23992@calimero.vinschen.de> <4B4743D2.1020401@towo.net> <4B474AB1.1000402@byu.net>

Eric Blake wrote:

According to Thomas Wolff on 1/8/2010 7:40 AM:
While Andy had a valid point in noting that *format* is described as a "character string" and relating that to the generic POSIX definition of character, this certainly does not justify the current behaviour of silently dropping and reporting partial success, because that is not one of the options in the "RETURN VALUE" section. Also, I don't see what Andy's claim "Including invalid bytes in the format string is undefined behaviour." is based on.

Part of my point was to work this out more precisely, so please let me stir around once more:

Per POSIX, printf is only defined if you pass a valid character string as the format.

Based on what? The manpage explicitly lists a number of cases where "results are undefined" (e.g. insufficient arguments), but an invalid character in the format string is *not* among them, unless you count the EILSEQ clause towards that, which I wouldn't because the format string is not a wide character string. So this reasoning can only rest on deduction from general descriptions (which use the word "character") combined with generic definitions elsewhere in POSIX. That is a weaker argument, however, and not in line with the general tendency of POSIX to describe the API clearly. It also has to be weighed against the "RETURN VALUE" and "ERRORS" sections, which *are* described clearly. So *if* one could sustain the view that invalid multi-byte characters in the format need not be handled transparently, printf would be required to return -1 in that case and set errno=EILSEQ.
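
To make the "return -1 and set errno=EILSEQ" alternative concrete, here is a minimal sketch of a wrapper that refuses a format string containing an invalid multibyte sequence instead of printing part of it. The names format_is_valid_mbs and strict_printf are hypothetical, chosen for illustration only; this is not how Cygwin or newlib implement printf, and what counts as "invalid" depends on the current LC_CTYPE locale.

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if s is a valid multibyte string in the
   current LC_CTYPE locale, 0 if it holds an invalid or incomplete sequence. */
int format_is_valid_mbs(const char *s)
{
    mbstate_t st;
    size_t left = strlen(s);

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    while (left > 0) {
        size_t n = mbrtowc(NULL, s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return 0;                   /* invalid or truncated sequence */
        if (n == 0)
            break;                      /* defensive: embedded NUL decoded */
        s += n;
        left -= n;
    }
    return 1;
}

/* Hypothetical wrapper: all-or-nothing behaviour for the format string.
   Either the format is accepted and handed to vprintf unchanged, or nothing
   is written and the caller sees -1 with errno set to EILSEQ. */
int strict_printf(const char *fmt, ...)
{
    va_list ap;
    int ret;

    if (!format_is_valid_mbs(fmt)) {
        errno = EILSEQ;
        return -1;
    }
    va_start(ap, fmt);
    ret = vprintf(fmt, ap);
    va_end(ap);
    return ret;
}
```

The point of the sketch is only the contract: a caller that checks the return value never sees a "partial success" that silently lost bytes from the format.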

If you pass an 8-bit value but it is not a character, then
you did not pass a valid format string.

That may be the case (Andy and you have almost convinced me by now), but still, see above: it doesn't mean "completely undefined".

That's why your behavior was undefined, and so ANYTHING can happen

No, not quite anything. APIs are not pure maths; the Return Value conditions still have to be met.

...

My opinion is that it would still be nice to keep "C" in the UTF-8 charset
(to encourage people to fix their programs that do not comply with POSIX
rules about the C locale), but to fix 8-bit transparency issues in as many
APIs as possible (such as printf) so that invalid characters are at least
still handled as transparently-clean 8-bit bytes.

Yes, please.
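
As a rough illustration of what "8-bit transparent" handling could look like, the sketch below walks a string with mbrtowc and, whenever a sequence is invalid or incomplete in the current locale, passes exactly one raw byte through and restarts decoding. The function name write_transparent is hypothetical, and this is an assumption about one possible approach, not the actual fix that went into any printf implementation.

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: writes every byte of s to out, never dropping any.
   Returns the number of units seen, where a unit is either one valid
   multibyte character or one stray byte that could not be decoded. */
size_t write_transparent(const char *s, FILE *out)
{
    mbstate_t st;
    size_t left = strlen(s);
    size_t units = 0;

    memset(&st, 0, sizeof st);
    while (left > 0) {
        size_t n = mbrtowc(NULL, s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2 || n == 0) {
            /* Not a character here: treat one byte as opaque data and
               reset the conversion state, which is unspecified after
               an error. */
            n = 1;
            memset(&st, 0, sizeof st);
        }
        fwrite(s, 1, n, out);           /* bytes go out unchanged */
        s += n;
        left -= n;
        units++;
    }
    return units;
}
```

With this approach the output is always byte-for-byte identical to the input; the locale only affects how the bytes are grouped into units.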

In the long run, sticking with the UTF-8 charset will only be doing users a favor, even if we end up having to point people to the FAQ about locale implications.

With this, I fully agree.

------
Thomas

Follow-Ups:
- Re: "C" character set (again)
  - From: Eric Blake

References:
- Re: "C" character set (again)
  - From: Corinna Vinschen
- Re: "C" character set (again)
  - From: Dave Korn
- Re: "C" character set (again)
  - From: Andy Koppe
- Re: "C" character set (again)
  - From: Thomas Wolff
- Re: "C" character set (again)
  - From: Corinna Vinschen
- Re: "C" character set (again)
  - From: Thomas Wolff
- Re: "C" character set (again)
  - From: Eric Blake
