This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.
Handling numbers input/output in glibc

From: Behdad Esfahbod <behdad at cs dot toronto dot edu>
To: libc-alpha at sources dot redhat dot com
Date: Sat, 10 Jan 2004 09:39:42 -0500
Subject: Handling numbers input/output in glibc
Sorry if you receive this mail twice.  Some people told me that
it may reach more people here.  Replies to the originally CCed
people are appreciated.

behdad

---------- Forwarded message ----------
Date: Sat, 10 Jan 2004 09:02:16 -0500 (EST)
From: Behdad Esfahbod <behdad@cs.toronto.edu>
To: bug-glibc@gnu.org
Cc: Hamed Malek <hamed@bamdad.org>, Roozbeh Pournader <roozbeh@sharif.edu>,
     Markus Kuhn <mgk25@cl.cam.ac.uk>, Keld Jørn Simonsen <keld@dkuug.dk>
Subject: On numbers or how I learnt not to drink but get drunk...

Hi,


This mail discusses my understanding of the C99[1] and Locale[2]
documents, how glibc currently handles numerical data
input/output, and how it should handle it.

Problem statement:  In the Persian (fa_IR) locale, we like to read
and write numbers with Persian numerals (U+06F0..U+06F9).

=======================

Output:

Currently glibc has implemented the "I" flag such that
printf("%Id", ...) prints a number using characters that are
defined in the "outdigit" tag of the LC_CTYPE locale category.
A patch for printf to do the same thing for floats ("%If") has
been recently applied.  By the way, there is currently no way to
define an equivalent for the "E" character used in "%e" format.
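
To make the intended output concrete, here is a minimal sketch of
the digit substitution itself.  It is not how glibc implements
"%Id": glibc takes the digits from the locale's "outdigit" data,
while this sketch hard-codes the Persian digits as UTF-8 bytes.
The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Rewrite ASCII digits in `in` as Persian digits U+06F0..U+06F9
   (UTF-8 bytes 0xDB 0xB0..0xB9), copying other bytes unchanged.
   Returns the number of bytes written, not counting the NUL,
   or -1 if `out` is too small. */
int to_persian_digits(const char *in, char *out, size_t outsz)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        if (*in >= '0' && *in <= '9') {
            if (o + 2 >= outsz)
                return -1;          /* need 2 bytes plus the NUL */
            out[o++] = (char)0xDB;
            out[o++] = (char)(0xB0 + (*in - '0'));
        } else {
            if (o + 1 >= outsz)
                return -1;          /* need 1 byte plus the NUL */
            out[o++] = *in;
        }
    }
    if (o >= outsz)
        return -1;
    out[o] = '\0';
    return (int)o;
}

/* Format an int with plain "%d", then substitute the digits. */
int format_int_persian(int value, char *out, size_t outsz)
{
    char ascii[32];

    snprintf(ascii, sizeof ascii, "%d", value);
    return to_persian_digits(ascii, out, outsz);
}
```

The sign and any other non-digit bytes pass through unchanged,
which is also why the separators need separate treatment.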

Related keywords from the locale model are "outdigit" in LC_CTYPE
and "decimal_point" and "thousands_sep" in LC_NUMERIC.  The
reason that "outdigit" is defined in LC_CTYPE is that according
to the document, outdigit "Define the characters to be classified
as decimal digits for output from an application, such as to a
printer or a display or a output text file.  Decimal digits
corresponding to the values <0>, <1>, ..., <8>, and <9> can be
specified, and in ascending order of the values they represent.
The intended use is for all places where decimal digits are used
for output, including numeric and monetary formatting, and date
and time formatting.  Only one set of 10 decimal digits may be
specified.  If this keyword is not specified, the decimal digits
0 through 9 of the portable character set automatically belong to
this class, with application-defined character values.  The
keyword may be omitted."

By now it should be clear from the definition that the "outdigit"
digits should be used for output formatted by "%d".  In other
words, the "%Id" behaviour should become the default.  The same
goes for the "%If" family.

Note that the current implementations of "%d" and "%f" are buggy,
as they follow the locale for "decimal_point" and "thousands_sep",
but not for "outdigit".  Unlike what I used to think, "outdigit" is
NOT a second set of digits, but the locale's set of digits.  So,
when we define them in the Persian locale, it means that we do not
want to see Latin numerals in output from the C library.

The change in glibc would only affect the Persian locale, as it is
the only one that currently defines "outdigit".  And do not worry
about the Persian locale :).


Proposed changes/extensions to glibc:

- Make "%Id" behaviour default.  Same for "%If".

- Deprecate "I" flag.

- Add a flag "P" that outputs numbers as in the POSIX locale.
Then "%Pd" can be used to get a number with ASCII numerals and
the POSIX thousands separator.  This is quite useful when
writing to configuration files and other files that should not be
affected by the user's locale.

- Handle the following keyword in LC_NUMERIC:

  decimal_exp   The operand is a string containing the symbol
that is used as the decimal exponent sign as used in -1.234e+56.

This should be added to "struct lconv" in locale.h too.
Note that in Persian we use "*10^", written with Persian numerals
and with "\times" in place of "*".

- Equivalents for "NaN"?


These should hopefully solve the output problem.
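
The "P" flag above is only a proposal.  For comparison, here is a
sketch of how a program can already get locale-independent output
today with the POSIX.1-2008 per-thread locale functions
(newlocale, uselocale, freelocale).  The function name
format_double_posix is made up for this example, and error
handling is kept minimal.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for newlocale/uselocale */
#endif
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdio.h>

/* Format `value` with "%g" under the "C" (POSIX) locale for the
   duration of this call only, whatever the user's locale is.
   Returns what snprintf returns, or -1 if the "C" locale object
   cannot be created. */
int format_double_posix(char *buf, size_t bufsz, double value)
{
    locale_t c_loc, old;
    int n;

    assert(buf != NULL);
    c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_loc == (locale_t)0)
        return -1;

    old = uselocale(c_loc);      /* switch this thread to "C" */
    n = snprintf(buf, bufsz, "%g", value);
    uselocale(old);              /* restore the previous locale */
    freelocale(c_loc);
    return n;
}
```

This is much more verbose than a "%Pd"-style flag would be, which
is part of the motivation for the proposal.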

===========================

Input:

Much like printf, glibc has implemented the "I" flag for "%d" in
scanf, such that "%Id" parses a number presented in any set of
digits of the locale.  These digits can be identified by the
iswdigit() function.
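
As an illustration of the parsing side, here is a small sketch
that accepts either ASCII digits or Persian digits in a UTF-8
string.  It is not glibc's implementation: the real "%Id" would
consult the locale's digit data, while this sketch hard-codes the
range U+06F0..U+06F9, and the name parse_long_any_digits is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Parse an optionally signed decimal integer written with ASCII
   digits and/or Persian digits (UTF-8 bytes 0xDB 0xB0..0xB9).
   Stores the value in *result and returns 0 on success; returns
   -1 if there is no digit or an unexpected byte is found.
   Overflow is not checked in this sketch. */
int parse_long_any_digits(const char *s, long *result)
{
    const unsigned char *p = (const unsigned char *)s;
    long value = 0;
    int negative = 0;
    int ndigits = 0;

    assert(s != NULL && result != NULL);
    if (*p == '-' || *p == '+') {
        negative = (*p == '-');
        p++;
    }
    while (*p != '\0') {
        int d;

        if (*p >= '0' && *p <= '9') {
            d = *p - '0';
            p += 1;
        } else if (p[0] == 0xDB && p[1] >= 0xB0 && p[1] <= 0xB9) {
            d = p[1] - 0xB0;     /* Persian digit, two bytes */
            p += 2;
        } else {
            return -1;
        }
        value = value * 10 + d;
        ndigits++;
    }
    if (ndigits == 0)
        return -1;
    *result = negative ? -value : value;
    return 0;
}
```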

Proposed changes/extensions to glibc:

- Implement "I" flag for "%f" family.

- Make "%Id" behaviour default.  Same for "%If".

- Deprecate "I" flag.

- Having a flag "P" that parses numbers in POSIX locale is
proposed again.  This is quite useful when reading configuration
files that use Latin numerals instead of user's locale numerals.

- Handle the "decimal_exp" keyword when reading floats.



There is still another problem, in the data files.  For scanf to
work with different digit sets, they should be defined in the
"digit" keyword under LC_CTYPE.  There is a small problem in
/usr/share/i18n/locales/i18n (on a Red Hat system; I don't know
how else to refer to this file).  This file defines all
non-ASCII number characters as letters ("alpha") instead of
"digit", and gives this reason:

% The non-ASCII number characters are included here because ISO C 99    /
% forbids us to classify them as digits; however, they behave more like /
% alphanumeric than like punctuation.

An implication of this definition is that locale definitions that
include this file ('copy "i18n"') cannot define any set of
digits, as it would conflict with the "alpha" definitions of the
same characters.

By the way, I suspect this comes from a misunderstanding of the
C99 standard, as I cannot verify the claim:

C99 section 7.4.1.5 "The isdigit function" paragraph 2 says:
"The isdigit function tests for any decimal-digit character (as
defined in 5.2.1)".

But section 5.2.1 defines the source and execution character
sets, and divides each into two sets: a basic character set,
whose contents are given by that subclause, and a set of zero or
more locale-specific members (which are not members of the basic
character set) called extended characters.  It requires the
characters in the basic set to be represented by one byte, but
does not require this of extended characters.  So I cannot see
where it forbids us to classify non-ASCII number characters as
digits.
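
One way to see how a given locale currently classifies these
characters is to ask iswdigit() and iswalpha() directly.  The
sketch below does that; the function names are made up, the
locale name is whatever the caller passes in, and using 0x06F1 as
a wchar_t value assumes that wchar_t holds Unicode code points,
as it does in glibc.  No output is shown because the result
depends on the installed locale data.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Report how the current LC_CTYPE classifies one wide character. */
const char *classify_wchar(wint_t wc)
{
    if (iswdigit(wc))
        return "digit";
    if (iswalpha(wc))
        return "alpha";
    return "other";
}

/* Switch LC_CTYPE to `locale_name` and print the classification
   of ASCII '1' and of U+06F1 (EXTENDED ARABIC-INDIC DIGIT ONE). */
void report_classification(const char *locale_name)
{
    assert(locale_name != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL) {
        printf("locale %s is not available\n", locale_name);
        return;
    }
    printf("U+0031: %s\n", classify_wchar(L'1'));
    printf("U+06F1: %s\n", classify_wchar((wint_t)0x06F1));
}
```

Calling report_classification("fa_IR.UTF-8") on a system with
that locale installed would show which class U+06F1 falls in.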

The proposed fix should be trivial.  It would magically enable
everyone to enter numbers in their own local digits.  Wow!

========================

Date/Time:

The definition of "outdigit" says that the same set of digits
should be used in Date/Time formatting.  This should be
implemented.  By the way, the "alt_digits" keyword from LC_TIME
should override this behaviour.
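
For reference, strftime already has a hook for "alt_digits": the
"O" modifier, as in "%Od" for the day of the month.  The sketch
below uses only that existing modifier; it does not implement the
"outdigit" fallback proposed here.  The function name
format_day_alt is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Format a day of the month with "%Od", which asks strftime to
   use the locale's alternative digits from LC_TIME.  In a locale
   without alternative digits the modifier is ignored and the
   result is the same as for "%d".
   Returns the number of bytes written, as strftime does. */
size_t format_day_alt(char *buf, size_t bufsz, int mday)
{
    struct tm tm = {0};

    assert(buf != NULL && mday >= 1 && mday <= 31);
    tm.tm_mday = mday;
    return strftime(buf, bufsz, "%Od", &tm);
}
```

Under the proposal, plain "%d" in strftime would use "outdigit",
and "%Od" would keep preferring "alt_digits" where defined.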

========================

Feasibility:

I assume that because of the internals of glibc, the proposed
fixes to printf and scanf would fix other related functions like
strtol, strtod, and so on.

The proposed fixes (implementing "I" for "%f" in scanf,
deprecating "I", and making its behaviour the default) are one
step toward conforming to the two standards.  The fix in the
"i18n" file is too.

As for things that may break, the scanf change does not break
anything that is already working.  Neither does the printf
change.  The only negative outcome would be that one cannot get
a number with Latin digits from printf in a Persian locale.
But this is not a problem for a couple of reasons.  First, the
current implementation uses the Persian (locale) thousands_sep
with Latin digits, which is not correct.  And second, writing
numbers with ASCII digits is easy.  Don't worry :).

=========================

Alternate solution:

For a few weeks now, we have been following another approach to
the problem.  That was to add another set of thousands_sep and
decimal_point that go with outdigits, and keep the current
thousands_sep and decimal_point as those that should be used with
Latin digits.  This method is definitely very wrong, and would
cause several problems (struct lconv...).

Bruno:  I think my proposed change to gettext should be canceled,
or delayed to after resolution of this thread.



Thanks everyone for reading to this point :),
behdad


References:

1. Programming Languages -- C, ISO/IEC 9899:1999, available at:
   http://www.nirvani.net/docs/ansi_c.pdf

2. Information technology -- Specification method for cultural
   conventions, ISO/IEC TR 14652:2002-08-12 Final text, available at:
   http://anubis.dkuug.dk/jtc1/sc22/wg20/docs/n972-14652ft.pdf
Follow-Ups:
- Re: Handling numbers input/output in glibc
  - From: Ulrich Drepper