This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
[PATCH] [BZ 14094 13064] Update locale data to Unicode 7.0.0
- From: Pravin Satpute <psatpute at redhat dot com>
- To: libc-alpha at sourceware dot org
- Date: Fri, 04 Jul 2014 14:48:20 +0530
- Subject: [PATCH] [BZ 14094 13064] Update locale data to Unicode 7.0.0
- Authentication-results: sourceware.org; auth=none
- References: <53B65D65 dot 5050008 at redhat dot com>
Hi,
I have worked on updating the UTF-8 file to Unicode 7.0.
The patch is around 1.2MB, and it looks like libc-alpha does not allow
me to post an attachment of that size, so I have attached the patch [1]
to the bug [2].
1. The present patch is only for CHARMAP; a patch updating WIDTH will be
available soon.
2. utf8-gen.py: New script to generate the UTF-8 file.
3. The patch was created ignoring whitespace changes (-w).
4. Where the UnicodeData.txt file gives characters as a range, for
example:
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
the UTF-8 file lists the range in chunks between the First and Last
characters, each chunk ending 0x3F after its start.
Example:
<U3400>..<U343F> /xe3/x90/x80 <CJK Ideograph Extension A>
.
.
<U4D80>..<U4DB5> /xe4/xb6/x80 <CJK Ideograph Extension A>
Note: I have no idea why the Hangul syllables (AC00..D7A3) were not
expanded in the Unicode 5.0 UTF-8 file. For consistency, we are
expanding Hangul as well.
5. Names from UnicodeData.txt are changed in some cases.
Some characters have <control> as their name, so the "Unicode 1.0 Name"
is used instead.
Characters U+0080, U+0081, U+0084 and U+0099 have "<control>" as their
name and also no "Unicode 1.0 Name" (10th field) in UnicodeData.txt.
We could write code to take their alternate names from NameAliases.txt.
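To illustrate the range handling in item 4, here is a rough sketch of how a First/Last pair could be split into 0x40-wide chunks. It is not the code from utf8-gen.py; the function names (ucs_symbol, utf8_bytes, expand_range) are hypothetical, the single-space column layout simply mirrors the example above, and the 8-digit form for code points above U+FFFF is an assumption about the charmap symbol format.

```python
def ucs_symbol(cp):
    """Charmap symbol for a code point (assumed: 4 hex digits in the BMP, 8 above)."""
    return "<U%04X>" % cp if cp <= 0xFFFF else "<U%08X>" % cp


def utf8_bytes(cp):
    """UTF-8 encoding of a code point written as /xNN/xNN..."""
    return "".join("/x%02x" % b for b in chr(cp).encode("utf-8"))


def expand_range(first, last, name, step=0x40):
    """Split first..last into chunks that end on a multiple-of-step boundary."""
    lines = []
    start = first
    while start <= last:
        # End of the aligned chunk containing start, clipped to the Last character.
        end = min(start | (step - 1), last)
        lines.append("%s..%s %s <%s>" % (
            ucs_symbol(start), ucs_symbol(end), utf8_bytes(start), name))
        start = end + 1
    return lines


if __name__ == "__main__":
    for line in expand_range(0x3400, 0x4DB5, "CJK Ideograph Extension A"):
        print(line)
```

The same call with 0xAC00 and 0xD7A3 would expand the Hangul syllables in the same way.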
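For the name handling in item 5, the NameAliases.txt fallback could look something like the sketch below. This is only an idea of how it might be done, not what the patch currently does. The helper names (load_aliases, pick_name) are hypothetical, the Unicode 1.0 name is read from zero-based field index 10 of a UnicodeData.txt line, and taking the first alias listed for a code point is an assumption.

```python
def load_aliases(lines):
    """Map code point -> first alias from NameAliases.txt lines (code;alias;type)."""
    aliases = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, alias, _alias_type = line.split(";")
        # Assumption: keep only the first alias listed for each code point.
        aliases.setdefault(int(code, 16), alias)
    return aliases


def pick_name(unicode_data_line, aliases):
    """Choose the name to emit for one UnicodeData.txt line."""
    fields = unicode_data_line.strip().split(";")
    cp = int(fields[0], 16)
    name = fields[1]
    if name != "<control>":
        return name
    # Prefer the Unicode 1.0 name; fall back to an alias, then to "<control>".
    return fields[10] or aliases.get(cp, name)
```

In use, load_aliases would be given the lines of NameAliases.txt once, and pick_name would then be called for every line of UnicodeData.txt.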
Let me know if you have any issues, doubts or suggestions for improvement.
Best Regards,
Pravin Satpute
1. https://sourceware.org/bugzilla/attachment.cgi?id=7679
2. https://sourceware.org/bugzilla/show_bug.cgi?id=14094