Need help with multibyte UTF-8 characters
Thomas Taylor
tayloth@gmail.com
Tue Dec 12 03:43:00 GMT 2017
Thank you for your advice on setting my locale to en_US.UTF-8.
Unfortunately, Cygwin still seems to have trouble displaying some
three-byte UTF-8 encoded characters correctly. For example, see the
following snippet from a "sed" file. This file attempts to convert
percent-encoded filenames to UTF-8. As you can see, it converts one- and
two-byte encodings correctly, but fails on some three-byte encodings
(the en dash, the em dash, and the ellipsis, all of which are displayed
as a filled-in rectangle):
# Match longest strings first
# Three-byte encodings:
# En dash
s/%[Ee]2%80%93/–/g
# Em dash
s/%[Ee]2%80%94/—/g
# Horizontal ellipsis
s/%[Ee]2%80%[Aa]6/…/g
# Less-than-or-equal sign
s/%[Ee]2%89%[Aa]4/≤/g
# Euro symbol
s/%[Ee]2%82%[Aa][Cc]/€/g
# Two-byte encodings:
# Non-break space
#s/%[Cc]2%[Aa]0/âµ/g
# Lowercase a with acute accent
s/%[Cc]3%[Aa]1/á/g
# Lowercase a with umlaut (a.k.a. diaeresis)
s/%[Cc]3%[Aa]4/ä/g
# Lowercase e with acute accent
s/%[Cc]3%[Aa]9/é/g
# Lowercase i with acute accent
s/%[Cc]3%[Aa]D/í/g
# Lowercase o with acute accent
s/%[Cc]3%[Bb]3/ó/g
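
One way to check whether a substitution like the ones above produces the
right bytes, independent of what the terminal displays, is to dump the sed
output in hex. The following is only a minimal sketch, assuming a POSIX
shell with sed and od available; the sample name and the variable names
are made up for illustration and are not part of the original script.

```shell
#!/bin/sh
# Hypothetical sample input: a percent-encoded name containing an en dash.
sample='2016%E2%80%932017'

# Build the en dash from octal escapes so this script stays pure ASCII
# (octal 342 200 223 is hex E2 80 93, the UTF-8 encoding of U+2013).
endash=$(printf '\342\200\223')

# Apply the same pattern as the en dash rule; LC_ALL=C makes sed treat
# the data as plain bytes.
decoded=$(printf '%s\n' "$sample" | LC_ALL=C sed "s/%[Ee]2%80%93/$endash/g")

# Dump the result as hex bytes. A correct conversion contains e2 80 93
# between the bytes for "2016" and "2017".
printf '%s' "$decoded" | od -An -tx1
```

If the hex dump shows the expected bytes but the character still appears
as a filled-in rectangle, that would suggest the problem lies in how the
terminal or its font renders the character rather than in the sed rules.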