Need help with multibyte UTF-8 characters

Thomas Taylor tayloth@gmail.com
Tue Dec 12 20:17:00 GMT 2017


I believe that Cygwin displays certain UTF-8 characters incorrectly.  To
reproduce the problem, first save the attached "utf-8_test.sed" text file
to your desktop.  Then run mintty and open its options by right-clicking
the title bar and selecting "Options" and then "Text".  On the Text page,
set "Locale" to "en_US" and "Character set" to "UTF-8", then click
"Save".  Exit and restart mintty.  Change directory to your desktop and
open utf-8_test.sed in vim.  Inside vim, run ":set fileencoding=utf-8".
Vim correctly displays most of the one-, two-, and three-byte UTF-8
characters in the test file.  It fails, however, on the three-byte
encodings for the en dash, the em dash, and the ellipsis, each of which
displays as a filled-in rectangle.  Now exit vim and run "less" or "cat"
on utf-8_test.sed.  Most of the sample characters again display
correctly, except for the en dash, em dash, and ellipsis.  So the
problem appears to lie in the underlying Cygwin run-time libraries
rather than in vim, less, or cat.  I haven't tested four-byte UTF-8
encodings, but I assume Cygwin will have similar problems with them.
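
One way to narrow this down further, offered only as a sketch and not as part of the original test: write the three problem characters to the terminal from their raw UTF-8 bytes, so the result does not depend on the test file or on any editor. The helper name show_char is hypothetical. It assumes a POSIX shell with printf and od available.

```shell
# Emit a character from its raw UTF-8 bytes (given as octal escapes)
# and then dump those same bytes in hex.
# show_char is a hypothetical helper name, not part of Cygwin.
show_char() {
    # $1 = label, $2 = printf format string holding the octal escapes
    printf '%-10s ' "$1"
    printf "$2"
    printf '  bytes:'
    printf "$2" | od -An -tx1
}

show_char 'en dash'  '\342\200\223'   # U+2013
show_char 'em dash'  '\342\200\224'   # U+2014
show_char 'ellipsis' '\342\200\246'   # U+2026
show_char 'euro'     '\342\202\254'   # U+20AC, reported as displaying fine
```

If the hex column shows the expected three bytes but the character itself still appears as a filled-in rectangle, the bytes are reaching the terminal intact and the fault is somewhere in the display path.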

-------------- next part --------------
# This is file "utf-8_test.sed"
#
# It's used by the "sed" utility program
# to convert XML-encoded filenames to UTF-8

# Match longest strings first

# Three-byte encodings:

# En dash
s/%[Ee]2%80%93/–/g

# Em dash
s/%[Ee]2%80%94/—/g

# Horizontal ellipsis
s/%[Ee]2%80%[Aa]6/…/g

# Less-than-or-equal sign
s/%[Ee]2%89%[Aa]4/≤/g

# Euro symbol
s/%[Ee]2%82%[Aa][Cc]/€/g

# Two-byte encodings:

# Non-break space
s/%[Cc]2%[Aa]0/⎵/g

# Lowercase a with acute accent
s/%[Cc]3%[Aa]1/á/g

# Lowercase a with umlaut (a.k.a. diaeresis)
s/%[Cc]3%[Aa]4/ä/g

# Lowercase e with acute accent
s/%[Cc]3%[Aa]9/é/g

# Lowercase i with acute accent
s/%[Cc]3%[Aa][Dd]/í/g

# Lowercase o with acute accent
s/%[Cc]3%[Bb]3/ó/g

# Lowercase n with tilde
s/%[Cc]3%[Bb]1/ñ/g

# Lowercase c with acute accent 
s/%[Cc]4%87/ć/g

# Lowercase o with long accent (a.k.a. macron)
s/%[Cc]5%8[Dd]/ō/g

# One-byte encodings:

# "And" sign (a.k.a. ampersand)
s/&amp;/\&/g

# Space
s/%20/ /g

# Sharp (or pound) sign
s/%23/#/g

# Left square bracket
s/%5[Bb]/[/g

# Right square bracket
s/%5[Dd]/]/g

# Percent sign (decoded last, so that a decoded "%" is never
# re-read as the start of another escape)
s/%25/%/g

# End of file "utf-8_test.sed"
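
For reference, a minimal sketch of how rules like these could be exercised from the shell. The decode function name and the sample filenames are made up for illustration, and only a few of the rules are repeated inline so that the snippet is self-contained. The full file would normally be applied with "sed -f utf-8_test.sed".

```shell
# decode: apply a handful of the substitutions from utf-8_test.sed.
# (Hypothetical helper; the rule for %25 is placed last so that a
# decoded "%" is not decoded a second time.)
decode() {
    sed -e 's/%[Ee]2%80%93/–/g' \
        -e 's/%[Cc]3%[Aa]9/é/g' \
        -e 's/%20/ /g' \
        -e 's/%5[Bb]/[/g' \
        -e 's/%5[Dd]/]/g' \
        -e 's/%25/%/g'
}

# Made-up sample input; prints: café – menu[1].txt
printf '%s\n' 'caf%C3%A9%20%E2%80%93%20menu%5B1%5D.txt' | decode

# An escaped percent sign is decoded once only; prints: a%20b
printf '%s\n' 'a%2520b' | decode
```

The second call shows why the order of the rules matters: if %25 were decoded before %20, the input would come out as "a b" instead of "a%20b".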
