This is the mail archive of the cygwin mailing list for the Cygwin project.

Re: Cygwin fails to utilize Unicode replacement character

From: Steven Penny <svnpenn at gmail dot com>
To: cygwin at cygwin dot com
Date: Tue, 04 Sep 2018 14:05:51 -0700 (PDT)
Subject: Re: Cygwin fails to utilize Unicode replacement character
References: <CAJ1FpuNrMhfB-cmKSiQbj_JB2F_GymCzuv_kY2K9M7RuFqr8Rw@mail.gmail.com>

On Tue, 4 Sep 2018 13:59:10, Doug Henderson wrote:

My preference is to remove the output fiddling code that Corinna has
been working on. It is trying to solve the wrong problem.
I think we have gone down a rabbit hole at the wrong end of cat's data flow.


this has nothing to do with "cat". it has to do with the unfounded design
decision to use U+2592. Granted at this point we are bikeshedding - but an
official standard does exist, namely Unicode, with 2 applicable characters for
this use case:

1. U+FFFD: http://unicode.org/charts/nameslist/n_FFF0.html
2. U+25A1: http://unicode.org/charts/nameslist/n_25A0.html
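
For reference, the Unicode names of the three code points discussed in this
thread can be checked against Python's bundled Unicode database. This is only a
lookup sketch; the helper name `describe` is made up for illustration and is
not part of Cygwin or Mintty.

```python
import unicodedata


def describe(code_point):
    """Return 'U+XXXX NAME' for a code point (hypothetical helper)."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"


# The two candidates listed above, plus the character currently in use.
for cp in (0xFFFD, 0x25A1, 0x2592):
    print(describe(cp))
```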

Should any changes to the way a character is displayed be required, they
need to be in the terminal program that displays the character, not in
cygwin, which should pass the character along unmodified.


the "terminal" in this case is either "cygwin" or "xterm" - in both cases code
changes have already been made in response to this thread, so i don't think your
comment here holds weight.

Both cygwin and Debian 9.5 show:

    $ file alfa.txt
    alfa.txt: ISO-8859 text

When Linux reads the file, it assumes the encoding is UTF-8.
When cygwin reads the file, it assumes the encoding is CP1252.
This command shows the problem:

    $ iconv -f utf8 alfa.txt
    iconv: alfa.txt:1:0: incomplete character or shift sequence

On Linux, this shows a slightly different message, with the same intent.

Try using this string:

    $ printf "\xC3\xAB\353\n"
    ë▒

to get a better understanding of the problem. It contains two
representations of LATIN SMALL LETTER E WITH DIAERESIS, first encoded
in UTF-8, then using ISO-8859-1.
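
The byte string from the printf example can also be inspected outside the
terminal. The sketch below uses Python's own decoders, so it shows how those
decoders treat the bytes and makes no claim about what Cygwin or Mintty do
internally; the names `RAW`, `first_invalid_offset` and `decode_views` are
invented for illustration.

```python
# Same bytes as printf "\xC3\xAB\353\n": UTF-8 "ë", then ISO-8859-1 "ë", newline.
RAW = b"\xC3\xAB\xEB\n"


def first_invalid_offset(data):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def decode_views(data):
    """Decode the same bytes under two different encoding assumptions."""
    return {
        # Python substitutes U+FFFD REPLACEMENT CHARACTER for invalid input.
        "utf-8": data.decode("utf-8", errors="replace"),
        # Every byte in this sample maps to a character in CP1252.
        "cp1252": data.decode("cp1252"),
    }


print(first_invalid_offset(RAW))
for name, text in decode_views(RAW).items():
    print(name, ascii(text))
```

Under the UTF-8 assumption only the lone 0xEB byte is rejected; under CP1252
nothing is rejected, but the leading two-byte UTF-8 sequence comes out as two
separate characters.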


now it appears *you* are going down the rabbit hole. both Cygwin and Mintty were
in violation of the Unicode standard - however this has already been remedied in
the code.

There are two different reasons for the MEDIUM SHADE. Here it
indicates an invalid UTF-8 character, and the font does not have a
glyph for REPLACEMENT CHARACTER. The MEDIUM SHADE is also used in
place of an ordinary character without a glyph in the font.


this is flat wrong. U+2592 MEDIUM SHADE is *only* used in cases of invalid
UTF-8. In the case of a missing character, the ".notdef" glyph is used - as has
been discussed several times in this thread. This is not an actual character, so
i cannot paste it here - but as an example, with "DejaVu Sans Mono" the glyph is
an empty rectangle.


