This is the mail archive of the cygwin mailing list for the Cygwin project.


Re: UTF-8 character encoding

From: Michael Enright <mike at kmcardiff dot com>
To: cygwin at cygwin dot com
Date: Tue, 26 Jun 2018 14:39:35 -0700
Subject: Re: UTF-8 character encoding
References: <CAD8GWss253v-p+FjeonEqibr53v6wZRCQ+NWxBhb0LimQaM4sQ@mail.gmail.com> <1183751257.20180621042620@yandex.ru> <CAD8GWsuo3PuQSdSyMRhbxZQXa=GUSBcyes7QEaqDYfh3FCof0Q@mail.gmail.com> <5B3045B1.4080504@tlinx.org> <CAD8GWsuevQX6fBUzkEvUs5rBPehhG7-ht+FPZU=eOaACF5uCPg@mail.gmail.com>

On Mon, Jun 25, 2018 at 11:33 AM, Lee <ler762@gmail.com> wrote:
> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
> 0xff is part of the utf-8 encoding.

I don't see how you arrived at this. 0xFF cannot begin any valid UTF-8
byte sequence; in fact it never appears anywhere in valid UTF-8. It is
also inconsistent with the statement you make later:

>  An easy way to remember this transformation format is to note that the
>  number of high-order 1's in the first byte is the same as the number of
>  subsequent bytes in the multibyte character:

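This is close, but for multibyte sequences the count of high-order 1's
equals the total number of bytes, lead byte included. There is also a
zero bit that ends the high-order-1's bit string, which means that 0xFF
is not a valid lead byte. 0x7F is the highest byte value that you can
have as a single-byte UTF-8 string.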
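
To make the lead-byte rule concrete, here is a minimal Python sketch
that maps a first byte to the sequence length it announces. The
function name utf8_sequence_length is made up for illustration; the
byte ranges are the ones I read out of RFC 3629, so check them against
the RFC rather than taking this as authoritative.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the total sequence length announced by a lead byte,
    or 0 if the byte cannot start a valid UTF-8 sequence.

    Hypothetical helper; ranges follow RFC 3629 as I understand it.
    """
    if 0x00 <= lead <= 0x7F:   # 0xxxxxxx: single-byte (ASCII)
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110xxxxx: two bytes (0xC0/0xC1 would be overlong)
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx: three bytes
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110xxx: four bytes, capped so U+10FFFF is the maximum
        return 4
    # Everything else: continuation bytes 0x80-0xBF, plus 0xC0, 0xC1
    # and 0xF5-0xFF, none of which can start a sequence.
    return 0


for byte in (0x7F, 0x80, 0xC3, 0xE2, 0xF4, 0xFF):
    print(f"0x{byte:02X} -> {utf8_sequence_length(byte)}")
```

Only the first byte is inspected here; a real validator would also
have to check the continuation bytes that follow.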

Perhaps your statement about 0-0xFF was meant to be read differently.

Thomas Wolff's note seems to object to the inclusion of code points
above U+10FFFF, which are not legal in UTF-8 but were allowed in the
original proposal. Otherwise rows 1-4 of your table are correct.
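
The U+10FFFF ceiling can be checked with a short Python sketch that
relies on the built-in strict "utf-8" codec. The helper name
decodes_as_utf8 and the sample byte strings are mine, chosen only for
illustration: one encodes U+10FFFF, one would encode U+110000, and one
is a five-byte form of the kind the original proposal allowed.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point UTF-8 may encode (RFC 3629)


def decodes_as_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The largest legal code point fits in four bytes.
encoded_max = chr(MAX_CODE_POINT).encode("utf-8")

# Same four-byte bit pattern, one code point too far (would be U+110000).
beyond_limit = bytes([0xF4, 0x90, 0x80, 0x80])

# Five-byte form (would be U+200000); lead byte 0xF8 is not valid today.
five_byte = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

print(encoded_max.hex(" "), decodes_as_utf8(encoded_max))
print(beyond_limit.hex(" "), decodes_as_utf8(beyond_limit))
print(five_byte.hex(" "), decodes_as_utf8(five_byte))
```

Other decoders may be more lenient than Python's, so treat this as a
demonstration of the rule, not as a conformance test.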

The standards, such as IETF RFC 3629, are easy enough to read, so I
recommend reading them and citing them to others instead of trying to
summarize them.


