Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

Andy Koppe andy.koppe@gmail.com
Sun Sep 27 08:34:00 GMT 2009


>> The __utf8_wctomb function could just create the corresponding
>> UCS-2 values if no first half has been encountered before.  The
>> __utf8_mbtowc function could simply allow these UCS-2 values again.
>>
>> That works (I just tested it) and is a small change, but is it really
>> desirable to allow UCS-2 values in UTF-8 strings?
>
> I don't know.

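Improved answer: Debian allows them! Both a lone high surrogate (d800)
and a lone low surrogate (dc00) round-trip through wctomb and mbtowc: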

$ cat test.c
#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

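int main(int argc, char *argv[]) {
  /* Pick up the locale from the environment ("?:" is a GNU C extension). */
  puts(setlocale(LC_CTYPE, "") ?: "fail");
  unsigned arg = 0;
  char s[8];
  wchar_t wc = 0;
  /* The code point to test is given in hex as the first argument. */
  if (argc > 1)
    sscanf(argv[1], "%x", &arg);
  /* Encode it as a multibyte sequence and print the length. */
  int l = wctomb(s, (wchar_t)arg);
  printf("%i\n", l);
  if (l < 0)
    return 1;  /* not encodable in this locale, nothing to decode */
  /* Decode the sequence again; print the length and the resulting value. */
  l = mbtowc(&wc, s, l);
  printf("%i\n", l);
  printf("%x\n", (unsigned)wc);
  return 0;
}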

$ LC_CTYPE=en_GB.UTF-8 ./a.out d800
en_GB.UTF-8
3
3
d800

$ LC_CTYPE=en_GB.UTF-8 ./a.out dc00
en_GB.UTF-8
3
3
dc00


