Fwd: odd behavior of length(), match() and field splitting with multi-byte characters

Tue Aug 20 17:23:21 GMT 2024

There do seem to be anomalies in Cygwin's handling of SMP (Supplementary
Multilingual Plane) characters, perhaps due to conversion to, or
misinterpretation as, UTF-16/UCS-2 surrogates?

  char  code point  UTF-8        UTF-16
  🔍    U+01f50d    f0 9f 94 8d  d83d dd0d
  🔎    U+01f50e    f0 9f 94 8e  d83d dd0e
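
For reference, the UTF-16 values can be derived from the code point with
plain shell arithmetic. This is only a sketch to show where a surrogate pair
comes from; the function name "surrogates" is made up for illustration and is
not part of any tool:

```shell
# Sketch: print the UTF-16 code unit(s) for one code point.
# "surrogates" is a hypothetical helper name used only for this example.
surrogates() {
  cp=$(( $1 ))
  if [ "$cp" -lt 65536 ]; then
    # BMP code points fit in a single 16-bit code unit
    printf '%04x\n' "$cp"
  else
    v=$(( cp - 0x10000 ))             # 20-bit offset above the BMP
    hi=$(( 0xD800 + (v >> 10) ))      # high surrogate carries the top 10 bits
    lo=$(( 0xDC00 + (v & 0x3FF) ))    # low surrogate carries the bottom 10 bits
    printf '%04x %04x\n' "$hi" "$lo"
  fi
}

surrogates 0x1F50D   # prints: d83d dd0d
surrogates 0x1F50E   # prints: d83d dd0e
```

Any code point above U+FFFF takes two 16-bit units this way, so code that
counts 16-bit units instead of code points will see each of these characters
as two.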

$ wc -lwcmL <<< 🔎
       1       0       3       5       0
$ wc -lwcmL <<< 🔍
       1       0       3       5       0
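
wc always prints its counts in the order lines, words, characters, bytes,
maximum line length, so the above reports 3 characters in 5 bytes where I
would expect 2 characters (the symbol plus the newline). A count of UTF-16
code units rather than code points would account for that 3, and also for the
6 that gawk reports for file1 in the quoted report below. A small sketch of
that arithmetic, assuming that is indeed what is being counted (the function
name "count_units" is made up for illustration):

```shell
# Sketch: compare the number of code points with the number of UTF-16
# code units. "count_units" is a hypothetical helper for this example.
count_units() {
  chars=0
  units=0
  for cp in "$@"; do
    chars=$(( chars + 1 ))
    # code points above U+FFFF need a surrogate pair, i.e. 2 code units
    if [ $(( $cp )) -gt 65535 ]; then
      units=$(( units + 2 ))
    else
      units=$(( units + 1 ))
    fi
  done
  echo "code points: $chars, UTF-16 code units: $units"
}

# contents of file1 from the report: a, U+1F50D, U+1F50E, b
count_units 0x61 0x1F50D 0x1F50E 0x62   # code points: 4, UTF-16 code units: 6
# the wc input above: U+1F50E followed by a newline
count_units 0x1F50E 0x0A                # code points: 2, UTF-16 code units: 3
```

This only shows that the numbers are consistent with the surrogate
explanation; it does not identify where in Cygwin or gawk the counting
happens.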

On 2024-08-20 04:58, Ed Morton via Cygwin wrote:
> Is there any more information I can provide for someone to be able to look into 
> this bug?
> 
>      Ed.
> 
> On 7/6/2024 7:26 AM, Ed Morton wrote:
>> I posted the bug report below to the GNU awk bugs mailing list, 
>> https://lists.gnu.org/archive/html/bug-gawk/2024-07/msg00000.html. The 
>> feedback there is that it's a Cygwin or MSYS2 port issue; could you please 
>> take a look? I'm also posting this at 
>> https://github.com/msys2/mingw-packages/issues per the advice from the GNU bug 
>> list.
>>
>> Regards,
>>
>>     Ed Morton.
>>
>> -------- Forwarded Message --------
>> Subject:     odd behavior of length(), match() and field splitting with 
>> multi-byte characters
>> Date:     Mon, 1 Jul 2024 05:56:02 -0500
>> From:     Ed Morton
>> To:     bug-gawk@gnu.org <bug-gawk@gnu.org>
>>
>>
>>
>> Configuration Information [Automatically generated, do not change]:
>> Machine: x86_64
>> OS: cygwin
>> Compiler: gcc
>> Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security 
>> -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong --param=ssp-buffer-size=4 
>> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/build=/usr/src/debug/gawk-5.3.0-1 -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/src/gawk-5.3.0=/usr/src/debug/gawk-5.3.0-1 -DNDEBUG
>> uname output: CYGWIN_NT-10.0-22631 TournaMart_2023 3.5.3-1.x86_64 2024-04-03 
>> 17:25 UTC x86_64 Cygwin
>> Machine Type: x86_64-pc-cygwin
>>
>> Gawk Version: 5.3.0
>>
>> Attestation 1:
>>         I have read https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
>>         Yes
>>
>> Attestation 2:
>>         I have not modified the sources before building gawk.
>>         True
>>
>> Description:
>>         gawk is reporting odd lengths and matches of strings
>>         when multi-byte characters are involved.
>>
>> Repeat-By:
>>         Someone on Stack Overflow asked about a couple of issues that, 
>> so far at least, no one there can explain and that seem to be bugs.
>>
>>         1) 
>> https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138715434_78676444 and https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138720207_78676444:
>>
>>         If we write 4 characters (two of them 4-byte UTF-8 sequences) as 
>> 10 bytes using:
>>
>>             $ echo '61F09F948DF09F948E62' | xxd -r -p > file1
>>             $
>>
>>         and run the following gawk command on it we get the output shown:
>>
>>             $ LC_ALL=en_US.utf8 gawk '{print(length($0))}' file1
>>             6
>>             $
>>
>>         i.e. 6 instead of 4. If we run
>>
>>             $ printf 'F0989A9F' | xxd -r -p | LC_ALL=en_US.utf8 awk -F '' 
>> '{print NF, length(); for (i=1; i<=NF; i++) print $i}' | cat -A
>>             2 2$
>>             M-pM-^XM-^Z$
>>             M-^_$
>>             $
>>
>>         it shows that what is intended to be a single 4-byte character is 
>> being treated as 2 characters, one of 3 bytes and the other of 1 byte.
>>
>>         2) 
>> https://stackoverflow.com/questions/78690533/why-does-the-match-function-not-work-in-this-particular-situation
>>
>>         If we create some input using:
>>
>>             $ echo 
>> '3C6469763E3C6469763E5F3C2F6469763E5F3C68313E6162636465665F3C2F68313E5F3C2F6469763E3C6469763EF09F93853C2F6469763E0A' | xxd -r -p > file2
>>
>>         and then run the following on it, we get the expected output shown:
>>
>>             $ LC_ALL=en_US.utf8 gawk '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); 
>> print a[1]}' file2
>>             abcdef
>>             $
>>
>>         but if we set the `IGNORECASE` variable we get a blank line output:
>>
>>             $  LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 
>> '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); print a[1]}' file2
>>
>>             $
>>
>>         unless we also remove the end-of-string anchor, `$`, from the end 
>> of the regexp:
>>
>>             $ LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 
>> '{match($0,/^.*_<h1>(.*)_<\/h1>.*/,a); print a[1]}' file2
>>             abcdef
>>             $
>>
> 

-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                 -- Antoine de Saint-Exupéry