Is this correct behaviour for 'rev'?

Thu Oct 24 17:22:18 GMT 2024



On 24.10.2024 at 15:56, Brian Inglis via Cygwin wrote:
> On 2024-10-24 02:37, Thomas Wolff via Cygwin wrote:
>>
>> On 24.10.2024 at 07:01, Mark Geisert via Cygwin wrote:
>>> Replying to myself, I continue...
>>>
>>> On 10/22/2024 10:33 PM, Mark Geisert via Cygwin wrote:
>>>> On 10/22/2024 8:00 PM, Backwoods BC via Cygwin wrote:
>>>>> It appears that 'rev' is choking on any character \x80 or higher, but
>>>>> is OK with those \x1f or smaller. It doesn't give an error or ignore
>>>>> it, it just stops.
>>>>>
>>>>> I don't have access to a Linux box so I can't see if this happens
>>>>> there and nothing in the documentation suggests that this is the
>>>>> correct functionality.
>>>>>
>>>>> Test case:
>>>>> printf 'no non-ASCII characters\nhex 01 >\x01< here\nhex 80 >\x80< here\nLine 4\n' | rev | rev
>>>>>
>>>>> This is for "rev from util-linux 2.33.1"
>>>>>
>>>>> I don't have the current version of 'rev' on my system due to not
>>>>> having updated in a while. I accidentally screwed up my installation
>>>>> and have been reluctant to wipe it and start over.
>>>>>
>>>>> So, is this the expected behaviour for the current version of 'rev'
>>>>> under Cygwin and/or Linux?
>>>>
>>>> The current Cygwin util-linux 2.39.3-2 rev behaves in the same,
>>>> broken way.  It looks like line-ending char(s) are not being handled
>>>> correctly.   Don't know yet if it's rev itself or fgetws() being used
>>>> by rev that's busted.  I'll investigate further.  Thanks for the
>>>> report!
>>>
>>> This is a locale issue.  In the default Cygwin locale, rev mishandles
>>> the \x80 byte and instead of stopping with an error message it enters
>>> an infinite loop.  I'll probably report this upstream instead of
>>> working out a local fix.
>>>
>>> There is a work-around: change to the "C" locale just to run rev.
>>>     LC_ALL=C rev zzz
>>> where zzz is a file containing your four lines.  You can also run your
>>> original testcase with "rev" replaced by "LC_ALL=C rev" in both places.
>> Sorry, this is not a good workaround as it corrupts all (proper)
>> non-ASCII characters.
>> You could do e.g.
>> grep . | rev
>
> Not quite, as that just matches non-empty lines, you would have to do
> something more like `grep -o . ...`, but not sure that would do what
> you want either.
>
Ah, right, so:
grep -E -e "(^$|.)" | rev
or maybe there is some more suitable tool.
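
One candidate for a more suitable tool might be iconv with its -c option, which omits input sequences that are invalid in the source encoding. This is only a sketch: it assumes the data is meant to be UTF-8, and the function name sanitize_utf8 is made up for illustration.

```shell
# Hypothetical helper (name invented here): pass stdin through, dropping
# any bytes that do not form valid UTF-8. Assumes the input is meant to be UTF-8.
sanitize_utf8() {
    # -c tells iconv to omit invalid input sequences instead of stopping at them.
    iconv -c -f UTF-8 -t UTF-8
}

# Intended use with the test case from this thread (\200 is octal for \x80):
#   printf 'hex 80 >\200< here\nLine 4\n' | sanitize_utf8 | rev | rev
```

Unlike LC_ALL=C, this leaves properly encoded non-ASCII characters alone and only removes the stray bytes. Note that iconv may still exit with a non-zero status when it had to drop something, so the exit code of the filter should not be treated as fatal.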

> The correct approach should be to match the execution locale to the
> file locale, for example, `LC_ALL=...UTF-8 rev ...` which should
> produce the expected results.
That's not the point. You can never be sure that there is no stray
wrongly encoded byte in your files, and rev should definitely not
loop endlessly in that case.
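
Until rev handles this itself, a caller could at least check a file for such stray bytes before feeding it to rev. A minimal sketch, assuming the expected encoding is UTF-8; the helper name is_valid_utf8 is hypothetical:

```shell
# Hypothetical helper (name invented here): succeed only if the file named
# by $1 decodes cleanly as UTF-8. Without -c, iconv stops with a non-zero
# exit status at the first invalid input sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}

# Intended use:
#   if is_valid_utf8 zzz; then rev zzz; else echo "zzz: invalid UTF-8" >&2; fi
```

This does not fix anything in rev; it only keeps a wrongly encoded file away from it.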