sed doesn't like LANG= anymore

Thu May 20 16:38:00 GMT 2010

On 20.05.2010 18:05, Andy Koppe wrote:
> On Thursday, May 20, 2010, Jurriaan wrote:
>    
>> A very long sed script that's been working for ages (back from the 1.5
>> age) here has stopped working.
>>
>> It turned out sed doesn't like some strings anymore when environment
>> variable LANG is empty. With LANG=ASCII, there are no problems.
>>
>> The actual text in the SED command is shown below as spaces, but it's a
>> Swedish a with a small o on top of it, like this:
>>
>> sed -e"s/@a/ a/g;"
>>
>> where a is character 0xe5.
>>
>> Running with LANG=ASCII works, with LANG empty I get 'unterminated `s'
>> command' from sed (which confused me for a while).
>>      
> With empty LANG you're using the default UTF-8 encoding, where that
> 0xe5 byte constitutes an incomplete character. You need to either run
> with a LANG setting that fits your script, e.g. C.ISO-8859-1, or
> convert your script to UTF-8. I'm puzzled as to why LANG=ASCII would
> have worked, since that's not a valid setting.
>    
With LANG set to any unknown value, the charmap falls back to ASCII, so
the script works (there are no multibyte characters at all then).
Given the effect described, though, I doubt that a UTF-8 decoder should
swallow an ASCII byte that follows an incomplete UTF-8 sequence;
it should rather stop at the last byte that belongs to the sequence and
treat any subsequent UTF-8 lead byte or ASCII byte as the start of a new
character.
I guess the script would still work on Linux (can't try right now, 
sorry) even in a "wrong" locale, so I think something should be fixed in 
the newlib conversion functions here.
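
To illustrate the decoder behaviour I would expect, here is a minimal
Python sketch. It is not the newlib code and says nothing about how sed
itself parses its input; it only shows a decoder that ends an incomplete
sequence at its last valid byte. The byte string is my reconstruction of
the reported expression (0xe5 in the pattern), and the helper name
delimiter_count is made up for this example.

```python
# Reconstruction of the reported sed expression; 0xe5 is "a with ring"
# in ISO-8859-1 and an incomplete 3-byte lead byte in UTF-8.
script = b"s/@\xe5/ a/g;"


def delimiter_count(text: str) -> int:
    """Count the '/' delimiters still visible after decoding (hypothetical helper)."""
    return text.count("/")


# Lenient decoding: the lone 0xe5 is replaced by U+FFFD and decoding
# resumes at the following '/', which is therefore not swallowed.
lenient = script.decode("utf-8", errors="replace")

# Strict decoding rejects the byte outright instead of guessing.
try:
    script.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The two workarounds suggested in the thread: read the script in the
# charset it was written in, or convert it to UTF-8 once.
as_latin1 = script.decode("iso-8859-1")
as_utf8 = as_latin1.encode("utf-8")

print(delimiter_count(lenient))  # all three delimiters survive
print(strict_ok)
print(as_utf8)
```

With this behaviour all three '/' delimiters of the s command remain
visible after the bad byte, which is what I would expect the conversion
functions to allow as well.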
------
Thomas
