Grepping Unicode files?

Eric Blake eblake@redhat.com
Thu May 14 17:14:00 GMT 2015


On 05/14/2015 10:32 AM, Vince Rice wrote:

> locale run from a cmd.exe session says that everything is “C.UTF-8”, while locale run from mintty says that everything is en_US.UTF-8. A “which” in both cases shows that the locale being run is cygwin’s, so I assume mintty does something slightly differently than the normal console? I don’t even know if there’s a difference. (Have I mentioned I don’t know anything about all of this?)
> 
> From cmd.exe:
> LANG=
> LC_CTYPE="C.UTF-8"
> LC_NUMERIC="C.UTF-8"
> LC_TIME="C.UTF-8"
> LC_COLLATE="C.UTF-8"
> LC_MONETARY="C.UTF-8"
> LC_MESSAGES="C.UTF-8"
> LC_ALL=

That's because all programs default to the C locale (which cygwin
reports as C.UTF-8) unless told otherwise; from cmd, nothing in the
environment says otherwise, as each cygwin command is the first process
in its own tree of processes.

> 
> From mintty
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_ALL=

mintty is a cygwin process, and it sets your locale variables to match
your Windows locale.  All other processes are then children of mintty
and inherit the preferred locale settings by default.  Of course, if you
don't like mintty's defaults, you can set up your shell initialization
scripts to change them to your preference.
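
As a rough sketch of such an override, assuming bash and a ~/.bashrc that
your shell actually reads (en_US.UTF-8 is only an example value, and the
locale call at the end is there just to show the result):

```shell
# Example preference only; substitute whatever locale you want.
export LANG=en_US.UTF-8

# With LC_ALL and the individual LC_* variables unset,
# every category falls back to the value of LANG.
unset LC_ALL LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES

# Show what child processes will now see.
locale_report=$(locale 2>/dev/null)
echo "$locale_report"
```

Any process started from that shell afterwards inherits the same
settings, the same way children of mintty inherit mintty's.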

> 
> Now, pardon my continued ignorance, but which of those variables needs to be set to UTF16 in order for grep to work? And I assume it (they?) should be set to en_US.UTF-16?

None.  UTF16 is not a valid locale.  It is a valid encoding (wide
character), but locales must operate on multi-byte sequences, not wide
characters.  So you HAVE to convert from wide character to multi-byte
before you can do anything that requires a locale to work correctly.

> 
> Thanks to everyone for your help. I think you’ve all confirmed this isn’t cygwin-specific, but I couldn’t find anything even searching generically (“grep unicode” and now “grep utf16”). I did finally find an external reference to iconv, but if grep is supposed to handle this natively, I haven’t been able to find much on how to do it.

grep cannot handle UTF16 natively.  iconv exists to do encoding
transformations, so that the rest of the system can live in a multi-byte
world instead of worrying about wide-character encodings.
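
A minimal sketch of that conversion step, assuming a small UTF-16LE file
with a byte-order mark (the sample file and its contents are made up
here purely for illustration):

```shell
# Hypothetical sample: the lines "hello" and "world", encoded as
# UTF-16LE with a leading byte-order mark (FF FE).
sample=$(mktemp)
printf '\377\376h\000e\000l\000l\000o\000\n\000w\000o\000r\000l\000d\000\n\000' > "$sample"

# Convert to UTF-8 on the fly, then grep the multi-byte stream.
result=$(iconv -f UTF-16 -t UTF-8 "$sample" | grep 'world')
echo "$result"

rm -f "$sample"
```

If your file has no byte-order mark, you will probably need to name the
byte order explicitly, for example with -f UTF-16LE.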

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
