This is the mail archive of the cygwin mailing list for the Cygwin project.



RE: filenames with characters that have the high bit set


> > And my ~/.inputrc contains:
> >
> > set meta-flag on
> > set convert-meta off
> > set input-meta on
> > set output-meta on
> 
> Makes plenty of sense. But note that meta-flag is a synonym for
> input-meta, so you can remove one of them.

I was just following the instructions at
http://cygwin.com/faq/faq-nochunks.html#faq.using.unicode

> > $ echo $LC_ALL
> > en_US
> 
> Hang on, where did that come from?

When my cygwin.bat has set LANG=en_US.UTF-8, I get LANG=en_US.UTF-8 and
LC_ALL=en_US in bash.  When my cygwin.bat doesn't set LANG, I get
LC_ALL=en_US and LANG isn't set.

> LC_ALL overrides any other locale variables including
> LANG. Specifying a locale without a charset means that
> Cygwin 1.7.1 looks up your ANSI codepage. Assuming you're
> on a US system, this means you're getting CP1252, not
> UTF-8. (Note besides: Cygwin 1.7.2 changes to a
> Linux-compatible scheme for locales without explicit
> charset instead, where you'd get ISO-8859-1 instead.)
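The precedence described above can be sketched as a small shell function. This is only a sketch of the POSIX lookup order for character handling (LC_ALL, then LC_CTYPE, then LANG). The name effective_ctype is made up for illustration, and the function does not model how Cygwin picks a charset when the locale name omits one.

```shell
# Hypothetical helper: given the values of LC_ALL, LC_CTYPE and LANG (in that
# order), print the one that governs character handling. An empty value
# counts as unset, as POSIX specifies.
effective_ctype() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # LC_ALL overrides everything
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"      # then the category-specific variable
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"      # then LANG
    else
        printf '%s\n' "C"       # nothing set: fall back to the C locale
    fi
}

# The situation from this thread: LC_ALL=en_US hides LANG=en_US.UTF-8.
effective_ctype "en_US" "" "en_US.UTF-8"
# After "unset LC_ALL", LANG takes effect.
effective_ctype "" "" "en_US.UTF-8"
# The current shell's settings.
effective_ctype "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}"
```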

I unset LC_ALL and...

> > $ ls foo<tab>
> >
> > adds the actual accented character to the command line
> > (whether set show-all-if-ambiguous on is in ~/.inputrc
> > or not). Then I press return and ls prints the
> > filename.

Now ls foo<tab> adds the actual accented character to the command line, but
when I press return I get:

ls: cannot access foo<a gray box>: No such file or directory

when I pipe the error message to od -c, the gray box is octal 351 or 0xE9.
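Since od -c reports non-printable bytes in octal while the \x escapes are written in hex, a quick conversion helps when comparing the two. A minimal sketch follows; the helper names oct2hex and hex2oct are made up for illustration.

```shell
# printf accepts C-style numeric constants: a leading 0 means octal and a
# leading 0x means hex.
oct2hex() { printf '%X\n' "0$1"; }
hex2oct() { printf '%o\n' "0x$1"; }

oct2hex 351    # the byte od -c showed for the gray box
hex2oct E9     # and back again
```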

> > Then if I go through command history and change "ls" to
> > "test -f" and add the "; echo $?" I get the right answer
> > from test.

I still get the right answer from test -f, when using the shell builtin.
/usr/bin/test tells me the file doesn't exist.

I think I can make the above clearer with the steps below.

> The \x18 scheme is only used for codepoints that can not be
> represented in the selected character set, yet U+00E9 can be
> represented CP1252. By definition, any Unicode codepoint can be
> represented in UTF-8, so the \x18 scheme is never used when that is
> selected.
>
> To enable C-style backslash interpretation, you need to use 
> $'...' quoting.

I now see the bash man page explains this.  Must have missed it the first
time.  The above paragraphs with some examples (where \x18 is needed and
where it isn't) added to
http://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-unusual
would have gotten me farther before posting.
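For anyone else who missed it, the difference between the two quoting styles is easy to see by dumping the bytes. A small sketch, assuming bash (the variable names are just for illustration):

```shell
# Double quotes leave \x18 as four literal characters: backslash, x, 1, 8.
literal="\x18"
# ANSI-C quoting, $'...', turns \x18 into the single byte 0x18.
ctrl=$'\x18'

printf '%s' "$literal" | od -An -c
printf '%s' "$ctrl" | od -An -c
```

The first od line should show the four characters and the second a single 030.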

> > $ touch "\x18"; echo $?
> > 0
> 
> Have a look in your root directory. There should be a file 
> called x18 there.

I don't see anything in my cygwin root (/) but I do see x18 in the root of
my C drive.  Thanks.

> > Can someone give me a hand coming up with a command line
> > where I can build up filenames that contain characters
> > that have the high bit set (as well as any non-ascii
> > character really)?
> 
> Just type them in. The 'US International' keyboard layout might be
> useful here. See
> http://en.wikipedia.org/wiki/Keyboard_layout#US-International.
> 
> Otherwise, use $'...', and lose the unnecessary \x18s.

And finally here are the steps that illustrate what's going on.

$ touch $'\x18'; echo $?
0

ls shows a file named up-arrow (0x18):

$ ls $'\x18' | od -c
0000000 030  \n
0000002

but if I type

$ ls<tab>
^X

tab completion displays the name as ^X, which seems inconsistent.

Now for more interesting tests.  In an empty directory:

$ touch foo$'\xC3\xA9'; echo $?
0

$ touch bar$'\xE9'; echo $?
0

$ ls | od -c
0000000   b   a   r 351  \n   f   o   o 303 251  \n
0000013

$ ls foo<tab> (displays foo with an accented e)
ls: <gray box>: No such file or directory

$ ls bar<tab> (displays bar with an accented e)
bar<gray box>

$ ls bar$'\xE9'
bar<gray box>

$ ls foo$'\xE9'
ls: <gray box>: No such file or directory

where <gray box> is octal 351 (0xE9)

$ ls foo$'\xC3\xA9'
foo<accented e>

$ ls bar$'\xC3\xA9'
ls: cannot access bar<accented e>: No such file or directory

where <accented e> is octal 303 251 (0xC3A9)

All of the above more or less makes sense, though it seems like both \xE9
and \xC3\xA9 could work to find both foo and bar.
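The two byte sequences are the same character, U+00E9, in two different encodings, which can be checked with iconv. This is a sketch that assumes an iconv that knows the names CP1252 and UTF-8; the hexbytes helper is made up for illustration.

```shell
# Hypothetical helper: show stdin as a compact hex string.
hexbytes() { od -An -tx1 | tr -d ' \n'; echo; }

# The single CP1252 byte for the accented e (octal 351 = 0xE9) ...
printf '\351' | hexbytes
# ... re-encoded as UTF-8 becomes the two bytes C3 A9 ...
printf '\351' | iconv -f CP1252 -t UTF-8 | hexbytes
# ... and converts back to the single byte.
printf '\303\251' | iconv -f UTF-8 -t CP1252 | hexbytes
```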

$ type test
test is a shell builtin

$ test -f foo$'\xC3\xA9'; echo $?
1

$ test -f bar$'\xE9'; echo $?
1

builtin test doesn't seem to be doing the right thing here.

$ /usr/bin/test -f foo$'\xC3\xA9'; echo $?
0

$ /usr/bin/test -f bar$'\xE9'; echo $?
0

And then using the wrong encoding:

$ test -f foo$'\xE9'; echo $?
0

$ test -f bar$'\xC3\xA9'; echo $?
1

$ /usr/bin/test -f foo$'\xE9'; echo $?
1

$ /usr/bin/test -f bar$'\xC3\xA9'; echo $?
1

So there's some inconsistency here too in the builtin test.

Changing the subject here a bit, but getting to the thing that's actually
holding me up now:

$ mkshortcut -n shortcut$'\xC3\xA9' plain; echo $?
0

$ ls | od -c
0000000   s   h   o   r   t   c   u   t 303 203 302 251   .   l   n   k
0000020  \n
0000021

This doesn't seem right.
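One reading that is consistent with these bytes, offered as a guess rather than a diagnosis of mkshortcut: the two UTF-8 bytes were each treated as a CP1252 character and then converted to UTF-8 a second time. The sketch below reproduces the byte pattern with iconv; the function name as_cp1252_to_utf8 is made up for illustration.

```shell
# Hypothetical helper: reinterpret stdin as CP1252 and re-encode it as UTF-8.
as_cp1252_to_utf8() { iconv -f CP1252 -t UTF-8; }

# Feeding the UTF-8 bytes 303 251 through it yields the doubled sequence
# seen in the listing above.
printf '\303\251' | as_cp1252_to_utf8 | od -An -b

# Under the same assumption, the single byte 351 (0xE9) comes out as
# ordinary UTF-8.
printf '\351' | as_cp1252_to_utf8 | od -An -b
```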

$ mkshortcut -n shortcut$'\xE9' plain; echo $?
0

$ ls | od -c
0000000   s   h   o   r   t   c   u   t 303 251   .   l   n   k  \n
0000017

And then

$ readshortcut shortcut$'\xE9'
/home/dbyron/foo/plain

$ readshortcut shortcut$'\xC3\xA9'
readshortcut: Load failed on
C:\utils\cygwin\home\dbyron\foo\shortcut<accented e>.lnk

where <accented e> is octal 303 251 (0xC3A9)

Am I right to expect readshortcut to read the shortcut when given the UTF-8
encoding in this environment?

Thanks for your help.

-DB



