filenames with characters that have the high bit set

David Byron dbyron@dbyron.com
Tue Mar 16 20:52:00 GMT 2010


> > > > $ echo $LC_ALL
> > > > en_US
> > >
> > > Hang on, where did that come from?

It was in my environment.  My apologies for being dense.

> > I unset LC_ALL and...
> 
> Where?

I unset LC_ALL in bash, which was the wrong place.

> > Now ls foo<tab> adds the actual accented character to
> > the command line, but when I press return I get:
> >
> > ls: cannot access foo<a gray box>: No such file or directory

And of course this works now.  Sorry for the trouble.

> > I still get the right answer from test -f, when using
> > the shell builtin.  /usr/bin/test tells me the file
> > doesn't exist.
> 
> .. and that.

As does this, as long as I use the same encoding I used when I originally
created the file, which is totally fine.

> > > The \x18 scheme is only used for codepoints that can
> > > not be represented in the selected character set, yet
> > > U+00E9 can be represented in CP1252. By definition, any
> > > Unicode codepoint can be represented in UTF-8, so the
> > > \x18 scheme is never used when that is selected.
> > >
> > > To enable C-style backslash interpretation, you need
> > > to use $'...' quoting.
> >
> > I now see the bash man page explains this.  Must have
> > missed it the first time.  The above paragraphs with
> > some examples (where \x18 is needed and where it isn't)
> > added to
> >
> > http://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-unusual
> > would have gotten me farther before posting.
> 
> But what I said is explained there already:

I suppose, but the point about \x18 not working with a character set that
represents the desired codepoint wasn't clear.  Nor was the bash syntax for
using \x in general.  It's in the bash man page and not cygwin-specific, but
an example showing the gory details would have helped me at least.
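For illustration, here is a minimal sketch of the bash side of this, i.e. the
$'...' quoting mentioned above.  It is not Cygwin-specific and says nothing
about the \x18 scheme itself; it only shows that plain single quotes keep
\xC3\xA9 as literal characters, while $'...' turns it into the two bytes that
encode U+00E9 in UTF-8.  The variable names are made up for the example, and
it assumes bash plus the usual od and tr utilities.

```shell
#!/bin/bash
# Plain single quotes: the backslash sequences stay literal (8 characters).
literal='\xC3\xA9'
# $'...' quoting: bash interprets each \xHH as one byte (UTF-8 for U+00E9).
decoded=$'\xC3\xA9'

# Dump both strings as hex so the difference is visible in any locale.
literal_hex=$(printf '%s' "$literal" | od -An -tx1 | tr -d ' \n')
decoded_hex=$(printf '%s' "$decoded" | od -An -tx1 | tr -d ' \n')

echo "literal: $literal_hex"
echo "decoded: $decoded_hex"
```

The first line of output is the hex for the eight literal characters, the
second is just c3a9.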

> > And finally here are the steps that illustrate what's going on.
> >
> > $ touch $'\x18'; echo $?
> > 0
> >
> > ls shows a file named up-arrow (0x18):
> 
> What do you mean by up-arrow? I'm getting a question mark, because
> that's what ls prints for non-printable characters by default. You can
> choose various quoting styles using the --quoting-style option.

I mean the up-arrow that ls prints with --show-control-chars.  Another
important omission on my part.  Doh!
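To make the difference concrete, here is a rough sketch of how the same
filename shows up under a few ls options.  It assumes GNU coreutils ls and
bash; the temporary directory and variable names are only for the example.
How the raw byte is drawn (up-arrow, gray box, nothing) depends on the
terminal, so the sketch dumps it as hex instead.

```shell
#!/bin/bash
# Sketch only: assumes GNU coreutils ls; the temp directory is illustrative.
dir=$(mktemp -d)
touch "$dir/"$'\x18'          # file whose name is the single byte 0x18

# -q replaces non-printable bytes with a question mark.
hidden=$(ls -q "$dir")
# --show-control-chars passes the raw byte through unchanged.
raw_hex=$(ls --show-control-chars "$dir" | od -An -tx1 | tr -d ' \n')
# --quoting-style=escape prints a backslash escape instead.
escaped=$(ls --quoting-style=escape "$dir")

echo "hidden:  $hidden"
echo "raw:     $raw_hex"
echo "escaped: $escaped"
rm -r "$dir"
```

The raw line includes the trailing newline that ls writes after the name.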

> Yep, but that's a bash vs ls issue rather than a Cygwin
> one. You'd get the same on Linux. But if you use control
> characters in filenames, you better know what you're doing
> anyway. Some argue that it shouldn't be allowed in the
> first place, e.g.
> http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

Thanks for the link.  I don't typically use control characters in filenames.
It was just an example.

> > $ mkshortcut -n shortcut$'\xC3\xA9' plain; echo $?
> > $ readshortcut shortcut$'\xE9'
> 
> I'm afraid these aren't yet Unicode-ready, i.e. they still use Windows
> "ANSI" APIs.

Guess it's time to roll up my sleeves and write a patch.

-DB


--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple