readdir() returns inaccessible name if file was created with invalid UTF-8

Mark Liam Brown brownmarkliam@gmail.com
Tue Sep 17 10:42:24 GMT 2024


On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin
<cygwin@cygwin.com> wrote:
>
> Christian Franke via Cygwin wrote:
> > Thomas Wolff via Cygwin wrote:
> >>
> >> Am 15.09.2024 um 20:15 schrieb Thomas Wolff via Cygwin:
> >>> Am 15.09.2024 um 19:47 schrieb Christian Franke via Cygwin:
> >>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
> >>>> does not refuse to create the file. Later readdir() returns a
> >>>> different name which could not be used to access the file.
> >>>>
> >>>> Testcase with U+1F321 (Thermometer):
> >>>>
> >>>> $ uname -r
> >>>> 3.5.4-1.x86_64
> >>>>
> >>>> $ printf $'\U0001F321' | od -A none -t x1
> >>>>  f0 9f 8c a1
> >>>>
> >>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
> >>>>
> >>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
> >>>>
> >>>> $ touch 'file3-'$'\xf0\x9f\x8c'
> >>>>
> >>>> $ ls -1
> >>>> ls: cannot access 'file2-.?ext': No such file or directory
> >>>> ls: cannot access 'file3-': No such file or directory
> >>>> 'file1-'$'\360\237\214\241''.ext'
> >>>> file2-.?ext
> >>>> file3-
> >>> I don't reproduce this.
> >
> > Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
> > which needs to call stat(). Plain 'ls' does not, so the errors do not
> > occur then.
> >
> >
> >>>
> >>> While the file name gets mangled, all resulting file names are valid
> >>> and listed:
> >>> In file2 the sequence is turned into U+17B3 but exchanged with the dot.
> >>> In file3 the same sequence is just dropped.
> >>> $ ls -1|cat
> >>> file1-🌡.ext
> >>> file2-.ឳext
> >>> file3-
> >>>
> >>> However, ls file2* fails, as does ls *.
> >> On the other hand, ls file3- fails too, so some mapping error occurs
> >> internally.
> >> Also, the files cannot be deleted from cygwin (need to use cmd).
> >
> > 'rm' using the original names works for file2-..., but not for file3-...
> >
> > $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
> > removed 'file2-'$'\360\237\214''.ext'
> >
> > $ rm -v 'file3-'$'\xf0\x9f\x8c'
> > rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
> >
>
> Further tests suggest that the problem only occurs with:
> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
> 'high surrogate' range (0xD800..0xDBFF).

That makes sense: the Windows kernel uses UTF-16 internally, and both
cases involve surrogates, which are only valid in UTF-16 as a high/low
pair.
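
As a rough illustration of the UTF-16 connection, here is a small Python
sketch. It is only a reading of why these two cases stand out, not a
description of Cygwin's actual conversion code, and the helper names
(utf16_surrogates, high_surrogate_from_prefix, is_strict_utf8) are made
up for this example. It shows that the first three bytes of a 4-byte
UTF-8 sequence already determine the UTF-16 high surrogate, while the
low surrogate depends on the missing fourth byte, and that a strict
UTF-8 decoder rejects both kinds of input from the list above.

```python
def utf16_surrogates(cp):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def high_surrogate_from_prefix(prefix):
    """Derive the high surrogate from the first 3 bytes of a 4-byte sequence.

    Hypothetical helper: the three bytes carry the top 15 bits of the
    21-bit code point, which is enough for the high surrogate only.
    """
    if (len(prefix) != 3
            or (prefix[0] & 0xF8) != 0xF0
            or (prefix[1] & 0xC0) != 0x80
            or (prefix[2] & 0xC0) != 0x80):
        raise ValueError("not a truncated 4-byte UTF-8 sequence")
    # top == code point >> 6 (the low 6 bits live in the missing byte)
    top = ((prefix[0] & 0x07) << 12) | ((prefix[1] & 0x3F) << 6) | (prefix[2] & 0x3F)
    if not 0x400 <= top <= 0x43FF:
        raise ValueError("prefix does not lead to U+10000..U+10FFFF")
    return 0xD800 + ((top - 0x400) >> 4)


def is_strict_utf8(raw):
    """Return True if raw is well-formed UTF-8 according to Python's decoder."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    hi, lo = utf16_surrogates(0x1F321)
    print("U+1F321 -> %04X %04X" % (hi, lo))
    print("prefix f0 9f 8c -> %04X" % high_surrogate_from_prefix(b"\xf0\x9f\x8c"))
    # The three names from the testcase, plus an encoded high surrogate.
    for name in (b"file1-\xf0\x9f\x8c\xa1.ext",
                 b"file2-\xf0\x9f\x8c.ext",
                 b"file3-\xf0\x9f\x8c",
                 b"\xed\xa0\x80"):
        print(name, is_strict_utf8(name))
```

Only file1 passes the strict check. A caller that wants to avoid
creating such names could run a check like is_strict_utf8() on the byte
string before passing it to open().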

Mark
-- 
IT Infrastructure Consultant
Windows, Linux

