readdir() returns inaccessible name if file was created with invalid UTF-8

Cedric Blancher cedric.blancher@gmail.com
Thu Sep 19 14:59:42 GMT 2024


On Thu, 19 Sept 2024 at 16:46, Brian Inglis via Cygwin
<cygwin@cygwin.com> wrote:
>
> On 2024-09-19 07:27, Christian Franke via Cygwin wrote:
> > Mark Liam Brown via Cygwin wrote:
> >> On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin
> >> <cygwin@cygwin.com> wrote:
> >>> Christian Franke via Cygwin wrote:
> >>>> Thomas Wolff via Cygwin wrote:
> >>>>> Am 15.09.2024 um 20:15 schrieb Thomas Wolff via Cygwin:
> >>>>>> Am 15.09.2024 um 19:47 schrieb Christian Franke via Cygwin:
> >>>>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
> >>>>>>> does not refuse to create the file. Later readdir() returns a
> >>>>>>> different name which could not be used to access the file.
> >>>>>>>
> >>>>>>> Testcase with U+1F321 (Thermometer):
> >>>>>>>
> >>>>>>> $ uname -r
> >>>>>>> 3.5.4-1.x86_64
> >>>>>>>
> >>>>>>> $ printf $'\U0001F321' | od -A none -t x1
> >>>>>>>   f0 9f 8c a1
> >>>>>>>
> >>>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
> >>>>>>>
> >>>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
> >>>>>>>
> >>>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
> >>>>>>>
> >>>>>>> $ ls -1
> >>>>>>> ls: cannot access 'file2-.?ext': No such file or directory
> >>>>>>> ls: cannot access 'file3-': No such file or directory
> >>>>>>> 'file1-'$'\360\237\214\241''.ext'
> >>>>>>> file2-.?ext
> >>>>>>> file3-
> >>>>>> I don't reproduce this.
> >>>> Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
> >>>> which needs to call stat(). Plain 'ls' does not, so the errors do not
> >>>> occur then.
> >>>>
> >>>>
> >>>>>> While the file name gets mangled, all resulting file names are valid
> >>>>>> and listed:
> >>>>>> In file2 the sequence is turned into U+17B3 but exchanged with the dot.
> >>>>>> In file3 the same sequence is just dropped.
> >>>>>> $ ls -1|cat
> >>>>>> file1-🌡.ext
> >>>>>> file2-.ឳext
> >>>>>> file3-
> >>>>>>
> >>>>>> However, ls file2* fails, as does ls *.
> >>>>> On the other hand, ls file3- fails too, so some mapping error occurs
> >>>>> internally.
> >>>>> Also, the files cannot be deleted from cygwin (need to use cmd).
> >>>> 'rm' using the original names works for file2-..., but not for file3-...
> >>>>
> >>>> $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
> >>>> removed 'file2-'$'\360\237\214''.ext'
> >>>>
> >>>> $ rm -v 'file3-'$'\xf0\x9f\x8c'
> >>>> rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
> >>>>
> >>> Further tests suggest that the problem only occurs with:
> >>> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
> >>> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
> >>> 'high surrogate' range (0xD800..0xDBFF).
> >> Makes perfect sense, the Windows kernel uses UTF16 internally.
> >
> >
> > Yes, but Cygwin does not provide consistent forward/reverse UTF-8 <-> UTF-16
> > mappings. This makes no sense:
> >
> > $ touch 'file-'$'\xed\xa0\x80''.ext'  # creates L"file-\xD800.ext" on NTFS
> >
> > $ strace ls -F
> > ...
> > ... fhandler_disk_file::readdir: 0 = readdir(...) (L"file-\xD800.ext" >
> > "file-\xE2\x9E\xB3.ext")
> > ...
> >   ... stat_worker: -1 = (\??\C:\cygwin64\tmp\file-?.ext,...)
> > ...
> > ls: cannot access 'file-?.ext': No such file or directory
> > file-?.ext
> >
> > $ rm -v 'file-'$'\xed\xa0\x80''.ext'
> > removed 'file-'$'\355\240\200''.ext'
> >
> > The UTF-8 sequence returned by readdir() decodes to U+27B3 (White-Feathered
> > Rightwards Arrow).
> >
> >
> > This could be fixed by handling UTF-8 of the surrogate range in the same way
> > as other invalid sequences: map each invalid byte to Unicode range U+F080 to
> > U+F0FF. This works as expected if the above UTF-8 sequence is truncated:
> >
> > $ touch 'file-'$'\xed\xa0''.ext' # creates L"file-\xF0ED\xF0A0.ext" on NTFS
> >
> > $ ls -F
> > 'file-'$'\355\240''.ext'
>
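
To make the proposed mapping concrete, here is a minimal Python sketch of a
symmetric "invalid byte <-> private-use code point" round trip. It is not
Cygwin's actual code: the offset 0xF000 is inferred from the
L"file-\xF0ED\xF0A0.ext" name quoted above, and the names PUA_BASE,
decode_name and encode_name are made up for this illustration.

```python
import codecs

# Assumption: undecodable byte 0xNN is stored as U+F0NN, as the NTFS name
# L"file-\xF0ED\xF0A0.ext" quoted above suggests.
PUA_BASE = 0xF000


def _bytes_to_pua(exc):
    """Error handler: map every undecodable byte to U+F000 + byte value."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua_escape", _bytes_to_pua)


def decode_name(raw: bytes) -> str:
    """Decode a file name, escaping invalid UTF-8 bytes instead of failing."""
    return raw.decode("utf-8", errors="pua_escape")


def encode_name(name: str) -> bytes:
    """Reverse mapping: turn U+F080..U+F0FF back into the original bytes."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    for raw in (b"file-\xed\xa0.ext", b"file-\xed\xa0\x80.ext",
                b"file3-\xf0\x9f\x8c"):
        name = decode_name(raw)
        print(raw, [hex(ord(c)) for c in name], encode_name(name) == raw)
```

With such a mapping the name handed back by a directory listing encodes to
the same bytes that were used to create the file, which is the property
missing in the surrogate case. Note that a name containing a genuine
U+F080..U+F0FF character would collide with the escape range; the sketch
ignores that.
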
> Surrogate halves are invalid for UTF-8 encoding; they should first be
> combined into a valid code point (a complete UTF-16 surrogate pair).
> The encoder should just fail if it encounters any invalid sequence!
> Handling surrogates or other invalid values as anything other than invalid turns
> the encoding into what has been called WTF-8 where W may be for Windows! ;^>
>
Nope, WTF-8 means "What the F*ck-8"!
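
For comparison, the strict behaviour Brian asks for is what a conforming
UTF-8 decoder already does. A small Python sketch, using only the byte
sequences from this thread (the helper names strict_decode and
lenient_decode are invented here; "surrogatepass" is Python's standard
error handler that lets encoded surrogates through, which is roughly the
WTF-8 style of leniency):

```python
from typing import Optional

ENCODED_HIGH_SURROGATE = b"\xed\xa0\x80"   # 3-byte form of U+D800
TRUNCATED_4BYTE = b"\xf0\x9f\x8c"          # U+1F321 minus its last byte
THERMOMETER = b"\xf0\x9f\x8c\xa1"          # complete U+1F321


def strict_decode(raw: bytes) -> Optional[str]:
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def lenient_decode(raw: bytes) -> str:
    """Decode while accepting encoded surrogates (WTF-8-like leniency)."""
    return raw.decode("utf-8", errors="surrogatepass")


if __name__ == "__main__":
    for label, raw in (("encoded high surrogate", ENCODED_HIGH_SURROGATE),
                       ("truncated 4-byte sequence", TRUNCATED_4BYTE),
                       ("complete U+1F321", THERMOMETER)):
        print(label, "->", "valid" if strict_decode(raw) is not None
              else "rejected")
```

Both problem cases from the thread are rejected by the strict decoder, while
the lenient variant turns the 3-byte sequence into a lone U+D800.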

Ced
-- 
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur

