readdir() returns inaccessible name if file was created with invalid UTF-8

Christian Franke Christian.Franke@t-online.de
Mon Sep 16 09:49:41 GMT 2024


Christian Franke via Cygwin wrote:
> Thomas Wolff via Cygwin wrote:
>>
>> On 15.09.2024 at 20:15, Thomas Wolff via Cygwin wrote:
>>> On 15.09.2024 at 19:47, Christian Franke via Cygwin wrote:
>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
>>>> does not refuse to create the file. Later readdir() returns a
>>>> different name which could not be used to access the file.
>>>>
>>>> Testcase with U+1F321 (Thermometer):
>>>>
>>>> $ uname -r
>>>> 3.5.4-1.x86_64
>>>>
>>>> $ printf $'\U0001F321' | od -A none -t x1
>>>>  f0 9f 8c a1
>>>>
>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
>>>>
>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>
>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
>>>>
>>>> $ ls -1
>>>> ls: cannot access 'file2-.?ext': No such file or directory
>>>> ls: cannot access 'file3-': No such file or directory
>>>> 'file1-'$'\360\237\214\241''.ext'
>>>> file2-.?ext
>>>> file3-
>>> I don't reproduce this.
>
> Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto' 
> which needs to call stat(). Plain 'ls' does not, so the errors do not 
> occur then.
>
>
>>>
>>> While the file name gets mangled, all resulting file names are valid
>>> and listed:
>>> In file2 the sequence is turned into U+17B3 but exchanged with the dot.
>>> In file3 the same sequence is just dropped.
>>> $ ls -1|cat
>>> file1-🌡.ext
>>> file2-.ឳext
>>> file3-
>>>
>>> However, ls file2* fails, as does ls *.
>> On the other hand, ls file3- fails too, so some mapping error occurs
>> internally.
>> Also, the files cannot be deleted from cygwin (need to use cmd).
>
> 'rm' using the original names works for file2-..., but not for file3-...
>
> $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
> removed 'file2-'$'\360\237\214''.ext'
>
> $ rm -v 'file3-'$'\xf0\x9f\x8c'
> rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
>

Further tests suggest that the problem only occurs with:
- incomplete 4-byte UTF-8 sequences (code points above U+FFFF)
- complete but invalid 3-byte UTF-8 sequences that encode the UTF-16
'high surrogate' range (0xD800..0xDBFF).
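
To probe both cases in one go, something like the following sketch may
help. It lists a directory with plain 'ls' (which, as noted above, does
not call stat()) and then checks whether each returned name can actually
be accessed. The function name check_names is made up for this example,
and the byte sequence ED A0 80 (U+D800 encoded as 3 bytes) is only an
assumed representative of the second case, not necessarily what was used
in the tests.

```shell
# Hypothetical helper: list directory $1 and report every name that
# plain ls (and thus readdir()) returned but that cannot be accessed.
# The body runs in a subshell, so no variables leak into the caller.
# Limitation: names containing newlines are not handled.
check_names() (
  dir=$1
  bad=0
  # 'command' bypasses aliases such as ls='ls --color=auto'
  names=$(command ls -A "$dir")
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    # -e follows symlinks; -L keeps dangling symlinks from being reported
    if [ ! -e "$dir/$name" ] && [ ! -L "$dir/$name" ]; then
      printf 'inaccessible: %s\n' "$name"
      bad=$((bad + 1))
    fi
  done <<EOF
$names
EOF
  printf '%d inaccessible\n' "$bad"
)

# Possible use in an empty scratch directory (bash $'...' quoting):
#   touch 'trunc-'$'\xf0\x9f\x8c''.ext'   # truncated 4-byte sequence
#   touch 'surr-'$'\xed\xa0\x80''.ext'    # assumed example: U+D800 as 3 bytes
#   check_names .
```

If the mangling described in this thread is present, the affected names
should show up as 'inaccessible'; on a system without the problem the
summary line should report 0.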


