readdir() returns inaccessible name if file was created with invalid UTF-8

Thu Sep 19 17:30:05 GMT 2024

Brian Inglis via Cygwin wrote:
> On 2024-09-19 07:27, Christian Franke via Cygwin wrote:
>> Mark Liam Brown via Cygwin wrote:
>>> On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin
>>> <cygwin@cygwin.com> wrote:
>>>> Christian Franke via Cygwin wrote:
>>>>> Thomas Wolff via Cygwin wrote:
>>>>>> Am 15.09.2024 um 20:15 schrieb Thomas Wolff via Cygwin:
>>>>>>> Am 15.09.2024 um 19:47 schrieb Christian Franke via Cygwin:
>>>>>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
>>>>>>>> does not refuse to create the file. Later readdir() returns a
>>>>>>>> different name which could not be used to access the file.
>>>>>>>>
>>>>>>>> Testcase with U+1F321 (Thermometer):
>>>>>>>>
>>>>>>>> $ uname -r
>>>>>>>> 3.5.4-1.x86_64
>>>>>>>>
>>>>>>>> $ printf $'\U0001F321' | od -A none -t x1
>>>>>>>>   f0 9f 8c a1
>>>>>>>>
>>>>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
>>>>>>>>
>>>>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>>>>>
>>>>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
>>>>>>>>
>>>>>>>> $ ls -1
>>>>>>>> ls: cannot access 'file2-.?ext': No such file or directory
>>>>>>>> ls: cannot access 'file3-': No such file or directory
>>>>>>>> 'file1-'$'\360\237\214\241''.ext'
>>>>>>>> file2-.?ext
>>>>>>>> file3-
>>>>>>> I don't reproduce this.
>>>>> Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
>>>>> which needs to call stat(). Plain 'ls' does not, so the errors do not
>>>>> occur then.
>>>>>
>>>>>
>>>>>>> While the file name gets mangled, all resulting file names are valid
>>>>>>> and listed:
>>>>>>> In file2 the sequence is turned into U+17B3 but exchanged with the dot.
>>>>>>> In file3 the same sequence is just dropped.
>>>>>>> $ ls -1|cat
>>>>>>> file1-🌡.ext
>>>>>>> file2-.ឳext
>>>>>>> file3-
>>>>>>>
>>>>>>> However, ls file2* fails, as does ls *.
>>>>>> On the other hand, ls file3- fails too, so some mapping error occurs
>>>>>> internally.
>>>>>> Also, the files cannot be deleted from cygwin (need to use cmd).
>>>>> 'rm' using the original names works for file2-..., but not for file3-...
>>>>>
>>>>> $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>> removed 'file2-'$'\360\237\214''.ext'
>>>>>
>>>>> $ rm -v 'file3-'$'\xf0\x9f\x8c'
>>>>> rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
>>>>>
>>>> Further tests suggest that the problem only occurs with:
>>>> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
>>>> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
>>>> 'high surrogate' range (0xD800..0xDBFF).
>>> Makes perfect sense, the Windows kernel uses UTF-16 internally.
>>
>>
>> Yes, but Cygwin does not provide consistent forward/reverse UTF-8 <-> 
>> UTF-16 mappings. This makes no sense:
>>
>> $ touch 'file-'$'\xed\xa0\x80''.ext'  # creates L"file-\xD800.ext" on NTFS
>>
>> $ strace ls -F
>> ...
>> ... fhandler_disk_file::readdir: 0 = readdir(...) (L"file-\xD800.ext" > "file-\xE2\x9E\xB3.ext")
>> ...
>>   ... stat_worker: -1 = (\??\C:\cygwin64\tmp\file-?.ext,...)
>> ...
>> ls: cannot access 'file-?.ext': No such file or directory
>> file-?.ext
>>
>> $ rm -v 'file-'$'\xed\xa0\x80''.ext'
>> removed 'file-'$'\355\240\200''.ext'
>>
>> The UTF-8 sequence returned by readdir() decodes to U+27B3 
>> (White-Feathered Rightwards Arrow).
>>
>>
>> This could be fixed by handling UTF-8 of the surrogate range similar
>> to other invalid sequences: Map each invalid byte to Unicode range
>> U+F080 to U+F0FF. This works as expected if the above UTF-8 sequence
>> is truncated:
>>
>> $ touch 'file-'$'\xed\xa0''.ext'  # creates L"file-\xF0ED\xF0A0.ext" on NTFS
>>
>> $ ls -F
>> 'file-'$'\355\240''.ext'
>
> Surrogate halves are invalid for UTF-8 encoding; they should first be
> encoded as a valid UTF-16 code point.
> The encoder should just fail if it encounters any invalid sequence!
> Handling surrogates or other invalid values as anything other than 
> invalid turns the encoding into what has been called WTF-8 where W may 
> be for Windows! ;^>

:-)

I guess the idea behind Cygwin's filename mapping was to emulate Linux 
behavior as far as possible. AFAICS, Linux accepts any nonempty byte 
string without slash as a plain filename and leaves the interpretation 
(UTF-8?) to the userland.

Cygwin maps 0x20..0x7f and valid UTF-8 sequences to UTF-16. Control 
chars and bytes from invalid UTF-8 sequences are mapped to the U+F0xx 
range. It should handle UTF-8 sequences which lead to the surrogate 
range the same way but currently does not.
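
To make the intended behavior concrete, here is a minimal Python sketch of 
the mapping described above, with surrogate-range UTF-8 treated like any 
other invalid sequence. It is only a model of the idea, not Cygwin's 
actual code: the names PUA_BASE, to_utf16_units() and from_utf16_units() 
are made up for this example, and anything not mentioned in this thread 
is ignored. It relies on Python's strict UTF-8 decoder, which already 
rejects truncated sequences and encoded surrogates.

```python
PUA_BASE = 0xF000  # start of the U+F0xx range named in the text


def to_utf16_units(name: bytes) -> list[int]:
    """Model of the forward mapping: filename bytes -> UTF-16 code units."""
    units = []
    i = 0
    while i < len(name):
        b = name[i]
        if 0x20 <= b <= 0x7F:
            units.append(b)  # passed through unchanged
            i += 1
            continue
        # Expected sequence length from the lead byte; 0 means the byte is a
        # control char, a stray continuation byte, or an invalid lead byte.
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            length = 0
        char = None
        if length:
            try:
                # Strict decode: fails on truncation and on the surrogate
                # range (e.g. ED A0 80), which is the proposed behavior.
                char = name[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                char = None
        if char is None:
            units.append(PUA_BASE + b)  # map this single byte to U+F0xx
            i += 1
            continue
        cp = ord(char)
        if cp >= 0x10000:
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))    # high surrogate
            units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
        else:
            units.append(cp)
        i += length
    return units


def from_utf16_units(units: list[int]) -> bytes:
    """Model of the reverse mapping: UTF-16 code units -> filename bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        if PUA_BASE <= u <= PUA_BASE + 0xFF:
            out.append(u - PUA_BASE)  # undo the single-byte mapping
            i += 1
        elif (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
              and 0xDC00 <= units[i + 1] <= 0xDFFF):
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out += chr(cp).encode("utf-8")
            i += 2
        else:
            # Lone surrogates are never produced by to_utf16_units(); they
            # are encoded WTF-8 style here only to avoid an exception.
            out += chr(u).encode("utf-8", "surrogatepass")
            i += 1
    return bytes(out)
```

With this model, to_utf16_units(b'file-\xed\xa0\x80.ext') yields 
"file-" followed by 0xF0ED 0xF0A0 0xF080 and ".ext", and 
from_utf16_units() returns the original bytes, so readdir() would hand 
back a name that stat() can use. One limitation of the sketch: a name 
that contains a genuine U+F0xx character as valid UTF-8 would not 
round-trip, because the reverse mapping cannot tell it apart from a 
mapped byte.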