readdir() returns inaccessible name if file was created with invalid UTF-8

Corinna Vinschen corinna-cygwin@cygwin.com
Fri Jun 27 10:30:47 GMT 2025


Hi Christian,

On Jun 26 19:07, Christian Franke via Cygwin wrote:
> Corinna Vinschen via Cygwin wrote:
> > On Jun 25 16:59, Christian Franke via Cygwin wrote:
> > > On Sun, 15 Sep 2024 19:47:11 +0200, Christian Franke wrote:
> > > > If a file name contains an invalid (truncated) UTF-8 sequence, open()
> > > > does not refuse to create the file. Later readdir() returns a different
> > > > name which could not be used to access the file.
> > > > 
> > > > Testcase with U+1F321 (Thermometer):
> > > > 
> > > > $ uname -r
> > > > 3.5.4-1.x86_64
> > > > 
> > > > $ printf $'\U0001F321' | od -A none -t x1
> > > >   f0 9f 8c a1
> > > > 
> > > > $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
> > > > 
> > > > $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
> > > > 
> > > > $ touch 'file3-'$'\xf0\x9f\x8c'
> > > > 
> > > > $ ls -1
> > > > ls: cannot access 'file2-.?ext': No such file or directory
> > > > ls: cannot access 'file3-': No such file or directory
> > > > 'file1-'$'\360\237\214\241''.ext'
> > > > file2-.?ext
> > > > file3-
> > > > [...]
> > I don't know exactly where this happens, but the input of the
> > conversion is invalid UTF-8 because it's missing the 4th byte.
> > There's no way to represent these filenames on Windows
> > filesystems storing filenames as UTF-16 values.
> > 
> > So the problem here is that the conversion somehow misses that
> > the 4th byte is invalid and just plods forward and converts the
> > leading three bytes into the matching high surrogate value and
> > then stumbles over the conversion for the low surrogate.
> > 
> > It would be really helpful to have an STC for this problem.
> 
> With some trial and error I found a testcase for this more serious problem
> reported yesterday but not quoted above:
> 
> > > In cases like file3-... above, the converted Windows path ends with
> > > 0xF000. This suggests that this is an accidental conversion of the
> > > terminating null to the 0xF0xx range.
> > > 
> > > In some cases, the created Windows file name has random garbage
> > > behind the 0xF000. Then even Cygwin is not able to access or unlink
> > > the file after creation.
> 
> Testcase (attached):

Thanks for the testcase!

I found the problem in the newlib core function creating wchar_t from
UTF-8 input.  In case of 4-byte UTF-8 sequences, the code created the
high surrogate already after reading byte 3, without checking if byte 4
of the UTF-8 sequence is a valid byte. Hilarity ensues.
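
To illustrate the kind of check that was missing, here is a minimal
sketch, not the actual newlib code.  The helper name
utf8_4byte_to_utf16 and its interface are made up for this example.
It assumes the whole sequence is available at once and validates all
three continuation bytes before it emits either surrogate:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT newlib's converter: decode one 4-byte UTF-8
   sequence into a UTF-16 surrogate pair.  Returns 4 on success, -1 if
   the sequence is truncated or malformed. */
static int
utf8_4byte_to_utf16 (const unsigned char *s, size_t n, uint16_t out[2])
{
  uint32_t cp;

  /* A 4-byte sequence starts with 11110xxx and needs 4 input bytes. */
  if (n < 4 || (s[0] & 0xf8) != 0xf0)
    return -1;
  /* Check every continuation byte (10xxxxxx), including byte 4,
     *before* producing any output. */
  for (int i = 1; i < 4; i++)
    if ((s[i] & 0xc0) != 0x80)
      return -1;
  cp = ((uint32_t) (s[0] & 0x07) << 18)
       | ((uint32_t) (s[1] & 0x3f) << 12)
       | ((uint32_t) (s[2] & 0x3f) << 6)
       | (uint32_t) (s[3] & 0x3f);
  /* Reject overlong encodings and values beyond U+10FFFF. */
  if (cp < 0x10000 || cp > 0x10ffff)
    return -1;
  cp -= 0x10000;
  out[0] = (uint16_t) (0xd800 | (cp >> 10));   /* high surrogate */
  out[1] = (uint16_t) (0xdc00 | (cp & 0x3ff)); /* low surrogate */
  return 4;
}
```

With this, f0 9f 8c a1 (U+1F321) yields the pair 0xD83C 0xDF21, while
f0 9f 8c followed by '.' or by the terminating null is rejected
instead of producing a lone high surrogate.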

Fortunately this bug has only been introduced very recently, to wit, on
2009-03-24, a mere 16 years ago.  And it is my bug and mine alone :}

I'm just prep'ing a fix which I'll push in a minute or two.


Thanks,
Corinna

