readdir() returns inaccessible name if file was created with invalid UTF-8

Christian Franke Christian.Franke@t-online.de
Sun Jun 29 17:47:29 GMT 2025


Christian Franke wrote:
> Corinna Vinschen via Cygwin wrote:
>> On Jun 27 15:32, Christian Franke via Cygwin wrote:
>>> $ touch $'t-\xef\x80\x80'
>>> The name mapping is:
>>> "t-\xEF\x80\x80" -(open, ...)-> L"t-\xDB59" -(readdir)-> "t-"
>> Did you copy/paste this from the old mail, by any chance?
>
> Sorry, I accidentally mixed up two cases with the same readdir() result:
>
> "t-\xEF\x80\x80" -(open, ...)-> L"t-\xF000" -(readdir)-> "t-"
> "t-\xED\xAD\x99" -(open, ...)-> L"t-\xDB59" -(readdir)-> "t-"
>
> $ touch $'t-\xed\xad\x99'
> $ touch $'t-\xef\x80\x80'
> $ ls | uniq -c
>       2 t-
>
> This no longer occurs in 3.7.0-0.165.g1b60f4861b70, but see below.
> ...
>> ...
>> I'll apply the patch shortly.
>
> $ touch $'t-\xed\xad\x90'
> $ touch $'t-\xed\xad\x91'
> $ touch $'t-\xed\xad\x92'
> $ touch $'t-\xed\xad\x93'
> $ touch $'t-\xed\xad\x94'
> $ ls | uniq -c
>       5 t-
>
> $ ls -s
> ls: cannot access 't-': No such file or directory
> ls: cannot access 't-': No such file or directory
> ls: cannot access 't-': No such file or directory
> ls: cannot access 't-': No such file or directory
> ls: cannot access 't-': No such file or directory
> total 0
> ? t-  ? t-  ? t-  ? t-  ? t-
>
> All failures found in several runs of the attached test program with
> different seeds have in common that the Windows path name contains an
> invalid word in the UTF-16 high surrogate range:
>
> $ ./randnames 42
> $'t-\xEC\x9E\xB3\xEF\x82\x80\xEF\x83\xA0': access() failed, errno=2:
> $'t-\xED\xA4\xA8\x80\xE0': original path
> L"t-\xD928\xF080\xF0E0": Windows path
>
> $'t-\xEE\x9E\xB3\xEF\x83\xA1': access() failed, errno=2:
> $'t-\xED\xA6\xB0\xE1': original path
> L"t-\xD9B0\xF0E1": Windows path
> ...
> $'t-\xE7\xBE\xB3\xEF\x82\xB3': access() failed, errno=2:
> $'t-\xED\xA2\x96\xB3': original path
> L"t-\xD896\xF0B3": Windows path
>
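
The forward half of the mapping for the first two names quoted above
can be checked outside Cygwin. A minimal Python sketch, assuming only
that a 3-byte sequence in the surrogate range is decoded to the
corresponding UTF-16 word (Python's "surrogatepass" error handler
behaves this way). utf16_units() is a hypothetical helper name, and
this is not Cygwin's conversion code:

```python
def utf16_units(name: bytes) -> list:
    """Decode bytes as UTF-8, letting surrogate code points through,
    and return the resulting UTF-16 code units as integers."""
    text = name.decode("utf-8", errors="surrogatepass")
    raw = text.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)]


# The two file names from the quoted "touch" commands.
for name in (b"t-\xef\x80\x80", b"t-\xed\xad\x99"):
    print(name, "->", [f"{u:04X}" for u in utf16_units(name)])
```

The last word printed for each name should match the L"..." form
quoted above (F000 and DB59 respectively).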

A closer look reveals two problems:

1.) A lone high surrogate is not encoded correctly. This could be fixed
with the following patch:
https://cygwin.com/pipermail/cygwin-patches/2025q2/014001.html

2.) A high surrogate at the very end of the string is not encoded at
all. A fix would require enhancing the interface between __*_wctomb()
and the outer functions: the outer loop would need to call the function
once more after L'\0' has occurred.
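
The two cases differ only in where the unpaired high surrogate sits.
A small Python sketch that classifies the UTF-16 words of a Windows
path accordingly. lone_high_surrogates() is a hypothetical helper for
illustration and is unrelated to the __*_wctomb() code:

```python
def lone_high_surrogates(units):
    """Return (index, at_end) for every high surrogate (0xD800-0xDBFF)
    that is not followed by a low surrogate (0xDC00-0xDFFF)."""
    found = []
    for i, u in enumerate(units):
        if 0xD800 <= u <= 0xDBFF:
            nxt = units[i + 1] if i + 1 < len(units) else None
            if nxt is None or not (0xDC00 <= nxt <= 0xDFFF):
                # at_end is True when nothing follows the surrogate.
                found.append((i, nxt is None))
    return found


# Windows paths taken from the output quoted above.
print(lone_high_surrogates([0x74, 0x2D, 0xD928, 0xF080, 0xF0E0]))  # case 1
print(lone_high_surrogates([0x74, 0x2D, 0xDB59]))                  # case 2
```

An at_end value of False corresponds to problem 1, True to problem 2.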

BTW: if the file name consists only of a single high surrogate, an 
interesting corner case of readdir() is visible:

$ echo foo >$'\uD876' # Windows name: L"\xD876"
$ cat $'\uD876'
foo
$ ls
$ ls -a | uniq -c
       1 .
       2 ..

-- 
Regards,
Christian


