[newlib-cygwin/main] Cygwin: _sys_mbstowcs: fix handling invalid 4-byte UTF-8 sequences
Corinna Vinschen
corinna@sourceware.org
Thu Jul 24 10:30:05 GMT 2025
https://sourceware.org/git/gitweb.cgi?p=newlib-cygwin.git;h=1463b41d403e861e4033387cdc71006e1664203a
commit 1463b41d403e861e4033387cdc71006e1664203a
Author: Corinna Vinschen <corinna@vinschen.de>
AuthorDate: Wed Jul 23 22:42:01 2025 +0200
Commit: Corinna Vinschen <corinna@vinschen.de>
CommitDate: Thu Jul 24 11:23:27 2025 +0200
Cygwin: _sys_mbstowcs: fix handling invalid 4-byte UTF-8 sequences
When a 4 byte utf-8 sequence has an invalid 4th byte, it's actually
an invalid 3 byte sequence. In this case we already generated the
high surrogate and only realize the problem when byte 4 doesn't
match.
At this point _sys_mbstowcs transposes the invalid 4th byte into the
private use area.
This is wrong. The invalid byte sequence here is the 3 byte sequence
already converted to a high surrogate, not the trailing 4th byte.
Fix this by backtracking to the start of the broken sequence and
overwrite the already written high surrogate with a sequence of
the original three bytes transposed to the private use area. Reset
the mbstate and restart normal conversion at the non-matching
4th byte, which might start a new multibyte sequence.
The resulting wide-char string can be converted back to multibyte
and back again to wide-char, and the result will be identical, even
if the multibyte sequence differs from the original sequence.
Fixes: e44b9069cd227 ("* strfuncs.cc (sys_cp_mbstowcs): Treat src as unsigned char *. Convert failure of f_mbtowc into a single malformed utf-16 value.")
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
Diff:
---
winsup/cygwin/strfuncs.cc | 51 +++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 47 insertions(+), 4 deletions(-)
diff --git a/winsup/cygwin/strfuncs.cc b/winsup/cygwin/strfuncs.cc
index cb7911c6b83d..caaf6b786295 100644
--- a/winsup/cygwin/strfuncs.cc
+++ b/winsup/cygwin/strfuncs.cc
@@ -1071,6 +1071,7 @@ _sys_mbstowcs (mbtowc_p f_mbtowc, wchar_t *dst, size_t dlen, const char *src,
{
wchar_t *ptr = dst;
unsigned const char *pmbs = (unsigned const char *) src;
+ unsigned const char *got_high_surrogate = NULL;
size_t count = 0;
size_t len = dlen;
int bytes;
@@ -1142,16 +1143,58 @@ _sys_mbstowcs (mbtowc_p f_mbtowc, wchar_t *dst, size_t dlen, const char *src,
Invalid bytes in a multibyte sequence are converted to
the private use area which is already used to store ASCII
- chars invalid in Windows filenames. This technque allows
+ chars invalid in Windows filenames. This technique allows
to store them in a symmetric way. */
- bytes = 1;
- if (dst)
- *ptr = L'\xf000' | *pmbs;
+
+ /* Special case high surrogate: if we already converted the first
+ 3 bytes of a sequence to a high surrogate, and only then encounter
+ a non-matching forth byte, the sequence is simply cut short. In
+ that case not the currently handled 4th byte is the invalid
+ sequence, but the 3 bytes converted to the high surrogate. So we
+ have to backtrack to the high surrogate and convert it to a
+ sequence of bytes in the private use area. Next, reset the
+ mbstate and retry to convert starting at the current byte. */
+ if (got_high_surrogate)
+ {
+ if (dst)
+ {
+ --ptr;
+ *ptr++ = L'\xf000' | *got_high_surrogate++;
+ /* we know len > 0 at this point */
+ *ptr++ = L'\xf000' | *got_high_surrogate++;
+ }
+ --len;
+ if (len > 0)
+ {
+ if (dst)
+ *ptr++ = L'\xf000' | *got_high_surrogate++;
+ --len;
+ }
+ count += 2; /* Actually 3, but we already counted one when
+ generating the high surrogate. */
+ memset (&ps, 0, sizeof ps);
+ continue;
+ }
+ /* Never convert ASCII NUL */
+ if (*pmbs)
+ {
+ bytes = 1;
+ if (dst)
+ *ptr = L'\xf000' | *pmbs;
+ }
memset (&ps, 0, sizeof ps);
}
+ got_high_surrogate = NULL;
if (bytes > 0)
{
+ /* Check if we got the high surrogate from a UTF-8 4 byte sequence.
+ This is used above to handle an invalid 4 byte sequence cut short
+ at byte 3. */
+ /* FIXME: do we need an equivalent check for gb18030? */
+ if (bytes == 3 && ps.__count == 4 && f_mbtowc == __utf8_mbtowc)
+ got_high_surrogate = pmbs;
+
pmbs += bytes;
nms -= bytes;
++count;
More information about the Cygwin-cvs
mailing list