utf-8 and cygwin

Gregg Tavares unison@greggman.com
Fri Dec 28 00:47:00 GMT 2007


Hello, I'm new to the list and I hope I can be helpful.

I got here while trying to get rsync, and then unison, to sync my music files, many of which have Japanese filenames, between FC6 and XP.

I think I have narrowed the problems down to two things.

#1) No UTF-8 support in cygwin

I see that someone says they are working on it for a future release, but a search also brought up this patch:

http://www.okisoft.co.jp/esc/utf8-cygwin/

which already does that. I thought whoever is working on future UTF-8 support might want to look at it.

After trying that out, both unison and rsync started working without a recompile for short filenames, which brings up the second issue.

#2) The filename size limit in cygwin is too short.

Unfortunately the names had to be pretty short. I think the issue is twofold: first, cygwin's maximum filename length may simply be too short; second, whatever it is set to, UTF-8 names will need more bytes than that limit allows. If I remember correctly, one Unicode character can take up to 4 bytes in UTF-8. That means, for example, that a typical Japanese MP3 filename stored as Album-Name/Song-Name will easily exceed 255 bytes once it is expanded to UTF-8. The check for size overflow comes before the UTF-8 is converted to wide-character UTF-16. Most Japanese characters take 3 bytes in UTF-8 (4 only for characters outside the Basic Multilingual Plane), so a Japanese path of 120 Unicode characters could easily be 360 bytes of UTF-8, and up to 480 bytes in the worst case.
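To make the arithmetic concrete, here is a rough sketch in Python (not cygwin's C, and not taken from the patch) that measures the same string three ways. The function name `measure` and the sample strings are made up for illustration. It relies on the fact that kana and common kanji sit in the Basic Multilingual Plane and take 3 bytes each in UTF-8, while characters outside it take 4 bytes and two UTF-16 code units.

```python
# Measure one string as Unicode code points, UTF-8 bytes, and UTF-16 code
# units (the unit the Windows wide-character APIs count in).

def measure(path: str) -> dict:
    return {
        "code_points": len(path),
        "utf8_bytes": len(path.encode("utf-8")),
        # utf-16-le emits no byte-order mark, so bytes // 2 is the unit count.
        "utf16_units": len(path.encode("utf-16-le")) // 2,
    }

# Hypothetical 120-character name built from two BMP kanji.
sample = "音楽" * 60
print(measure(sample))

# A single character outside the BMP (U+20BB7) for the worst case.
print(measure("\U00020BB7"))
```

The 120-character sample comes out well past 255 bytes in UTF-8 while still being only 120 UTF-16 units, which is why a byte-count check made before the conversion rejects names that the wide-character side could hold.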

My difficulty is that I'm not that familiar with the cygwin source or the issues involved in changing the filename limit.

For example, MAXPATHLEN is defined in winsup/cygwin/include/sys/param.h as (260 - 1),
PATH_MAX is defined in winsup/cygwin/include/limits.h as 260,
NAME_MAX is also defined in winsup/cygwin/include/limits.h as 255,
and there's even _POSIX_PATH_MAX, defined as 255.

It seems like, in order to get UTF-8 to work, all of those have to increase by 4x, except maybe _POSIX_PATH_MAX, although even that, by its name, seems like it should be changed. Is that the MAX for POSIX, or is that the MIN for POSIX?
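As an illustration of what the 4x change would buy, here is a small Python sketch that checks a path against byte limits. It is only a model of the idea, not how cygwin performs the check: the helper `check_path`, the constant `UTF8_WORST_CASE`, and the sample path are hypothetical, the limit values are the ones quoted above, and treating PATH_MAX as including the terminating NUL is an assumption.

```python
# Limits quoted above from the cygwin headers, counted in bytes.
PATH_MAX = 260
NAME_MAX = 255
UTF8_WORST_CASE = 4  # maximum bytes per code point in UTF-8

def check_path(path, path_max=PATH_MAX, name_max=NAME_MAX):
    """Return a list of problems when the path is measured in UTF-8 bytes."""
    problems = []
    encoded = path.encode("utf-8")
    # Assumption: path_max counts the terminating NUL, so path_max - 1 usable bytes.
    if len(encoded) > path_max - 1:
        problems.append("path too long: %d bytes" % len(encoded))
    # Each component between slashes is checked against name_max.
    for component in path.split("/"):
        n = len(component.encode("utf-8"))
        if n > name_max:
            problems.append("component too long: %d bytes" % n)
    return problems

# Hypothetical Album-Name/Song-Name path: 48 + 1 + 80 = 129 characters.
album_song = "アルバム" * 12 + "/" + "曲名" * 40

print(check_path(album_song))
print(check_path(album_song,
                 PATH_MAX * UTF8_WORST_CASE,
                 NAME_MAX * UTF8_WORST_CASE))
```

With the current values the sample fails on total path length even though each component fits under NAME_MAX; with both limits scaled by 4 it passes.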

I'm not sure how I can help. If someone is already integrating UTF-8 support, I don't want to step on any toes; the most I can do then is help with testing. Otherwise, I was going to suggest adding the patches above and increasing those limits, if that won't break anything.

Thoughts? Suggestions?


