utf-8 and cygwin

Gregg Tavares unison@greggman.com
Fri Dec 28 20:25:00 GMT 2007


Excuse me if I'm naive, but looking at the code, there are some issues if you want to support the 32K-character limit.

#1 is that the NT/XP limit is 32000 UTF-16 wide characters. Expanded to UTF-8, that makes the longest name 128K, so if you really want this to work for 32K-character names, PATH_MAX is going to have to be 128K.
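
As a rough check of that arithmetic, here is a small Python sketch. The helper names and the two sample characters are made up for illustration, and 32000 is just the approximate figure used in this thread. It suggests 128K is a safe upper bound: the tightest worst case seems to be 3 UTF-8 bytes per UTF-16 unit, because characters that need 4 UTF-8 bytes also take 2 UTF-16 units.

```python
# Worst-case UTF-8 size of a name limited to a fixed number of UTF-16 code units.
MAX_UTF16_UNITS = 32000  # approximate figure from the thread, not an exact Windows constant


def utf8_len(s: str) -> int:
    """Number of bytes the string occupies in UTF-8."""
    return len(s.encode("utf-8"))


def utf16_units(s: str) -> int:
    """Number of 16-bit code units the string occupies in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


bmp_char = "\u20ac"          # EURO SIGN: 1 UTF-16 unit, 3 UTF-8 bytes
astral_char = "\U0001F600"   # outside the BMP: 2 UTF-16 units, 4 UTF-8 bytes

# Fill the whole UTF-16 budget with each kind of character.
bmp_name = bmp_char * MAX_UTF16_UNITS
astral_name = astral_char * (MAX_UTF16_UNITS // 2)

worst_case_bytes = max(utf8_len(bmp_name), utf8_len(astral_name))
budget_128k = 128 * 1024

print(utf16_units(bmp_name), utf8_len(bmp_name))
print(utf16_units(astral_name), utf8_len(astral_name))
print(worst_case_bytes <= budget_128k)
```

So a 128K PATH_MAX would leave some headroom under these assumptions.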

> > #2) The filename size limit in cygwin is too short.
> 
> This is a Windows limitation when using the ANSI versions of the Win32
> api. Windows limits paths to 260 (including the leading drive
> letter+colon+backslash and null) and there's nothing Cygwin can do about
> this. Just changing some defines would not make longer paths work.

No, but changing the defines and then, at the last second, calling MultiByteToWideChar followed by the corresponding fooW function would.
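
To make the "last second" idea concrete, here is a portable sketch of the pattern in Python. It is not Cygwin code: `fake_create_file_w` is a hypothetical stand-in for some fooW function, and the encode/decode pair stands in for what MultiByteToWideChar with CP_UTF8 would do on Windows.

```python
def utf16_for_win32(utf8_path: bytes) -> bytes:
    """Convert a UTF-8 path to NUL-terminated UTF-16LE, the form a W function takes."""
    return utf8_path.decode("utf-8").encode("utf-16-le") + b"\x00\x00"


def fake_create_file_w(wide_path: bytes) -> int:
    """Hypothetical stand-in for a fooW call.

    Returns the path length in UTF-16 units, not counting the terminator.
    """
    assert wide_path.endswith(b"\x00\x00")
    return len(wide_path) // 2 - 1


def create_file(utf8_path: bytes) -> int:
    # Callers pass plain UTF-8 bytes around; the conversion happens only here,
    # immediately before the wide call.
    return fake_create_file_w(utf16_for_win32(utf8_path))


units = create_file("C:\\データ\\file.txt".encode("utf-8"))
print(units)
```

Whether this counts as a wrapper in the sense that was rejected is exactly the question; the sketch only shows where the conversion would sit.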

> > http://www.okisoft.co.jp/esc/utf8-cygwin/
>
> This patch has already been studied and rejected for inclusion in
> Cygwin. The reason is that it just adds some wrappers around certain
> functions, but it does not cause Cygwin to actually use Unicode
> throughout.
 
I'd like to know more about what it means for Cygwin to actually use Unicode. The whole point of UTF-8 was that most code would just work as is. It's an 8-bit encoding that doesn't conflict with ASCII, so you pass the names around as UTF-8, and the rest of the code will work as long as the buffers are big enough. There are no conflicts with any punctuation, so filenames are not going to get split in the wrong place or anything like that. Then, when you finally call a win32api function that takes a string, you call MultiByteToWideChar just before calling the desired function, and vice versa for strings coming back.
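
Here is a small Python sketch of the "no conflicts with punctuation" point. The sample path is made up. It splits the raw UTF-8 bytes on "/" and checks that each piece is still valid UTF-8, then contrasts that with Shift_JIS, where (as far as I know) the second byte of some characters is the same byte as a backslash.

```python
path = "/home/グレッグ/表計算.txt"   # hypothetical path mixing ASCII and Japanese
raw = path.encode("utf-8")

# Byte-oriented code that only knows about ASCII '/' still splits correctly,
# because every byte of a multi-byte UTF-8 sequence is >= 0x80.
components = raw.split(b"/")
decoded = [c.decode("utf-8") for c in components]

slash_count_bytes = raw.count(b"/")
slash_count_chars = path.count("/")

# Contrast: in Shift_JIS the character below ends in byte 0x5C, i.e. '\'.
sjis = "表".encode("shift_jis")
sjis_has_backslash = b"\\" in sjis
utf8_has_backslash = b"\\" in "表".encode("utf-8")

print(decoded)
print(slash_count_bytes, slash_count_chars)
print(sjis_has_backslash, utf8_has_backslash)
```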

The only issues I'm aware of are things like upper/lower-case functions and possibly regular-expression-like functions. In general, those will still work with no changes as well.
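
A quick Python sketch of what I mean, using byte strings as a stand-in for C code that only knows ASCII. The sample names are made up. ASCII letters change case as usual and non-ASCII bytes pass through untouched, so nothing gets corrupted, though non-ASCII letters are not case-mapped and a byte-wise "." matches one byte rather than one character.

```python
import re

name = "Café.TXT".encode("utf-8")

# bytes.lower() maps only ASCII A-Z, like a plain C tolower() loop would.
lowered = name.lower()

# A non-ASCII capital letter is left alone rather than mangled.
mixed = "CAFÉ".encode("utf-8").lower()

# A byte-oriented regex sees the two bytes of "é" as two separate matches for ".".
dot_matches = re.findall(rb".", "é".encode("utf-8"))

# Literal multi-byte text still matches as an ordinary byte sequence.
literal_hit = re.search(re.escape("café".encode("utf-8")), lowered) is not None

print(lowered.decode("utf-8"))
print(mixed.decode("utf-8"))
print(len(dot_matches), literal_hit)
```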

I've spent the last 4 hours trying to find some discussion of these issues on the ML, but I guess my search kung-fu is weak, unless there is some other place I should be looking.


----- Original Message ----
> From: Brian Dessent <brian@dessent.net>
> To: cygwin-developers@cygwin.com
> Sent: Friday, December 28, 2007 6:35:13 AM
> Subject: Re: utf-8 and cygwin
> 
> Gregg Tavares wrote:
> 
> > I see that someone says they are working on it for a future release but a search also brought up this patch
> > 
> > http://www.okisoft.co.jp/esc/utf8-cygwin/
> > 
> > which already does that. I thought maybe whoever is working on future utf-8 support might want to look at that.
> 
> This patch has already been studied and rejected for inclusion in
> Cygwin. The reason is that it just adds some wrappers around certain
> functions, but it does not cause Cygwin to actually use Unicode
> throughout.
> 
> > after trying that out both unison and rsync started working with out a recompile for short filenames which brings up the second issue
> > 
> > #2) The filename size limit in cygwin is too short.
> 
> This is a Windows limitation when using the ANSI versions of the Win32
> api. Windows limits paths to 260 (including the leading drive
> letter+colon+backslash and null) and there's nothing Cygwin can do about
> this. Just changing some defines would not make longer paths work.
> 
> If you switch to the wide character versions of the Win32 API you can
> handle paths as long as 32000 Unicode characters. And that is why
> Corinna has been slowly refactoring the whole code base so that all
> paths are stored and manipulated internally as Unicode strings, which
> leads to calling the W versions of the win32api (or the Native version,
> since we've decided to no longer support Win9x) without any wrappers.
> 
> What's in CVS now is a work in progress. It is not finished, and so you
> can't expect it to do anything differently yet. It's a lot of work to
> refactor every bit of code that touches a file/pathname to handle
> unicode.
> 
> Brian
>


