This is the mail archive of the guile@cygnus.com mailing list for the guile project.
The Guile team is beginning to work on cleaning up Guile's support for wide character sets, as part of a general push for i18n. I would like to hear your thoughts on which directions we should take.

At the moment, Guile Scheme has complex and insufficient support for wide character sets, which I won't go into. We are considering redesigning the string representation and the I/O ports.

If one uses variable-width characters, one has a host of problems. If the interpreter attempts to conceal the fact that characters are variable-width, it is difficult to make string-length, string-ref, and string-set! work in constant time; string-set! might change the length of the string's underlying storage; and so on. If the interpreter exposes the variable-width representation to the programmer, this just passes the buck, making the programmer responsible for handling the encoding. Neither of these tactics is attractive, so variable-width characters seem problematic.

If one tries to use multiple character encodings in memory, then one must provide for transparent conversion whenever strings are compared, combined, hashed, etc. This sounds like a bad idea, too.

The MULE character representation seems like a bad idea to me, because it has all the problems of both of the above techniques; its only advantage is that it saves space if one uses only 8-bit characters.

Thus, my current inclinations:

- Use 16-bit characters in strings throughout.
- Prescribe the use of Unicode throughout.
- Provide functions to convert between Unicode character strings and all other widely-used formats: UTF-8, UTF-7, Latin-1, and the JIS variants, as well as anything else people would like to contribute.
- Provide a separate "byte array" type, for applications which genuinely want this.

We may implement the 16-bit character strings in odd ways that save space when the upper bytes of all the characters are zero, but that's a separate issue.
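
To make the trade-off above concrete, here is a minimal sketch of indexing into a variable-width (UTF-8) buffer versus a fixed-width 16-bit buffer. It is written in Python purely for illustration and is not Guile code; the names utf8_ref, utf8_set, and fixed16_ref are hypothetical, and the 16-bit version assumes every character fits in one 16-bit unit.

```python
def _lead_width(b: int) -> int:
    """Byte length of a UTF-8 sequence, judged from its lead byte."""
    if b < 0x80:
        return 1
    if b >> 5 == 0b110:
        return 2
    if b >> 4 == 0b1110:
        return 3
    return 4


def _utf8_span(buf: bytes, k: int) -> tuple[int, int]:
    """Find (byte offset, width) of character k by scanning from the start."""
    i = 0
    for _ in range(k):
        if i >= len(buf):
            raise IndexError(k)
        i += _lead_width(buf[i])
    if i >= len(buf):
        raise IndexError(k)
    return i, _lead_width(buf[i])


def utf8_ref(buf: bytes, k: int) -> str:
    """string-ref on UTF-8 data: linear in k, because widths vary."""
    i, w = _utf8_span(buf, k)
    return buf[i:i + w].decode("utf-8")


def utf8_set(buf: bytes, k: int, ch: str) -> bytes:
    """string-set! on UTF-8 data: the byte length may change."""
    i, w = _utf8_span(buf, k)
    return buf[:i] + ch.encode("utf-8") + buf[i + w:]


def fixed16_ref(buf: bytes, k: int) -> str:
    """string-ref on fixed 16-bit units: one multiplication, no scan."""
    unit = buf[2 * k:2 * k + 2]
    if len(unit) != 2:
        raise IndexError(k)
    return chr(int.from_bytes(unit, "big"))


text = "naïve"
as_utf8 = text.encode("utf-8")         # 6 bytes: "ï" takes two
as_fixed16 = text.encode("utf-16-be")  # 10 bytes: two per character

print(utf8_ref(as_utf8, 3), fixed16_ref(as_fixed16, 3))
print(len(as_utf8), len(utf8_set(as_utf8, 1, "日")))
```

In the UTF-8 case, replacing the one-byte "a" with a three-byte character grows the buffer, which is the sense in which string-set! might change a string's length; the fixed-width buffer never has that problem, at the cost of two bytes for every character.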
What I'm most interested in is your advice regarding character sets and (externally visible) text representations. How would you recommend we go about supporting wide character sets? What do you think of Unicode?