This is the mail archive of the docbook-apps@lists.oasis-open.org mailing list.



Character encoding problems in text files included with Saxon extensions


Hi,

Newer releases of the XSL stylesheets contain extensions for Saxon that
can include external text files, generate callouts, and so on.

I use the external file inclusion mechanism heavily, because my
documents contain a lot of examples in XML, XSLT, and other languages
that are modified or updated quite often. The inclusion feature has been
available in the DSSSL stylesheets for a long time. Before Norm
published the Saxon extensions, I used my own extension for XT. When I
switched to the Saxon extension, I found one problem with the current
implementation.

The file reading extension (class Text.java) is able to read files in
UTF-8 because it uses DataInputStream. If your included files are in
UTF-8 or ASCII, everything works fine. I must use accented characters in
my documents, and it is convenient for me to use single-byte encodings
like iso-8859-2 and windows-1250 rather than UTF-8. The current
implementation misinterprets non-ASCII characters in such files, because
their codes differ between single-byte encodings and UTF-8.
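
To illustrate the mismatch described above, here is a minimal Java
sketch. It is not taken from Text.java; the class name
EncodingMismatchDemo and the helper reinterpret are hypothetical and
only show what happens when bytes written in one encoding are decoded
with another.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {

    /** Encodes text with one charset, then decodes the same bytes with another. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        Charset latin2 = Charset.forName("ISO-8859-2");
        // Czech "cau" with a caron on the c; U+010D is a single byte in iso-8859-2.
        String original = "\u010Dau";

        // Same charset on both sides: the text survives the round trip.
        System.out.println(original.equals(reinterpret(original, latin2, latin2)));

        // Written as iso-8859-2 but decoded as UTF-8: the accented letter is lost.
        System.out.println(
            original.equals(reinterpret(original, latin2, StandardCharsets.UTF_8)));
    }
}
```

Plain ASCII text would pass both checks, which is why the problem only
shows up once accented characters are involved.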

From my point of view, most users store their files in the system
default encoding, so it would be more appropriate to assume that included
files are in this system encoding rather than in UTF-8. I would like to
know in what encoding most DocBook users store their externally
included files. If the majority of them use the system encoding (this is
probably ASCII or ISO Latin 1 for most English-speaking authors) rather
than UTF-8, it would be useful to use InputStreamReader instead of
DataInputStream. InputStreamReader automatically converts the content of
a file from the system encoding to Java Unicode characters.
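
As a rough sketch of the InputStreamReader approach (not the actual
code of the extension), reading could look like the following. The
class IncludedTextReader and its methods readAll and readAllDefault are
hypothetical names chosen for illustration. Note that what Java treats
as the default charset depends on the JVM; recent JDKs (18 and later)
default to UTF-8 regardless of the system locale unless configured
otherwise.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

class IncludedTextReader {

    /** Reads the whole stream, decoding its bytes with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        // InputStreamReader performs the byte-to-character conversion.
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    /** Reads the whole stream using the JVM default charset. */
    static String readAllDefault(InputStream in) throws IOException {
        return readAll(in, Charset.defaultCharset());
    }
}
```

A caller would pass something like new FileInputStream(file) for the
included file; the stream is closed when reading finishes.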

In addition to using the system encoding by default, we could provide
some mechanism for specifying the encoding of an included file.
InputStreamReader can convert files from many encodings, so adding some
attribute, notation, or parameter to the DocBook source would be quite
easy. E.g.

<inlinegraphic format="linespecific"
fileref="example_with_russian_comments.java;charset=iso-8859-5"/>

or

<inlinegraphic format="linespecific"
fileref="example_with_russian_comments.java" role="charset=iso-8859-5"/>

or

<inlinegraphic format="linespecific;charset=iso-8859-5"
fileref="example_with_russian_comments.java"/>
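
To show that the first variant would be cheap to support, here is a
minimal sketch of splitting such a fileref value into a path and a
charset. The ";charset=" suffix is only the syntax proposed above, not
an existing DocBook or stylesheet feature, and the class FilerefCharset
is a hypothetical name.

```java
import java.nio.charset.Charset;

class FilerefCharset {
    private static final String MARKER = ";charset=";

    final String path;
    final String charsetName; // null when the fileref names no charset

    private FilerefCharset(String path, String charsetName) {
        this.path = path;
        this.charsetName = charsetName;
    }

    /** Splits "file;charset=name" into its two parts. */
    static FilerefCharset parse(String fileref) {
        int at = fileref.lastIndexOf(MARKER);
        if (at < 0) {
            return new FilerefCharset(fileref, null);
        }
        return new FilerefCharset(
            fileref.substring(0, at),
            fileref.substring(at + MARKER.length()));
    }

    /** Falls back to the JVM default charset when none was given. */
    Charset charset() {
        // Charset.forName throws an unchecked exception for unknown names.
        return charsetName == null
            ? Charset.defaultCharset()
            : Charset.forName(charsetName);
    }
}
```

The resulting Charset could then be handed to InputStreamReader when
the file at the parsed path is opened.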

For now, I'm using a modified version of Norm's extension. Of course, it
would be more convenient for me to use a standard version that can deal
with files in encodings other than UTF-8.

So what is your opinion?

					Jirka

-----------------------------------------------------------------
  Jirka Kosek  	                     
  e-mail: jirka@kosek.cz
  http://www.kosek.cz
