This is the mail archive of the xsl-list@mulberrytech.com mailing list.



Re: National Language Collating Sequences and Index Generation


"W. Eliot Kimber" <eliot@isogen.com> wrote:
> I have to generate back-of-the-book indexes for many national languages,
> including Arabic, Hebrew, Thai, Simplified Chinese, Traditional Chinese,
> Korean, and Japanese. I've successfully adapted the Docbook index
> generation code to produce the basic index, but now I'm faced with the
> challenge of both doing correct sorting for these languages and
> generating the appropriate index groups.

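That's an interesting topic, and a real problem: already acknowledged,
but in general not quite solved.
In XSLT 1.0, xsl:sort is specified to sort text in the culturally
correct manner for the language given by its lang attribute, but how far
that goes is up to the processor, and some just sort by something close
to Unicode code point order, IIRC. Localized sorting by a single
character should also be relatively easy to implement yourself if you
can get hold of the collating sequence: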

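  <xsl:stylesheet ...
     xmlns:coll="my.collating.sequence">
  <!-- collating table: one entry per character, number = sort weight -->
  <coll:sequence>
    <char char="A" number="1"/>
    <char char="B" number="2"/>
   ...
  </coll:sequence>
  <xsl:variable name="collseq" select="document('')/*/coll:sequence"/>
  ...
    <xsl:for-each select="$items">
      <!-- sort numerically on the weight of the first character of name -->
      <xsl:sort data-type="number" select="$collseq/char[@char=substring(current()/name,1,1)]/@number"/>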

You can try to add
      <xsl:sort data-type="number" select="$collseq/char[@char=substring(current()/name,2,1)]/@number"/>
and so on for more complete lexical sorting.
It may be useful that the sorting keys can be fractional numbers
(as long as xsl:sort has data-type="number"), so a character can be
slotted in between two existing ones:
    <char char="A" number="1"/>
    <char char="&#xC4;" number="1.1"/> <!-- A with diaeresis -->
    <char char="a" number="1.5"/>
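
Put together, the pieces above might look like the following XSLT 1.0
stylesheet. This is only a sketch under assumptions: the input shape
(index/item/name) and the five weights in the table are made up for
illustration, and document('') only works if the processor loaded the
stylesheet from a file or URI.

  <?xml version="1.0" encoding="UTF-8"?>
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:coll="my.collating.sequence"
      exclude-result-prefixes="coll">

    <xsl:output method="text"/>

    <!-- Hypothetical collating table; number is the sort weight. -->
    <coll:sequence>
      <char char="A" number="1"/>
      <char char="&#xC4;" number="1.1"/>
      <char char="a" number="1.5"/>
      <char char="B" number="2"/>
      <char char="b" number="2.5"/>
    </coll:sequence>

    <!-- All table entries, read back from the stylesheet itself. -->
    <xsl:variable name="collseq"
                  select="document('')/*/coll:sequence/char"/>

    <!-- Assumed input: <index><item><name>Ab</name></item>...</index> -->
    <xsl:template match="/index">
      <xsl:for-each select="item">
        <!-- primary key: weight of the first character of name -->
        <xsl:sort data-type="number"
                  select="$collseq[@char=substring(current()/name,1,1)]/@number"/>
        <!-- secondary key: weight of the second character of name -->
        <xsl:sort data-type="number"
                  select="$collseq[@char=substring(current()/name,2,1)]/@number"/>
        <xsl:value-of select="name"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
    </xsl:template>
  </xsl:stylesheet>

Given items named Ab, B, ab, &#xC4;b and Aa, this should list them as
Aa, Ab, &#xC4;b, ab, B. A character missing from the table yields an
empty key, which a numeric sort treats as NaN.
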
The caveats are that you had better have a complete collating sequence,
and that you shouldn't expect great performance, especially if you
add a lot of sort clauses. You may also run afoul of unexpected
character normalisation issues: users could expect that &#xE4;
(precomposed) and &#x61;&#x0308; (a plus combining diaeresis) are
interchangeable (at least I think so).

In XSLT/XPath 2.0, you can have named collating sequences (a collation
URI, e.g. in the collation attribute of xsl:sort), but you shouldn't
expect that the ones you need are provided by the runtime
system :-((((
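
For what it's worth, a 2.0 sort might be sketched as below. The input
shape (index/item/name) is again an assumption. The collation URI shown
is the Unicode codepoint collation, which as far as I know is the only
one every 2.0 processor has to recognise; for real language-specific
ordering you would have to swap in whatever URI your processor
documents. normalize-unicode() is used on the sort key so that
precomposed and decomposed spellings of the same character compare
equal.

  <?xml version="1.0" encoding="UTF-8"?>
  <xsl:stylesheet version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="text"/>

    <!-- Assumed input: <index><item><name>...</name></item>...</index> -->
    <xsl:template match="/index">
      <xsl:for-each select="item">
        <!-- Key is NFC-normalised; replace the collation URI with a
             processor-specific one for language-aware ordering. -->
        <xsl:sort select="normalize-unicode(name, 'NFC')"
                  collation="http://www.w3.org/2005/xpath-functions/collation/codepoint"/>
        <xsl:value-of select="name"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
    </xsl:template>
  </xsl:stylesheet>
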

HTH
J.Pietschmann

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

