This is the mail archive of the xsl-list@mulberrytech.com mailing list .



Finding unique nodes in a non-sibling nodeset


In a code generation transform that I am working on, I frequently encounter
situations where I need to eliminate duplicate expressions or event calls.
The nodes with the commonality to be detected are often scattered around
different parts of a large (preprocessed) reference document that is loaded
with a document call.

Previously, I had eliminated duplicates with something of the form
 $list[not(@key1=preceding-sibling::*/@key1)]
or
 $list[not(@key1=preceding::*/@key1)]
... if I wanted to look back through the whole document.

In this situation, however, the nodes to be duplicate-trimmed

[A] are selected out of the reference document in very specific contextual
    ways (e.g. deep inside xsl:template / xsl:for-each usages);
[B] are not all siblings of one another;
[C] cannot be compared via the preceding axis, since that axis looks at the
    whole preceding area of the document, not just my carefully selected nodes;
[D] count as duplicates only when several attributes match, i.e. the
    comparison needs a composite key.

Even if [D] were not true, the "preceding-sibling" axis approach would not
work because of [B] and the "preceding" axis approach would not work
because of [C].

I eventually hit on a way to solve this (since I use Saxon) using
saxon:tokenize. But I always wondered if there was a non-extension
way to do it.

What I did was build an aggregate string with delimiters from the nodes
in the set in question (in a variable called "$list"), like so ...

  <xsl:variable name="aggregate">
    <xsl:for-each select="$list">
      <xsl:value-of select="concat(@key1,'/',@key2)" />
      <xsl:if test="not(position()=last())"><xsl:text>#</xsl:text></xsl:if>
    </xsl:for-each>
  </xsl:variable>

Then use tokenize to get a node set ...

 <xsl:variable name="list4" select="saxon:tokenize($aggregate,'#')"/>

And eliminate the duplicates the standard (?) way with

 <xsl:variable name="list4NoDups" select="$list4[not(.=preceding-sibling::*)]"/>

I'm then able to process the node subset I was trying to get since I have the
keys embedded in the strings in the resultant node-set.

All was well until my colleague decided to try out Saxon 7.1, which (it turns out)
changes the behavior of saxon:tokenize(). In that version, the node-set comes back in
such a way that you can't use the "preceding-sibling" axis on it, so the
de-duplication step above no longer works.

There are features in Saxon 7.1 that we are very interested in, so I needed
to try to find a different technique.

It turns out that the following has exactly the desired effect (in one line!!)

  <xsl:variable name="listNoDups"
                select="saxon:distinct($list, saxon:expression('concat(@key1,@key2)'))"/>

and I could have done that all along.
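
One caveat about the key expression: concat(@key1,@key2) has no separator, so
key1="AB", key2="C" and key1="A", key2="BC" would both produce "ABC" and be
treated as duplicates. The sample data attached below does not trigger this, but
a separator guards against it. A minimal sketch, assuming the Saxon 6.x extension
namespace used in the attached stylesheet and assuming "|" never occurs in
either attribute value:

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:saxon="http://icl.com/saxon"
                version="1.0">

  <xsl:output method="text"/>

  <xsl:template match="document">
    <!-- Same contextual selection as in the attached test transform -->
    <xsl:variable name="list" select="item[@flavour='sour']/fact"/>

    <!-- The "|" separator (an assumption: it must not appear in the keys)
         keeps AB+C and A+BC from collapsing into the same composite key.
         &quot; supplies the inner string quotes for the nested expression. -->
    <xsl:for-each select="saxon:distinct($list,
        saxon:expression('concat(@key1,&quot;|&quot;,@key2)'))">
      <xsl:value-of select="concat(@key1,'/',@key2)"/>
      <xsl:text>&#xA;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The same separator would apply to the use attribute of an xsl:key built on these
two attributes.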

However, I still wondered if there was a way of doing this without extensions.
So I put the problem to my good friend Chris Maden (yes, *the* Chris Maden)
... but not in as much detail as I have given here.

Chris said "Muenchian Keys!!"

I hadn't yet used that technique anywhere (though I had heard it mentioned a lot),
so I decided to give it a whirl.

Well, it does solve the problem, but with a restriction that makes it
unusable for me.

I set up my key like so:
  <xsl:key name="Key1Key2" match="item[@flavour='sour']/fact" use="concat(@key1,@key2)"/>

Then used:
  <xsl:variable name="uniqueKey1Key2forFlavour"
        select="$list[generate-id()=generate-id(key('Key1Key2',concat(@key1,@key2)))]"/>

Which does the trick, but I can't use it since xsl:key is a top-level element
and I have situation [A] to deal with.

So, my questions are ...
 [1] Is there a non-extension, non-xsl:key way of doing this?
 [2] If not, is there a better way than the saxon:distinct approach?
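
A possible starting point for [1], offered as a sketch rather than a settled
answer: iterate over $list and, for each node, check whether any node earlier in
$list itself (not earlier in the document) carries the same composite key. This
uses only XSLT 1.0, no xsl:key and no extensions. The variable names "pos" and
"k" and the "/" separator are illustrative assumptions.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="text"/>

  <xsl:template match="document">
    <!-- Any contextual selection could go here; this one mirrors the
         attached test transform -->
    <xsl:variable name="list" select="item[@flavour='sour']/fact"/>

    <xsl:for-each select="$list">
      <!-- Position of the current node within $list (document order) -->
      <xsl:variable name="pos" select="position()"/>
      <!-- Composite key of the current node -->
      <xsl:variable name="k" select="concat(@key1,'/',@key2)"/>

      <!-- Keep the node only if no earlier member of $list has the same key.
           The first predicate restricts the search to $list, which is what
           the preceding and preceding-sibling axes cannot do. -->
      <xsl:if test="not($list[position() &lt; $pos]
                             [concat(@key1,'/',@key2) = $k])">
        <xsl:value-of select="$k"/>
        <xsl:text>&#xA;</xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Against the attached input.xml this prints XX/CC, XX/BB and YY/BB, one per line.
The trade-offs are that the work is quadratic in the size of $list, and that the
unique nodes are processed inside the loop rather than captured as a node-set
variable.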

Thanks for bearing with me :-)

I have attached my current test data, test transform and output since
it may help to clarify what I'm trying to do.

-- Mike Berrow

==========  input.xml  ==============
<document>
  <item flavour="sweet" >
    <fact key1="AA" key2="BB" val="11"/>
    <fact key1="XX" key2="CC" val="22"/>
    <fact key1="AA" key2="BB" val="33"/>
  </item>
  <item flavour="sour" >
    <fact key1="XX" key2="CC" val="11"/>
    <fact key1="XX" key2="BB" val="33"/>
    <fact key1="YY" key2="BB" val="22"/>
  </item>
  <item flavour="sweet" >
    <fact key1="XX" key2="CC" val="33"/>
    <fact key1="XX" key2="BB" val="22"/>
    <fact key1="AA" key2="BB" val="11"/>
  </item>
  <item flavour="sour" >
    <fact key1="YY" key2="BB" val="33"/>
    <fact key1="XX" key2="CC" val="11"/>
    <fact key1="YY" key2="BB" val="22"/>
  </item>
</document>


==========  dupElim.xsl  ==============
<?xml version="1.0"?>
<xsl:stylesheet
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:saxon="http://icl.com/saxon"
            version="1.0">

<!-- Finding unique nodes in a non-sibling nodeset... by Mike Berrow -->
<xsl:output method="xml"/>
<xsl:key name="Key1Key2" match="item[@flavour='sour']/fact" use="concat(@key1,@key2)"/>

<xsl:template match="document">
  <!-- Select nodes of interest -->
  <xsl:variable name="list" select="item[@flavour='sour']/fact"/>

  <!-- Single value, attempt 1 -->
  <xsl:comment>For $list[not(@key1=preceding-sibling::*/@key1)]</xsl:comment>
  <xsl:text>&#xA;&#x9;</xsl:text><xsl:comment>We get ...</xsl:comment>
  <xsl:variable name="list1NoDups" select="$list[not(@key1=preceding-sibling::*/@key1)]"/>
  <xsl:for-each select="$list1NoDups">
    <xsl:text>&#xA;&#x9;</xsl:text>
    <xsl:value-of select="concat(@key1,'/',@key2)" />
  </xsl:for-each>
  <xsl:text>&#xA;&#x9;</xsl:text>
  <xsl:comment>Not desired: 'preceding-sibling' can't see 'preceding cousin'</xsl:comment><xsl:text>&#xA;&#xA;</xsl:text>

  <!-- Single value, attempt 2 -->
  <xsl:comment>For $list[not(@key1=preceding::*/@key1)]</xsl:comment>
  <xsl:text>&#xA;&#x9;</xsl:text><xsl:comment>We get ...</xsl:comment>
  <xsl:variable name="list2NoDups" select="$list[not(@key1=preceding::*/@key1)]"/>
  <xsl:for-each select="$list2NoDups">
    <xsl:text>&#xA;&#x9;</xsl:text>
    <xsl:value-of select="concat(@key1,'/',@key2)" />
  </xsl:for-each>
  <xsl:text>&#xA;&#x9;</xsl:text>
  <xsl:comment>Not desired: 'preceding' looks at the whole doc</xsl:comment><xsl:text>&#xA;&#xA;</xsl:text>

  <!-- Try Multi-value -->
  <xsl:comment>For $list[not(concat(@key1,@key2)=concat(preceding::*/@key1,preceding::*/@key2))]</xsl:comment>
  <xsl:text>&#xA;&#x9;</xsl:text><xsl:comment>We get ...</xsl:comment>
  <xsl:variable name="list3NoDups" select="$list[not(concat(@key1,@key2)=concat(preceding::*/@key1,preceding::*/@key2))]"/>
  <xsl:for-each select="$list3NoDups">
    <xsl:text>&#xA;&#x9;</xsl:text>
    <xsl:value-of select="concat(@key1,'/',@key2)" />
  </xsl:for-each>
  <xsl:text>&#xA;&#x9;</xsl:text>
  <xsl:comment>Not desired: result of a naive composite key attempt</xsl:comment><xsl:text>&#xA;&#xA;</xsl:text>

  <!-- Multi-value using saxon:tokenize -->
  <xsl:comment>Using aggregation, saxon:tokenize then 'not(.=preceding-sibling::*)'</xsl:comment>
  <xsl:variable name="aggregate">
    <xsl:for-each select="$list">
      <xsl:value-of select="concat(@key1,'/',@key2)" />
      <xsl:if test="not(position()=last())"><xsl:text>#</xsl:text></xsl:if>
    </xsl:for-each>
  </xsl:variable>
  <xsl:variable name="list4" select="saxon:tokenize($aggregate,'#')"/>
  <xsl:variable name="list4NoDups" select="$list4[not(.=preceding-sibling::*)]"/>
  <xsl:for-each select="$list4NoDups">
    <xsl:text>&#xA;&#x9;</xsl:text>
    <xsl:value-of select="." />
  </xsl:for-each>
  <xsl:text>&#xA;&#x9;</xsl:text>
  <xsl:comment>Which is the desired result</xsl:comment><xsl:text>&#xA;&#xA;</xsl:text>

  <!-- Multi-value using saxon:distinct -->
  <xsl:comment>saxon:distinct($list, saxon:expression('concat(@key1,@key2)')</xsl:comment>
  <xsl:for-each select="saxon:distinct($list, saxon:expression('concat(@key1,@key2)'))">
    <xsl:text>&#xA;&#x9;</xsl:text>
    <xsl:value-of select="concat(@key1,'/',@key2)" />
  </xsl:for-each>
  <xsl:text>&#xA;&#x9;</xsl:text>
  <xsl:comment>Which is tighter code than using tokenize</xsl:comment><xsl:text>&#xA;&#xA;</xsl:text>

  <!-- Multi-value using Muenchian -->
  <xsl:comment>Using <xsl:text>&lt;xsl:key name="Key1Key2" match="item[@flavour='sour']/fact"
use="concat(@key1,@key2)"/&gt;</xsl:text>
    and select="$list[generate-id(.)=generate-id(key('Key1Key2',concat(@key1,@key2)))]"</xsl:comment>
  <xsl:variable name="uniqueKey1Key2forFlavour"
        select="$list[generate-id()=generate-id(key('Key1Key2',concat(@key1,@key2)))]"/>
  <xsl:for-each select="$uniqueKey1Key2forFlavour">
    <xsl:text>&#xA;&#x9;</xsl:text>
    <xsl:value-of select="concat(@key1,'/',@key2)" />
  </xsl:for-each>
  <xsl:text>&#xA;&#x9;</xsl:text>
  <xsl:comment>Which is the Muenchian approach, but since xsl:key is a top level element, this
      will not help when nodesets need to be calculated in specific, non-whole-document
contexts</xsl:comment><xsl:text>&#xA;&#xA;</xsl:text>

</xsl:template>

</xsl:stylesheet>


==========  minSet.xml  ==============
<?xml version="1.0" encoding="utf-8"?>
<!--For $list[not(@key1=preceding-sibling::*/@key1)]-->
 <!--We get ...-->
 XX/CC
 YY/BB
 YY/BB
 XX/CC
 <!--Not desired: 'preceding-sibling' can't see 'preceding cousin'-->

<!--For $list[not(@key1=preceding::*/@key1)]-->
 <!--We get ...-->
 YY/BB
 <!--Not desired: 'preceding' looks at the whole doc-->

<!--For $list[not(concat(@key1,@key2)=concat(preceding::*/@key1,preceding::*/@key2))]-->
 <!--We get ...-->
 XX/CC
 XX/BB
 YY/BB
 YY/BB
 XX/CC
 YY/BB
 <!--Not desired: result of a naive composite key attempt-->

<!--Using aggregation, saxon:tokenize then 'not(.=preceding-sibling::*)'-->
 XX/CC
 XX/BB
 YY/BB
 <!--Which is the desired result-->

<!--saxon:distinct($list, saxon:expression('concat(@key1,@key2)')-->
 XX/CC
 XX/BB
 YY/BB
 <!--Which is tighter code than using tokenize-->

<!--Using <xsl:key name="Key1Key2" match="item[@flavour='sour']/fact" use="concat(@key1,@key2)"/>
  and select="$list[generate-id(.)=generate-id(key('Key1Key2',concat(@key1,@key2)))]"-->
 XX/CC
 XX/BB
 YY/BB
 <!--Which is the Muenchian approach, but since xsl:key is a top level element, this
   will not help when nodesets need to be calculated in specific, non-whole-document contexts-->




 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

