This is the mail archive of the xsl-list@mulberrytech.com mailing list.
Re: stylesheet vs egrep
- From: Ahmad J Reeves <ahmad at dcs dot qmul dot ac dot uk>
- To: xsl-list at lists dot mulberrytech dot com
- Date: Fri, 25 Jan 2002 14:07:05 +0000
- Subject: Re: [xsl] stylesheet vs egrep
- Organization: Dept of Computer Science, QMW
- References: <3C514315.63FA0EC9@dcs.qmul.ac.uk> <fli25u8501h0utpiu8g8pd0li56t8839t2@4ax.com>
- Reply-to: xsl-list at lists dot mulberrytech dot com
Hi Trevor,
First, many thanks for your reply. The files I am processing
are 20 MB each, by the way.
I tried the stylesheet and it gave me 28,792 unsorted and
163 sorted, which was the same as my last stylesheet and
still not the 254 given to me by egrep. Maybe my egrep command
egrep "<CHARACTER_ID> [0-9]{3,6} </CHARACTER_ID>" 1.xml |sort -u | wc -l
is doing something strange? Here are the first 20 lines it gives:
<CHARACTER_ID> 10946 </CHARACTER_ID>
<CHARACTER_ID> 11084 </CHARACTER_ID>
<CHARACTER_ID> 11116 </CHARACTER_ID>
<CHARACTER_ID> 11311 </CHARACTER_ID>
<CHARACTER_ID> 11457 </CHARACTER_ID>
<CHARACTER_ID> 12284 </CHARACTER_ID>
<CHARACTER_ID> 12426 </CHARACTER_ID>
<CHARACTER_ID> 12597 </CHARACTER_ID>
<CHARACTER_ID> 12969 </CHARACTER_ID>
<CHARACTER_ID> 13172 </CHARACTER_ID>
<CHARACTER_ID> 13680 </CHARACTER_ID>
<CHARACTER_ID> 13685 </CHARACTER_ID>
<CHARACTER_ID> 14371 </CHARACTER_ID>
<CHARACTER_ID> 16142 </CHARACTER_ID>
<CHARACTER_ID> 16783 </CHARACTER_ID>
<CHARACTER_ID> 16851 </CHARACTER_ID>
<CHARACTER_ID> 17443 </CHARACTER_ID>
<CHARACTER_ID> 17583 </CHARACTER_ID>
<CHARACTER_ID> 17933 </CHARACTER_ID>
<CHARACTER_ID> 17958 </CHARACTER_ID>
And the first 20 from your stylesheet:
10010
10347
10904
10946
11084
11116
11237
11311
11457
12284
12426
12597
12599
12969
13172
13680
13685
14211
14371
14791
So the stylesheet output contains numbers that egrep is missing,
e.g. the top 3, and yet the stylesheet still produces fewer
unique values overall (163 against 254)!?
A mystery.
Anyone?
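One thing worth ruling out (an assumption, not something confirmed
in this thread): sort -u on raw egrep output compares whole lines, so
the same ID with different indentation or other text on the line
counts as "unique" more than once, while the fixed single spaces and
the {3,6} limit in the pattern can skip IDs that are formatted
differently. The sketch below compares the numbers themselves. The
function names extract_ids and count_unique_ids are made up for
illustration, and grep -o is assumed to be available (it is in GNU
and BSD grep, but is not required by POSIX).

```shell
# Hypothetical helper: print the distinct CHARACTER_ID values found in
# the XML file given as $1, one per line, sorted.
extract_ids() {
    # -o prints only the matched element, dropping indentation and any
    # other text on the same line; [[:space:]]* tolerates zero or more
    # blanks around the number, and [0-9]+ accepts any digit count.
    grep -Eo '<CHARACTER_ID>[[:space:]]*[0-9]+[[:space:]]*</CHARACTER_ID>' "$1" |
        # Keep only digits and newlines, so spacing differences cannot
        # turn one ID into several distinct lines.
        tr -cd '0-9\n' |
        sort -u
}

# Hypothetical helper: count the distinct CHARACTER_ID values in $1.
count_unique_ids() {
    # tr strips the padding that some wc implementations add.
    extract_ids "$1" | wc -l | tr -d ' '
}
```

Running count_unique_ids 1.xml gives a figure that can be set against
the stylesheet's sorted count. To see exactly which IDs differ, the
output of extract_ids 1.xml and the stylesheet output (passed through
sort -u) can be saved to two files and compared with comm. Note that
grep works line by line, so an element split across two lines would
still be missed by this sketch.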
Ahmad
Trevor Nash wrote:
>
> On Fri, 25 Jan 2002 11:35:49 +0000, Ahmad J Reeves wrote:
>
> >Hi there,
> >
> >I have xml files that contain 4 types of tags,
> >direct,local,global and admin in varying numbers
>
> >I need to get a list of all the character_id's, and then
> >remove the duplicates and count them. With the following
> >stylesheet,
> >[snip]
> >Is it my stylesheet that's lying, or my egrep ?
> >
> The stylesheet, because you are forgetting the built-in templates.
> This means two things:
> 1. the default is to copy text nodes to the output: some of these are
> numbers, hence the strange results.
> 2. you are doing much more work than is necessary, since most of your
> templates are just visiting children, which is what the default does
> anyway.
>
> Try this:
> <xsl:stylesheet
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> version="1.0">
> <xsl:output method="text"/>
>
> <xsl:variable name="NL" select="'&#10;'"/>
>
> <xsl:template match="CHARACTER_ID">
>
> <xsl:value-of select="."/>
> <xsl:value-of select="$NL"/>
>
> </xsl:template>
>
> <!-- throw away all text nodes -->
> <xsl:template match="text()" />
>
> </xsl:stylesheet>
>
> The only reason for putting other templates in would be to avoid
> traversing bits of the document where you know there are no
> CHARACTER_ID nodes, which might make the transform a bit faster.
> Unless the input document is huge this isn't likely to make much
> difference, and of course it makes it more prone to bugs.
>
> Regards
> Trevor Nash
> --
> Traditional training & distance learning,
> Consultancy by email
>
> Melvaig Software Engineering Limited
> voice: +44 (0) 1445 771 271
> email: tcn@melvaig.co.uk
>
> XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list