Home. 
.

transparent

transparent

transparent

Altova Mailing List Archives


Re: Most performant way to sort large files

From: "Dimitre Novatchev" <dimitren@---.---.-->
To: NULL
Date: 8/2/2007 8:55:00 PM


"jeh" <jeh@d...> wrote in message 
news:A9A37A0D-65D9-4104-9C96-F210A4BCD81A@m......
> Hey,
>
> Thanks! That's alot faster. There are however two things that I would like
> to follow up on here.
>
> 1st: The files differ greatly in size, before sort it's about 15MB and 
> after
> it's roughly 7, what happened seems to be it removing all linebreaks,
> whitespace and such. What would I have to do to have (at least) 
> linebreaks? I
> don't really need indention, but that would be nice as well. Linebreaks
> however is causing me to not be able to opening it in for example Visual
> Studios Xml editor, since it's too much on one line.


Use the
    <xsl:output .../>

directive. Specify

  indent="yes"

>
> 2nd: When researching the fastest sort algorithm I got the impression that
> using keys would be way faster, why isnt it, and when would I introduce 
> keys
> as opposed to the simple sort statement used here? Is it when complexity 
> grow
> to more then one sort statement, or when I introduce grouping (two levels 
> of
> sorting)? Or is it just an "old" approach?

Keys are useful for getting quick access to a *subset* of nodes in an xml 
file (typically when the nodes should be accessed more than once)  -- such 
as grouping or implementing a "link" between two nodes.

Sort is O(N*log(N)) and cannot be made significantly faster if one of the 
fastest sorting algorithms is already used.


Cheers,
Dimitre Novatchev


>
> Thanks,
> John
>
> "Dimitre Novatchev" wrote:
>
>> Usually the simplest way is the fastest. Using this code is 10 times 
>> faster
>> (1.6sec vs 16 sec) when transformed with .NET XSLCompiledTransform and 2
>> times faster (16se vs 32 sec) when transformed with MSXML4:
>>
>>
>>     <Level1>
>>       <xsl:for-each select="Level2">
>>         <xsl:sort select="@UseThisToSort" data-type="text" 
>> order="ascending"
>> />
>>             <xsl:copy-of select="." />
>>       </xsl:for-each>
>>     </Level1>
>>
>>
>> Cheers,
>> Dimitre Novatchev
>>
>> "jeh" <jeh@d...> wrote in message
>> news:7BC89F8E-D8E3-4327-8A8B-88E9C83CE90C@m......
>> > Hi all,
>> >
>> > I have been testing a few xsl sorting statements - I want to output the
>> > same
>> > file, as I input, no direct transformations, just sorted. My files are
>> > (can
>> > be) large.. somewhere in the neighbourhood of 12MB (and sometimes 
>> > more) -
>> > which makes them somewhere around 50k rows depending on the complexity 
>> > of
>> > the
>> > different documents.
>> >
>> > The issue that I am having is that my xsl statement is fast on small
>> > documents but not so much so when they grow larger - a 12MB document 
>> > takes
>> > about 10 minutes in the test case presented below, and even more so 
>> > when I
>> > have larger more complex actual structures. The sample below is 
>> > simplified
>> > to
>> > act as just that. The xsl is "real" though.
>> >
>> > What is the most performant way of sorting a document? What can be
>> > improved
>> > in my xsl below? With performant I mean it to be fast as well as
>> > preferably
>> > light on the memory. Below I use keys, but in my understanding that's
>> > quite
>> > memory intensive - what are the tradeoffs between different approaches?
>> >
>> > Oh, and I am looking to use this in BizTalk, so I'm first and foremost
>> > interested in 1.0 interpreter and .Net compatible solutions. Other
>> > solutions
>> > are however more then welcome - I'd be glad to get that perspective as
>> > well.
>> >
>> > TIA,
>> > John
>> >
>> > ---------
>> > My xsl:
>> > ---------
>> > <?xml version="1.0" encoding="utf-16"?>
>> > <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>> > xmlns:msxsl="urn:schemas-microsoft-com:xslt"
>> > xmlns:var="http://schemas.microsoft.com/BizTalk/2003/var"
>> > exclude-result-prefixes="msxsl var" version="1.0"
>> > xmlns:ns0="http://BizTalkXsltSortTest.SimpleSample.Schema1">
>> >  <xsl:output omit-xml-declaration="yes" method="xml" version="1.0" />
>> >  <xsl:key name="sort-key" match="Level2" use="@UseThisToSort" />
>> >  <xsl:template match="/">
>> >    <xsl:apply-templates select="/ns0:Root" />
>> >  </xsl:template>
>> >  <xsl:template match="/ns0:Root">
>> >    <ns0:Root>
>> >      <xsl:apply-templates select="Level1" />
>> >    </ns0:Root>
>> >  </xsl:template>
>> >  <xsl:template match="Level1">
>> >    <Level1>
>> >      <xsl:for-each select="Level2[count(. | key('sort-key',
>> > @UseThisToSort)[1]) = 1]">
>> >        <xsl:sort select="@UseThisToSort" data-type="text"
>> > order="ascending"
>> > />
>> >        <xsl:for-each select="key('sort-key', @UseThisToSort)">
>> >            <xsl:copy-of select="." />
>> >          </xsl:for-each>
>> >      </xsl:for-each>
>> >    </Level1>
>> >  </xsl:template>
>> > </xsl:stylesheet>
>> >
>> > ---------
>> > Sample xml:
>> > ---------
>> > <ns0:Root xmlns:ns0="http://BizTalkXsltSortTest.SimpleSample.Schema1">
>> >  <Level1>
>> >    <Level2 UseThisToSort="20010201" MiscData1="MiscData1_1"
>> > MiscData2="MiscData2_2" />
>> >    <Level2 UseThisToSort="20010101" MiscData1="MiscData1_1"
>> > MiscData2="MiscData2_2" />
>> >    <Level2 UseThisToSort="20010102" MiscData1="MiscData1_1"
>> > MiscData2="MiscData2_2" />
>> >    <Level2 UseThisToSort="20010101" MiscData1="MiscData1_1"
>> > MiscData2="MiscData2_2" />
>> >    <Level2 UseThisToSort="20010101" MiscData1="MiscData1_1"
>> > MiscData2="MiscData2_2" />
>> >    <Level2 UseThisToSort="20010201" MiscData1="MiscData1_1"
>> > MiscData2="MiscData2_2" />
>> >    <Level2 UseThisToSort="20010101" MiscData1="MiscData1_1"
>> > MiscData2="MiscData2_2" />
>> >  </Level1>
>> > </ns0:Root>
>> >
>> > ---------
>> > Schema:
>> > ---------
>> > <?xml version="1.0" encoding="utf-16"?>
>> > <xs:schema xmlns:b="http://schemas.microsoft.com/BizTalk/2003"
>> > xmlns="http://BizTalkXsltSortTest.SimpleSample.Schema1"
>> > targetNamespace="http://BizTalkXsltSortTest.SimpleSample.Schema1"
>> > xmlns:xs="http://www.w3.org/2001/XMLSchema">
>> >  <xs:element name="Root">
>> >    <xs:complexType>
>> >      <xs:sequence>
>> >        <xs:element minOccurs="0" maxOccurs="unbounded" name="Level1">
>> >          <xs:complexType>
>> >            <xs:sequence>
>> >              <xs:element minOccurs="0" maxOccurs="unbounded"
>> > name="Level2">
>> >                <xs:complexType>
>> >                  <xs:attribute name="UseThisToSort" type="xs:string" />
>> >                  <xs:attribute name="MiscData1" type="xs:string" />
>> >                  <xs:attribute name="MiscData2" type="xs:string" />
>> >                </xs:complexType>
>> >              </xs:element>
>> >            </xs:sequence>
>> >          </xs:complexType>
>> >        </xs:element>
>> >      </xs:sequence>
>> >    </xs:complexType>
>> >  </xs:element>
>> > </xs:schema>
>>
>>
>> 




transparent
Print
Mail
Digg
delicious
Disclaimer
.

These Archives are provided for informational purposes only and have been generated directly from the Altova mailing list archive system and are comprised of the lists set forth on www.altova.com/list/index.html. Therefore, Altova does not warrant or guarantee the accuracy, reliability, completeness, usefulness, non-infringement of intellectual property rights, or quality of any content on the Altova Mailing List Archive(s), regardless of who originates that content. You expressly understand and agree that you bear all risks associated with using or relying on that content. Altova will not be liable or responsible in any way for any content posted including, but not limited to, any errors or omissions in content, or for any losses or damage of any kind incurred as a result of the use of or reliance on any content. This disclaimer and limitation on liability is in addition to the disclaimers and limitations contained in the Website Terms of Use and elsewhere on the site.

.
.

transparent

transparent