Home. 
.

transparent

transparent

transparent

Altova Mailing List Archives


Re: MSXML - LINEAR cost taking a SINGLE step of IXMLDomNodeList enumerator??

From: "Anthony Jones" <Ant@------------.--->
To: NULL
Date: 12/5/2007 7:53:00 PM


"lab27" <lee.benfield@g...> wrote in message
news:795f1bf0-544a-481f-ba38-2649a6f67728@p......
> When walking a large DomNodeList using the IEnumVariant interface
> (I've included VB code below, but obviously it's the same in C++ using
> COM enumerators), the cost of going to the next element seems to be
> proportional to the TOTAL length of the list!! Is this right?
>
> I haven't found this mentioned anywhere, maybe I'm just not googling
> right ;)
>
> i.e. walking the first 1000 nodes of a 2000 node list is massively
> cheaper than walking the first 1000 nodes of a 20000 node list.  - in
> practice this means if you're traversing the whole list your cost is
> n^2, which on the large lists I'm dealing with is causing a bit of
> pain....
>
> This is pretty confusing -  it doesn't look like the list is being
> generated lazily (and even if it was I'd not expect linear behaviour
> like this for accessing a prefix) - it almost looks like it's re-
> evaluating the predicate each iteration step! ;)
>
> Anyone know anything about this?  Behaviour doesn't seem to differ
> between msxml4 / 6...
>
> Thanks,
>
> Lee.
>
> -----8<----------  sample VBA to demo, needs ref to msxml4/6.
>
> Private Declare Function QueryPerformanceCounter& Lib
> "kernel32.dll" (l As pbdw)
> Private Declare Function QueryPerformanceFrequency& Lib
> "kernel32.dll" (l As pbdw)
>
> Private Type pbdw
>     bottom As Long
>     top As Long
> End Type
>
> Public Function tickspers() As Double
>     Static l As Long
>     If l = 0 Then
>         Dim ll As pbdw
>         Call QueryPerformanceFrequency(ll)
>         l = ll.bottom
>     End If
>     ticksperms = l
> End Function
>
> Public Function ticks() As Double
>     Dim ll As pbdw
>     Call QueryPerformanceCounter(ll)
>     ticks = ll.bottom
> End Function
>
> Private Sub add_node(ByRef s As String)
>     s = s & "<node><a>1</a><b>2</b></node>"
> End Sub
>
> Private Sub check_timing(ByVal s As String)
>     s = s & "</doc>"
>
>     Dim d As DOMDocument
>     Set d = New DOMDocument40 ' or domdocument60,
>     If Not d.loadXML(s) Then Call Err.Raise(10101, , "Didn't load")
>
>     Dim n As IXMLDOMNode
>     Set n = d.documentElement
>     Dim dn As IXMLDOMNodeList
>
>     Set dn = n.selectNodes("node")
>     If dn.Length < 1000 Then Call Err.Raise(10102, , "Odd...")
>
>     Dim dstart As Double, dstop As Double
>     dstart = ticks()
>
>     Dim i As Long
>     i = 0
>     Dim count As Long
>     For Each n In dn
>         i = i + 1
>         count = count + Len(n.Text)
>         If (i = 1000) Then
>             GoTo breakout
>         End If
>     Next n
> breakout:
>     dstop = ticks()
>     Debug.Print Join(Array(dn.Length, (dstop - dstart) / tickspers,
> count), " , ")
> End Sub
>
> Public Sub test()
>     Dim doc As String
>     doc = "<doc>"
>     Dim l As Long
>     For l = 1 To 1000
>         Call add_node(doc)
>     Next l
>
>     Dim step_ As Long
>     step_ = 1000
>
>     Dim stepstr As String
>     Dim x As Long
>     For x = 1 To step_
>         Call add_node(stepstr)
>     Next x
>
>     For l = 1000 To 40000 Step step_
>         Call check_timing(doc)
>         doc = doc & stepstr
>     Next l
> End Sub
>
>


The culprit is this line in your check_timing routine:-

         count = count + Len(n.Text)

Specifically the use of n.Text.  It also isn't related to the number of
nodes selected try this change:-

    Set dn = n.selectNodes("node[position() < 1001]")

You'll notice that dn.Length is consitently 1000 yet the growth in time
taken is still proportional to the full set of nodes.  IOW, the cost of
retrieving .Text is proportional to the size of document that the node
belongs to.  You can try this change to keep it to a single simple element:-

    Set dn = n.selectNodes("node[position() < 1001]/a")

It's still bad. Take it a step further lets get the text node itself:-

    Set dn = n.selectNodes("node[position() < 1001]/a/text()")

Nope still just as bad.

Now make this change

        count = count + Len(n.nodeValue)

Suddenly it much quicker and consistent.  Make these changes:-

Set dn = n.selectNodes("node[position() < 1001]/a")

count = count + Len(n.firstChild.nodeValue)

Again consistent, lets go back to selecting the node and build the text
string ourselves.

Set dn = n.selectNodes("node[position() < 1001]")

count = count + Len(n.firstChild.firstChild.nodeValue &
n.firstChild.nextSibling.nodeValue)

Ugly but still a couple of orders of magnitude faster than .Text and
consistant despite document size.

Finally remove the position predicate.  Just as fast.


Why would using the Text property be so costly? I haven't got the foggiest,
very odd and annoying.


-- 
Anthony Jones - MVP ASP/ASP.NET




transparent
Print
Mail
Like It
Disclaimer
.

These Archives are provided for informational purposes only and have been generated directly from the Altova mailing list archive system and are comprised of the lists set forth on www.altova.com/list/index.html. Therefore, Altova does not warrant or guarantee the accuracy, reliability, completeness, usefulness, non-infringement of intellectual property rights, or quality of any content on the Altova Mailing List Archive(s), regardless of who originates that content. You expressly understand and agree that you bear all risks associated with using or relying on that content. Altova will not be liable or responsible in any way for any content posted including, but not limited to, any errors or omissions in content, or for any losses or damage of any kind incurred as a result of the use of or reliance on any content. This disclaimer and limitation on liability is in addition to the disclaimers and limitations contained in the Website Terms of Use and elsewhere on the site.

.
.

transparent

transparent