Home. 
.

transparent

transparent

transparent

Altova Mailing List Archives


lazy trees and consuming iterators: yet another processing model

From: Jay Carlson <nop@---.--->
To: xml-dev@-----.---.---
Date: 3/1/2004 7:58:00 PM
I've designed a hybrid API for XML parsing.  Oh no, not another one.
Here's the two bullet summary:

  o  Document trees are parsed lazily; it looks like a tree model

  o  _Consuming iterators_ operate on the children of a node, removing
     each child from its parent before processing

In this model, you can flip back and forth between random access to a
subtree for convenience, and consuming iterators for all the
performance advantages of event pull---with minimal changes to code.

An initial implementation of this and some notes are at
http://lua-users.org/wiki/LazyKit .  The closest thing to this model
that I've found is XPP2's XmlPullNode; I'm looking for other
references.

This was inspired by XML::Twig and pulldom.  In particular, pulldom's
terse documentation includes this quote:

   As you can see, the events stream object is pretty
   sophisticated. Still, it has limits. You can only expand the
   current node because events and nodes relating to any other node
   are probably lost in the mists of time. An XML document into a
   random access data structure!

"Lost in the mists of time" sounded a little like how garbage
collectors work: they free memory when they can prove that the memory
can't affect the future of the program.  That thought led to the
obvious Lua code:

  for i,subtree in xpairs(parent) do
    parent[i] = nil
    do_something_with(subtree)
  end

The xpairs_c() iterator is effectively a shortcut for that:

  for i,subtree in xpairs_c(parent) do
    do_something_with(subtree)
  end

Note that up until you begin consuming a tree, you can use all the
conventional iterators and random access you want.  But once you start
consuming, you've promised that you no longer care about addressing
its (direct) contents.

From an event pull perspective, then, the tree view is an
unlimited-length lookahead buffer of events associated with an
element.

One concrete example is XML-RPC <value> nodes. If they contain an
element such as <integer>, that's the value of the <value>
node. Otherwise, the character data content of the <value> element is
a string-typed value. The code to process the element must potentially
look ahead an arbitrary number of initial character data nodes to find
out whether there is an element lurking inside, or it must default to
string content.

Even though you're consuming a tree, you can still save references to
subtrees:

  for i,subtree in xpairs_c(parent) do
    if subtree.name == "xref" then
      table.insert(references, subtree)
    end
  end

(pedantic: subtree is also lazily parsed, so at the time it's added to
the references table, it only physically contains a start node.  But
as soon as the next element in parent is touched, it will be
completed.)

In the above example, non-<xref> nodes will still be fully loaded into
memory, as the iterator can't tell that you haven't saved a reference
to them.  The current remedy is to manually call consume() on nodes
you don't want to keep, and filtering iterators will do that for
skipped nodes.  But in a different implementation environment,
examination of refcounts inside the iterator implementation could
determine that the non-<xref> nodes were unreachable, and hence
skippable.

Although I've not yet implemented it, consuming can interact with the
underlying parser.  Consider:

  <document>
    <firstname>Jay</firstname>
    <lastname>Carlson</lastname>
    <bodytext>Spending too much time listening to <ref>In Utero</ref>
       can be [...]
    </bodytext>
    <title>I Think I'm DOM</title>
  </document>

  lastname, title = xmlextract.strings_consume(tree, "lastname", "title")

The strings_consume filter can potentially turn off character data and
other events inside any node it knows it doesn't need (like bodytext),
as references to them cannot possibly affect the rest of the program.

Finally, an example.  This converts a tiny subset of XHTML to the Lua
Wiki format.  A few things to note:

  o  This uses a consuming iterator (switch_c) but would be just as
     correct using a non-consuming iterator (switch).

  o  Many handlers (like body/h1) use xstring(), which is a conventional
     access tool for trees.  It knows nothing about consuming.

  o  The only memory used is for the current tree being processed and
     its parents.  Feel free to run it on a 4G file!

(...well, as long as those <h1> nodes are small...xstring() doesn't
consume, so their contents will be fully loaded.  The price of
convenience and reuse...)

Jay Carlson
nop@n...

--- snip ---

local root=lazytree.parsefile(filename)

local function printi(s)
   -- no newline
   io.stdout:write(s)
end

local ftable = {
    head=function () end;
    body={
        h1=function (h1)
            print("=== "..xstring(h1).." ===")
        end;
        h2=function (h2)
            print("=== "..xstring(h2).." ===")
        end;
        h3=function (h3)
            print("== "..xstring(h3).." ==")
        end;
        pre=function (pre)
            print("        {{{"..xstring(pre).." }}}")
        end;
        p={
            [""]=function (s)
                s = string.gsub(s, "\n[ ]+", "\n")
                s = string.gsub(s, "[ ]+", " ")
                printi(s)
            end;
            code=function (code)
                local s = xstring(code)
                s = string.gsub(s, "\n", " ")
                s = string.gsub(s, " [ ]+", " ")
                printi("{{"..s.."}}")
            end;
            dfn=function (dfn)
                printi("''"..xstring(dfn).."''")
            end;
            [-1]=function () print() print() end;
        }
    }
}

xmliter.switch_c(root, ftable, {no_tags=true})


transparent
Print
Mail
Like It
Disclaimer
.

These Archives are provided for informational purposes only and have been generated directly from the Altova mailing list archive system and are comprised of the lists set forth on www.altova.com/list/index.html. Therefore, Altova does not warrant or guarantee the accuracy, reliability, completeness, usefulness, non-infringement of intellectual property rights, or quality of any content on the Altova Mailing List Archive(s), regardless of who originates that content. You expressly understand and agree that you bear all risks associated with using or relying on that content. Altova will not be liable or responsible in any way for any content posted including, but not limited to, any errors or omissions in content, or for any losses or damage of any kind incurred as a result of the use of or reliance on any content. This disclaimer and limitation on liability is in addition to the disclaimers and limitations contained in the Website Terms of Use and elsewhere on the site.

.
.

transparent

transparent