Home. 
.

transparent

transparent

transparent

Altova Mailing List Archives


Re: [xml-dev] Exploiting multi-core CPUs during XML parsing

From: Tatu Saloranta <cowtowncoder@-----.--->
To: xml-dev@-----.---.---
Date: 4/1/2006 10:22:00 PM
--- Sean McGrath <sean.mcgrath@p...> wrote:

> I have sketched out an algorithm for fast XML WF
> parsing utilising two 
> threads that each start at opposite ends of the
> octet-stream and meet in 
> the middle. The algorithm hinges on the fact that
> start- and end-tags 
> are balanced. i.e. as one thread reads forward
> looking for foo 
> start-tag, the other thread is reading backwards
> looking for foo end-tag.
> 
> This also has the nice side effect of giving you
> accurate error messages 
> quickly. i.e. as soon as a mismatched tag is found,
> it can be reported. 

Keep in mind that in general you have lots of
subtrees, so you can not really assume that one thread
only matches start elements, and the other end
elements. So you don't really know which start tag
matches which end tag, before parsing the whole file.

> This is particularly useful with recursive element
> types.

This could work for a limited subset of XML, but there
are a few gotchas. Some problems you may face are:

* Namespace resolution pretty much has to be done in
document order
* Entity expansion is tricky to do; you need to
backtrack (from reverse reader) when hitting a '&'.
 Also, when resolving external entities, you may have
to read the whole external entity in-memory first, to
find the end (from reverse reader).
* CDATA sections need special handling. Fortunately,
]]> is not allowed anywhere in textual content, so you
can match that. However, more serious problem is how
to match the opening delimited, since that is allowed
to be repeated in CDATA, like:
"<![CDATA[<![CDATA[<![CDATA ...]]>"
* Processing instructions (like CDATA) are a pain to
parse, since they can have as many start markers as
they want ("<? proc instr <? <? <? <? ... .?>").
* Comments are quite easy, fortunately, as they can
not contain '--'.
* Handling of internal subset probably has to be done
in forward (non-reverse) order. Not a huge issue
(since it's near the start, but something to keep in
mind.

Also, another question is how are you planning to
combine the results? I assume you'd build a DOM
(-like) result tree, since it's not easy to think of a
useful stream abstraction.

Anyway, it may still be a useful exercise in figuring
out how to do things, although chances are there may
not be significant speedup in the end ;-)

-+ Tatu +-


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com


transparent
Print
Mail
Like It
Disclaimer
.

These Archives are provided for informational purposes only and have been generated directly from the Altova mailing list archive system and are comprised of the lists set forth on www.altova.com/list/index.html. Therefore, Altova does not warrant or guarantee the accuracy, reliability, completeness, usefulness, non-infringement of intellectual property rights, or quality of any content on the Altova Mailing List Archive(s), regardless of who originates that content. You expressly understand and agree that you bear all risks associated with using or relying on that content. Altova will not be liable or responsible in any way for any content posted including, but not limited to, any errors or omissions in content, or for any losses or damage of any kind incurred as a result of the use of or reliance on any content. This disclaimer and limitation on liability is in addition to the disclaimers and limitations contained in the Website Terms of Use and elsewhere on the site.

.
.

transparent

transparent