Altova Mailing List Archives


Re: RFC, an ugly parser hack (and a bin-xml variant)

From: "cr88192" <cr88192@------.-------.--->
To: NULL
Date: 9/7/2005 5:03:00 PM
"Soren Kuula" <dongfang@d...> wrote in message 
news:9RqTe.66931$Fe7.224658@n......
> cr88192 wrote:
>> for various reasons, I added an imo ugly hack to my xml parser.
>> basically, I wanted the ability to have binary payload within the xml 
>> parse trees.
>>
>> this was partly because I came up with a binary xml format (mentioned 
>> more later), and thought it would be "useful" to be able to store binary 
>> data inline with this format, and still wanted to keep things balanced 
>> (whatever the binary version can do, the textual version can do as well).
>>
>> the approach involved, well, a bastardized subset of xml-data.
>> the attribute 'dt:dt' now has a special meaning (along with the rest of 
>> the 'dt' namespace prefix), and the contents of such nodes are parsed 
>> specially (though still within xml's syntactic rules, eg, as a normal xml 
>> text glob).
>
> My comment: Why don't you use the normal namespace mechanism, instead of 
> magic prefixes?
>
well, at the time I had figured it would be an extra hassle.

now, I am thinking, it makes more sense anyways, I just need to specify that 
a certain namespace needs to be used for binary tags, and make the necessary 
changes to allow resolving the namespace.

> The parser must store the namespace prefix-->URI bindings it has 
> encountered so far at some place anyway (if it's a namespace aware 
> parser). It should then also be possible to modify it to go to binary mode 
> when entering elements in your special namespace -- and leave it again 
> when exiting (keep a counter of the nesting depth, increment, decrement). 
> In binary mode, it will decode text to binary.
>
yes, I could do this.
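
A minimal sketch of the counter-based approach suggested above, using Python's namespace-aware SAX parser. The namespace URI `urn:example:binary`, the choice of base64 as the text encoding, and the names `BinaryModeHandler` and `parse_payloads` are all assumptions made for illustration; none of them come from the parser discussed in this thread.

```python
import base64
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

BIN_NS = "urn:example:binary"  # hypothetical namespace URI for binary tags


class BinaryModeHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth of elements in BIN_NS
        self.chunks = []    # text collected while in binary mode
        self.payloads = []  # decoded (local name, bytes) pairs

    def startElementNS(self, name, qname, attrs):
        uri, _local = name
        if uri == BIN_NS:
            if self.depth == 0:
                self.chunks = []  # entering binary mode
            self.depth += 1

    def characters(self, content):
        if self.depth > 0:
            self.chunks.append(content)

    def endElementNS(self, name, qname):
        uri, local = name
        if uri == BIN_NS:
            self.depth -= 1
            if self.depth == 0:
                # leaving binary mode: decode the collected text (assumed base64)
                text = "".join(self.chunks)
                self.payloads.append((local, base64.b64decode(text)))


def parse_payloads(xml_text):
    handler = BinaryModeHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)  # resolve prefixes to URIs
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return handler.payloads
```

Because the handler compares namespace URIs rather than prefixes, a document may bind any prefix it likes to the binary namespace.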

a hassle though is that the parser, during parsing, does not know the 
correct namespaces. namespaces are resolved by stepping up the tree, and, 
presently, nodes are not bound into the tree until after they are parsed (as 
a result, I can't look up the tree during parsing).

an alteration would be to always pass the parent node to the parse-node 
function and set the 'up' value before parsing sub-expressions, so that 
the search up the tree can be done during parsing. that would let the 
parser resolve namespaces as it goes and avoid needing a magic prefix...
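
The alteration described above can be sketched roughly as follows. This is an illustration only, not the actual parser: the `Node` class, the pre-tokenized input of `(kind, value, attrs)` tuples, and the function names are hypothetical. The point is that `up` is set before any child is parsed, so the walk up the tree already works during parsing.

```python
class Node:
    def __init__(self, tag, attrs, up):
        self.tag = tag
        self.attrs = attrs
        self.up = up        # parent link, set before children are parsed
        self.ns = None      # namespace URI resolved at parse time
        self.children = []


def resolve_prefix(node, prefix):
    """Step up the tree until some ancestor declares the prefix."""
    key = "xmlns:" + prefix if prefix else "xmlns"
    while node is not None:
        if key in node.attrs:
            return node.attrs[key]
        node = node.up
    return None


def parse_node(tokens, pos, parent):
    """Recursively parse one element; returns (node, next position)."""
    kind, tag, attrs = tokens[pos]
    assert kind == "open"
    node = Node(tag, attrs, parent)
    prefix = tag.split(":", 1)[0] if ":" in tag else ""
    node.ns = resolve_prefix(node, prefix)  # possible because 'up' is set
    pos += 1
    while tokens[pos][0] != "close":
        if tokens[pos][0] == "open":
            child, pos = parse_node(tokens, pos, node)
            node.children.append(child)
        else:  # a ("text", value, None) token
            node.children.append(tokens[pos][1])
            pos += 1
    return node, pos + 1
```

With the namespace known as soon as the start tag is read, the parser could decide on the spot whether the node's content needs the special binary handling.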


my parser is a recursive function (as opposed to a stack or similar).
also, it is damn slow, or at least when I throw larger files at it...
I could probably try to make it faster if needed.


also, I will add a comment:
the files are a little smaller, eg, the 900kB xml file now becomes 85kB in 
my format, and about 200kB in wbxml (not using any dtd's or similar, which 
is probably a bad case for wbxml). this was after a few minor tweaks to the 
format (eg: adding a workable means to eliminate many end markers).

otherwise, the wbxml writer may be a bit naive, which could explain the 
size. most of the tags and text contents end up as string table references, 
which are naturally more expensive than in my format (a few bytes, vs. a 
single byte for anything in the mru list).
small files using dtd's are likely to do better (for large files, I doubt it 
makes much difference, the costs should be about the same since the string 
table contributes little to the total size).
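
To illustrate why an mru list makes repeated strings cheap, here is a rough sketch of one possible mru coding scheme. The byte layout is an assumption made up for this example (an escape byte `0xFF` followed by a NUL-terminated UTF-8 string on a miss, a single index byte on a hit); the actual format is not specified in this post. The list size of 32 is borrowed from the namespace mru mentioned below.

```python
MRU_SIZE = 32   # list size; 32 is borrowed from the namespace mru
ESCAPE = 0xFF   # hypothetical marker: a literal string follows


def mru_encode(strings, size=MRU_SIZE):
    """Encode strings (which must not contain NUL) against an mru list."""
    mru = []
    out = bytearray()
    for s in strings:
        if s in mru:
            idx = mru.index(s)
            out.append(idx)          # hit: a single byte
            mru.pop(idx)
        else:
            out.append(ESCAPE)       # miss: escape + literal + terminator
            out += s.encode("utf-8") + b"\x00"
            if len(mru) == size:
                mru.pop()            # evict the least recently used entry
        mru.insert(0, s)             # most recent entry moves to the front
    return bytes(out)


def mru_decode(data, size=MRU_SIZE):
    """Inverse of mru_encode; mirrors the same list updates."""
    mru = []
    out = []
    pos = 0
    while pos < len(data):
        b = data[pos]
        pos += 1
        if b == ESCAPE:
            end = data.index(0, pos)
            s = data[pos:end].decode("utf-8")
            pos = end + 1
            if len(mru) == size:
                mru.pop()
        else:
            s = mru.pop(b)
        mru.insert(0, s)
        out.append(s)
    return out
```

Since the decoder replays exactly the same list updates as the encoder, no table needs to be stored in the file, and any tag seen recently costs one byte.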

also, I don't know if wbxml supports namespaces (I guess it could be done if 
the prefix is treated as part of the tag). if this is the case, then it is 
probable wbxml could win out with namespace-heavy code.
why do I need a 32 element namespace mru anyways? it is doubtful this many 
unique prefixes will be used anyways. it just fit well with the pattern I 
guess.

the gzip'ed version is still about 40kB.

sizewise, my format is worse than gzip, but not that terribly worse. beating 
gzip wrt size would risk compromising speed...
at least, it doesn't require decoding and using my textual parser...



