Altova Mailing List Archives > comp.text.xml Archive

Re: RFC, an ugly parser hack (and a bin-xml variant)
To: NULL
Date: 9/7/2005 5:03:00 PM

"Soren Kuula" <dongfang@d...> wrote in message news:9RqTe.66931$Fe7.224658@n......
> cr88192 wrote:
>> for various reasons, I added an imo ugly hack to my xml parser.
>> basically, I wanted the ability to have binary payload within the xml
>> parse trees.
>>
>> this was partly because I came up with a binary xml format (mentioned
>> more later), and thought it would be "useful" to be able to store binary
>> data inline with this format, and still wanted to keep things balanced
>> (whatever the binary version can do, the textual version can do as well).
>>
>> the approach involved, well, a bastardized subset of xml-data.
>> the attribute 'dt:dt' now has a special meaning (along with the rest of
>> the 'dt' namespace prefix), and the contents of such nodes are parsed
>> specially (though still within xml's syntactic rules, eg, as a normal xml
>> text glob).
>
> My comment: Why don't you use the normal namespace mechanism, instead of
> magic prefixes?
>

well, at the time I had figured it would be an extra hassle.
now I am thinking it makes more sense anyways: I just need to specify that a certain namespace is to be used for binary tags, and make the changes needed to allow resolving the namespace.

> The parser must store the namespace prefix-->URI bindings it had
> encountered so far at some place anyway (if it's a namespace aware
> parser). It should then also be possible to modify it to go to binary mode
> when entering elements in your special namespace -- and leave it again
> when exiting (keep a counter of the nesting depth, increment, decrement).
> In binary mode, it will decode text to binary.
>

yes, I could do this.
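A minimal sketch of the suggestion quoted above, in Python rather than whatever the real parser is written in. Everything named here is an assumption for illustration: the namespace URI `urn:example:binary`, the `(kind, value, attrs)` event tuples, and base64 as the text encoding for binary payloads (the post does not say which encoding is used).

```python
import base64

# Hypothetical namespace URI for binary elements (not from the original post).
BINARY_NS = "urn:example:binary"

def decode_events(events):
    """Walk (kind, value, attrs) events and decode text seen in binary mode."""
    scopes = [{}]      # stack of prefix -> URI bindings, innermost scope last
    entered = []       # per open element: did it switch binary mode on?
    binary_depth = 0   # > 0 while inside an element bound to BINARY_NS
    out = []
    for kind, value, attrs in events:
        if kind == "start":
            # Inherit the enclosing bindings, then apply this element's xmlns attributes.
            bindings = dict(scopes[-1])
            for key, uri in attrs.items():
                if key == "xmlns":
                    bindings[""] = uri
                elif key.startswith("xmlns:"):
                    bindings[key[len("xmlns:"):]] = uri
            scopes.append(bindings)
            prefix = value.split(":", 1)[0] if ":" in value else ""
            is_binary = bindings.get(prefix) == BINARY_NS
            entered.append(is_binary)
            if is_binary:
                binary_depth += 1   # entering binary mode (or nesting deeper)
        elif kind == "end":
            scopes.pop()
            if entered.pop():
                binary_depth -= 1   # leaving the element that raised the counter
        elif kind == "text":
            # Assumption: binary payloads are base64 text.
            out.append(base64.b64decode(value) if binary_depth > 0 else value)
    return out
```

The counter means text inside any descendant of a binary element is also decoded, which is one reading of "increment, decrement"; a parser that only wants direct children decoded would check the innermost flag instead.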
a hassle though is that the parser, during parsing, does not know the correct namespaces. namespaces are resolved by stepping up the tree, and, presently, nodes are not bound into the tree until after they are parsed (as a result, I can't look up the tree during parsing).

an alteration would be to always pass the parent node to the parse-node function and set the 'up' value before parsing sub-expressions, so that the search could be done during parsing, making it possible to resolve namespaces and avoiding the need for a magic prefix...

my parser is a recursive function (as opposed to using a stack or similar). also, it is damn slow, or at least when I throw larger files at it... I could probably try to make it faster if needed.

also, I will add a comment:
the files are a little smaller, eg, the 900kB xml file now becomes 85kB in my format, and about 200kB in wbxml (not using any dtds or similar, which is probably a bad case for wbxml). this was after a few minor tweaks to the format (eg: adding a workable means to eliminate many end markers).

otherwise, the wbxml writer may be a bit naive, which could account for the size. most of the tags and text contents end up as string table references, which are naturally more expensive than in my format (a few bytes, vs. a single byte for anything in the mru list). small files using dtds are likely to do better (for large files, I doubt it makes much difference, the costs should be about the same since the string table contributes little to the total size).

also, I don't know if wbxml supports namespaces (I guess it could be done if the prefix is treated as part of the tag). if this is the case, then it is probable wbxml could win out with namespace-heavy code.

why do I need a 32 element namespace mru anyways? it is doubtful this many unique prefixes will be used. it just fit well with the pattern I guess.

the gzip'ed version is still about 40kB. sizewise, my format is worse than gzip, but not that terribly worse.
beating gzip wrt size would risk compromising speed... at least, it doesn't require decoding and using my textual parser...
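For reference, the alteration mentioned earlier in this post (pass the parent to the parse-node function and set 'up' before parsing sub-expressions) could look roughly like the Python sketch below. It is not the actual parser: the `Node` class, the pre-tokenized `(kind, value, attrs)` input, and the names `parse_node` and `lookup_ns` are all made up for illustration.

```python
class Node:
    def __init__(self, name, attrs, up):
        self.name = name
        self.attrs = attrs
        self.up = up       # parent link, valid before any child is parsed
        self.kids = []
        self.ns = None     # resolved namespace URI, filled in by parse_node

def lookup_ns(node, prefix):
    """Resolve a prefix by stepping up the tree through the 'up' links."""
    key = "xmlns:" + prefix if prefix else "xmlns"
    while node is not None:
        if key in node.attrs:
            return node.attrs[key]
        node = node.up
    return None

def parse_node(tokens, pos, parent):
    """Parse the element opening at tokens[pos]; return (node, next_pos)."""
    kind, name, attrs = tokens[pos]
    if kind != "open":
        raise ValueError("expected an open token")
    node = Node(name, attrs, parent)   # 'up' is set here, before recursing
    pos += 1
    prefix = name.split(":", 1)[0] if ":" in name else ""
    node.ns = lookup_ns(node, prefix)  # possible mid-parse because 'up' is set
    while tokens[pos][0] != "close":
        if tokens[pos][0] == "open":
            kid, pos = parse_node(tokens, pos, node)
            node.kids.append(kid)
        else:
            node.kids.append(tokens[pos][1])   # plain text child
            pos += 1
    return node, pos + 1               # skip the matching close token
```

A child is still only added to its parent's `kids` list after it has been fully parsed, but since its `up` link is already set, the namespace of each element is known before its contents are read, which is the point at which a parser could decide to treat the text as binary.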
