Altova Mailing List Archives


correction (Re: RFC, an ugly parser hack (and a bin-xml variant))

From: "cr88192" <cr88192@------.-------.--->
To: NULL
Date: 9/5/2005 2:12:00 PM
>
> it is, well, significantly faster than my textual parser, largely because 
> of the dramatic reduction in memory allocation. this is partly because, as 
> a matter of the format's operation, most strings are merged. likewise, it 
> is a bit smaller (around 20-30% the original size in my testing), which is 
> a bit worse than what I can get from "real" compressors, but this is no 
> big loss.
>
just rechecked and recalculated the percentages, and realized it was doing 
somewhat better than this, eg, around 10% of the original size for some 
larger files (around 900kB initially, around 1MB after being written back 
out by my app with different formatting).

binary files are presently roughly 2-2.5x as large as the output from gzip 
(eg: initial, about 900kB, my format about 100kB, gzip about 40kB).

somehow I had not taken this into account, and was going by my initial 
results with smaller xml files (eg: 1.5kB to 400 bytes, ...).
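
as a quick sanity check on the arithmetic, the sizes quoted above can be turned into percentages. this is only a sketch: the helper name `percent_of_original` is made up for illustration, and the inputs are the rounded sizes from this post, so the results are approximate.

```python
def percent_of_original(original_size, encoded_size):
    """Encoded size as a percentage of the original size (same units for both)."""
    return 100.0 * encoded_size / original_size

# Rounded sizes quoted in the post.
print("large file, binary format: %.1f%%" % percent_of_original(900, 100))   # 11.1%
print("large file, gzip:          %.1f%%" % percent_of_original(900, 40))    # 4.4%
print("small file, binary format: %.1f%%" % percent_of_original(1500, 400))  # 26.7%
print("binary vs gzip size ratio: %.1fx" % (100 / 40))                       # 2.5x
```

the small-file figure lands inside the earlier 20-30% estimate, while the larger file comes out near 10%.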

as for huffman compression, if done, it would likely come at least close to, 
or maybe beat, gzip. this is difficult to predict though, given the 
significant differences in the algorithms (gzip might win due to its ability 
to exploit patterns spanning multiple tags, but might be hurt by its 
inability to deal with regular but predictable variations in the pattern).

gzip'ing the binary variant leads to an output of about 30kB, so about 10kB 
less than gzip'ing the input file. a specialized compressor may thus have a 
chance.
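
a rough way to rerun this kind of comparison on other data is sketched below. everything in it is an assumption for illustration: the synthetic xml from `make_xml`, and especially the toy string-table encoding in `encode_merged`, which only imitates the "most strings are merged" idea and is not the binary format discussed in this thread. note that `zlib.compress` produces a zlib-wrapped deflate stream, ie, the same algorithm as gzip with a different container, so the sizes differ from gzip output by a few header bytes.

```python
import re
import struct
import zlib

def make_xml(n):
    """Build a synthetic, repetitive XML document (illustrative data only)."""
    rows = ['<item id="%d"><name>widget</name><price>%d</price></item>' % (i, i % 7)
            for i in range(n)]
    return "<items>" + "".join(rows) + "</items>"

def encode_merged(xml):
    """Toy binary form: each distinct tag/text token is stored once in a
    string table, and the document becomes a list of 16-bit table indices."""
    tokens = re.findall(r"<[^>]+>|[^<]+", xml)
    table = {}            # token -> index, in first-seen order
    indices = []
    for tok in tokens:
        if tok not in table:
            table[tok] = len(table)
        indices.append(table[tok])
    blob = "\0".join(table).encode("utf-8")    # dicts keep insertion order
    header = struct.pack("<II", len(blob), len(indices))
    body = struct.pack("<%dH" % len(indices), *indices)
    return header + blob + body

def decode_merged(data):
    """Inverse of encode_merged: rebuild the text from table + indices."""
    blob_len, count = struct.unpack_from("<II", data, 0)
    strings = data[8:8 + blob_len].decode("utf-8").split("\0")
    indices = struct.unpack_from("<%dH" % count, data, 8 + blob_len)
    return "".join(strings[i] for i in indices)

xml = make_xml(200)
text = xml.encode("utf-8")
merged = encode_merged(xml)
for label, data in [("text", text),
                    ("merged", merged),
                    ("deflate(text)", zlib.compress(text, 9)),
                    ("deflate(merged)", zlib.compress(merged, 9))]:
    print("%-16s %6d bytes" % (label, len(data)))
```

which of the two deflate outputs ends up smaller depends heavily on the data and on the binary layout, so the numbers from this toy say little about real files.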

the idea would be to encode each tag as a huffman code, possibly using an 
lz77 or markov variant for the strings (lz77+huffman is the basis of gzip 
anyway), ...

but, then again, speed may no longer be good. by this point it may have 
dropped somewhat below the speed of the normal text printer/parser, 
effectively losing part of the gain.

actually, it may yet be slower than deflate, eg, given my tendency to be 
lazy and use adaptive huffman coding most of the time (slower but generally 
easier to manage than the static varieties used in gzip/deflate). more 
precisely, the varieties I use are more often "quasi-static", eg, they only 
update every so often rather than after every symbol (I can, for example, 
encode a few kB and then rebuild the trees, which is faster than a 
pure-adaptive variant, but slower than static). as a result, for decoding at 
least, I can still use an index table (vs. having to decode the file a 
single bit at a time). one then has to tune how often the trees/tables get 
rebuilt (rebuilding more often hurts speed, but typically helps compression).
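
the quasi-static scheme described above can be sketched roughly as follows. this is a minimal illustration under stated assumptions, not the coder actually used here: byte symbols, all counts starting at 1, canonical code assignment, a bit stream kept as a python string of '0'/'1' for readability, and hypothetical names throughout (`Model`, `encode`, `decode`, the `block` parameter). encoder and decoder both rebuild the codes after every `block` symbols, so no tree needs to be transmitted, and the decoder looks symbols up in an index table instead of walking the tree bit by bit.

```python
import heapq

def code_lengths(counts):
    """Huffman code length per symbol; assumes every count is > 0."""
    heap = [(c, s, (s,)) for s, c in enumerate(counts)]   # (weight, tiebreak, symbols)
    heapq.heapify(heap)
    lengths = [0] * len(counts)
    while len(heap) > 1:
        c1, t1, syms1 = heapq.heappop(heap)
        c2, t2, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:          # every symbol under the merge gets one bit deeper
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, min(t1, t2), syms1 + syms2))
    return lengths

def canonical_codes(lengths):
    """Assign canonical codes in (length, symbol) order: symbol -> (code, length)."""
    codes, code, prev_len = {}, 0, 0
    for length, sym in sorted((l, s) for s, l in enumerate(lengths)):
        code <<= (length - prev_len)
        codes[sym] = (code, length)
        code += 1
        prev_len = length
    return codes

class Model:
    """Symbol counts plus the codes and decode table derived from them."""
    def __init__(self, nsyms=256):
        self.counts = [1] * nsyms
        self.rebuild()

    def rebuild(self):
        self.codes = canonical_codes(code_lengths(self.counts))
        self.max_len = max(length for _, length in self.codes.values())
        # Index table: every max_len-bit window that starts with a code maps to it.
        self.table = [None] * (1 << self.max_len)
        for sym, (code, length) in self.codes.items():
            span = 1 << (self.max_len - length)
            start = code << (self.max_len - length)
            self.table[start:start + span] = [(sym, length)] * span

def encode(data, block=1024):
    model, out = Model(), []
    for start in range(0, len(data), block):
        chunk = data[start:start + block]
        for sym in chunk:                # codes stay fixed within a block
            code, length = model.codes[sym]
            out.append(format(code, "0%db" % length))
        for sym in chunk:                # update statistics only at block boundaries
            model.counts[sym] += 1
        model.rebuild()
    return "".join(out)

def decode(bits, nsymbols, block=1024):
    model, out, pos = Model(), bytearray(), 0
    while len(out) < nsymbols:
        chunk = []
        for _ in range(min(block, nsymbols - len(out))):
            window = bits[pos:pos + model.max_len].ljust(model.max_len, "0")
            sym, length = model.table[int(window, 2)]   # one table lookup per symbol
            chunk.append(sym)
            pos += length
        for sym in chunk:                # mirror the encoder's update exactly
            model.counts[sym] += 1
        model.rebuild()
        out.extend(chunk)
    return bytes(out)

data = b"<item><name>widget</name></item>" * 100
bits = encode(data, block=512)
print(len(data) * 8, "bits in,", len(bits), "bits out")
```

a smaller `block` adapts faster but spends more time in `rebuild`, which is the speed vs compression tuning mentioned above. the table here has `2**max_len` entries, so a real implementation would probably want to limit code lengths or use a two-level table.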


not like it matters probably.
I am just a lame hobbyist...



