Altova Mailing List Archives > comp.text.xml

correction (Re: RFC, an ugly parser hack (and a bin-xml variant))
To: NULL
Date: 9/5/2005 2:12:00 PM

> it is, well, significantly faster than my textual parser, largely because
> of the dramatic reduction in memory allocation. this is partly because, as
> a matter of the format's operation, most strings are merged. likewise, it
> is a bit smaller (around 20-30% the original size in my testing), which is
> a bit worse than what I can get from "real" compressors, but this is no
> big loss.

just checked and recalculated the percentages, and realized it was doing somewhat better than this, e.g., around 10% of the original size for some larger files (around 900kB initially, around 1MB after being spit back out from my app with different formatting).

binary files are presently about 2x as large as the output from gzip (e.g.: initial file about 900kB, my format about 100kB, gzip about 40kB).

somehow I had not taken this into account, remembering only my initial results with smaller xml files (e.g.: 1.5kB down to 400 bytes, ...).

as for huffman compression, if done, the result would likely be at least close to that of gzip, or maybe better. this is difficult to predict though, given the significant differences in the algorithms (gzip might win due to its ability to exploit patterns spanning multiple tags, but might be hurt by its inability to deal with regular but predictable variations in the pattern).

gzipping the binary variant leads to an output of about 30kB, so about 10kB less than gzipping the input file.

a specialized compressor may thus have a chance: each tag as a huffman code, possibly using an lz77 or markov variant for the strings (lz77+huffman is the basis of gzip anyway), ...

but, then again, speed may no longer be good.
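
the size figures above can be double-checked with a few lines of python. this is only a sketch: `percent_of` and `gzip_size` are made-up helper names, the sizes are the rounded kB figures from this post, and the sample document is invented for illustration (it is not one of the files I tested).

```python
import gzip


def percent_of(original_size, new_size):
    """Return new_size as a percentage of original_size."""
    return 100.0 * new_size / original_size


def gzip_size(data, level=9):
    """Return the size in bytes of data after gzip compression."""
    return len(gzip.compress(data, compresslevel=level))


if __name__ == "__main__":
    # Rounded sizes (in kB) quoted in the post.
    original, binary, gzipped, binary_gzipped = 900, 100, 40, 30
    for name, size in [("binary", binary), ("gzip", gzipped),
                       ("binary + gzip", binary_gzipped)]:
        print(f"{name}: {percent_of(original, size):.1f}% of original")

    # Hypothetical sample document, only to show gzip_size in use.
    sample = b"<items>" + b"<item id='1'>text</item>" * 1000 + b"</items>"
    print(len(sample), "->", gzip_size(sample), "bytes")
```

with these rounded figures the binary format comes out at about 11% of the original, gzip at about 4.4%, and gzip applied to the binary format at about 3.3%.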
by this point it may have dropped somewhat below the speed of the normal text printer/parser, effectively losing part of the gain.

actually, it may yet be slower than deflate, e.g., given my tendency to be lazy and use adaptive huffman coding most of the time (slower, but generally easier to manage than the static varieties used in gzip/deflate).

actually, the varieties I use are more often "quasi-static", i.e., they only update every so often rather than after every symbol (I can, for example, encode a few kB and then rebuild the trees, which is faster than a pure-adaptive variant, but slower than static). as a result, for decoding at least, I can still use an index table (rather than having to decode the file a single bit at a time). one then has to tune how often to rebuild the trees/tables (rebuilding more often hurts speed, but typically helps compression).

not like it matters, probably. I am just a lame hobbyist...
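
the "quasi-static" idea can be sketched roughly as below. this is a minimal illustration and not my actual code: the names `build_codes`, `encode`, `decode`, and the `block` parameter are invented for the example, bits are kept as a string of '0'/'1' characters for readability, and the decoder uses a plain dictionary lookup where a real one would use an index table. the point is only that the code table stays fixed between rebuilds, and that encoder and decoder rebuild it from the same counts at the same positions.

```python
import heapq
from collections import Counter


def build_codes(counts):
    """Build a Huffman table {symbol: bitstring} from symbol counts."""
    # Heap entries are (weight, tie_breaker, {symbol: partial_code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {s: "0" for s in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(data, block=4096):
    """Encode bytes, rebuilding the code table every `block` symbols."""
    counts = Counter({b: 1 for b in range(256)})  # every byte gets a code
    codes = build_codes(counts)
    out = []
    for i, b in enumerate(data, 1):
        out.append(codes[b])
        counts[b] += 1
        if i % block == 0:  # table is static between rebuilds
            codes = build_codes(counts)
    return "".join(out)


def decode(bits, n, block=4096):
    """Decode n symbols, mirroring the encoder's rebuild schedule."""
    counts = Counter({b: 1 for b in range(256)})
    table = {c: s for s, c in build_codes(counts).items()}
    out = bytearray()
    cur = ""
    for bit in bits:
        cur += bit
        if cur in table:  # codes are prefix-free, so the first hit is right
            sym = table[cur]
            out.append(sym)
            counts[sym] += 1
            cur = ""
            if len(out) == n:
                break
            if len(out) % block == 0:
                table = {c: s for s, c in build_codes(counts).items()}
    return bytes(out)
```

a smaller `block` tracks the data more closely but spends more time rebuilding, which is the speed/compression tradeoff described above.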
