Altova Mailing List Archives


Re: Large XML files

From: "Jimmy Zhang" <crackeur@-------.--->
To: NULL
Date: 1/8/2006 2:07:00 PM
You can also try VTD-XML (http://vtd-xml.sf.net), which needs memory of about
1.3~1.5x the size of the XML file. Currently it only supports files of up to
1GB, so if you have 2GB of physical memory, you can load everything into memory
and perform random access on it as you would with DOM (with DOM itself, of
course, you will get an out-of-memory exception). Support for larger files is
on the way.

"Jürgen Kahrs" <Juergen.KahrsDELETETHIS@v...> wrote in message 
news:40qo97F1bjsjpU1@i......
> jdev8080 wrote:
>
>> Basically, we have images that have associated metadata and we are
>> trying to develop a unified delivery mechanism.  Our XML documents may
>> be as large as 1GB and contain up to 100,000 images.
>>
>> My question is, has anyone done anything like this before?
>
> Yes, Andrew Schorr told me that he processes files
> of this size. After some experiments with Pyxie, he
> now uses xgawk with the XML extension of GNU Awk.
>
> http://home.vrweb.de/~juergen.kahrs/gawk/XML/
>
>> What are the performance considerations?
>
> Andrew stores each item in a separate XML file and
> then concatenates all the XML files into one large file,
> often larger than 1 GB. My own performance measurements
> tell me that a modern PC should parse about 10 MB/s.
>
>> Do the current parsers support this size of XML file?
>
> Yes, but probably only SAX-like parsers.
> DOM-like parsers have to store the complete file
> in memory and are therefore limited by the amount
> of memory. In reality, no DOM parser to date is able
> to read XML files larger than about 500 MB. If I am wrong
> about this, I bet that someone will correct me.
>
>> Is there a better way to deliver large sets of binary files (i.e. zip
>> files or something like that)?
>
> I store such files in .gz format. When reading them, it
> is a good idea _not_ to unzip them to disk first. Use gzip
> to produce a stream of decompressed data which will be
> immediately processed by the SAX parser:
>
>  gzip -dc large_file.xml.gz | parser ...
>
> The advantage of this approach is that at any given moment,
> only part of the file will occupy space in memory. This is
> extremely fast and your server can run a hundred such
> processes on each CPU in parallel.
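
If you would rather do the same thing from a script than from a shell pipe,
the gzip-into-SAX idea quoted above can be sketched roughly as follows. This
is only a minimal sketch in Python (my choice of language, nobody in the
thread mentioned it), using the standard gzip and xml.sax modules. The names
ElementCounter and count_elements and the tiny sample document are made up
for illustration; a real handler would do whatever per-image processing you
need instead of counting tags.

```python
import gzip
import io
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts start tags by name.

    It keeps only the counts, never the document itself, so memory use
    does not grow with the size of the file.
    """

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(gz_source):
    """Parse a gzip-compressed XML source without unpacking it to disk.

    gz_source may be a file name or a binary file object. gzip.open
    decompresses on the fly and the SAX parser reads it chunk by chunk.
    """
    handler = ElementCounter()
    with gzip.open(gz_source, "rb") as xml_stream:
        xml.sax.parse(xml_stream, handler)
    return handler.counts


# Small in-memory stand-in for a large .gz file (assumed sample data).
sample = (b"<images><image id='1'><meta/></image>"
          b"<image id='2'><meta/></image></images>")
compressed = io.BytesIO(gzip.compress(sample))
demo_counts = count_elements(compressed)
print(demo_counts)
```

For a real file you would pass its path, e.g. count_elements("large_file.xml.gz").
As with the shell version, only the current chunk of decompressed data and the
handler's own state are held in memory at any time.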



