Home. 
.

transparent

transparent

transparent

Altova Mailing List Archives


Re: PHP 5, XSL transformations of big files

From: Meglio <x.meglio@-----.--->
To: NULL
Date: 5/12/2008 9:36:00 AM
On May 6, 1:07 am, "Neil Smith [MVP Digital Media]" <n...@nospam.com>
wrote:
> On Sun, 4 May 2008 13:12:49 -0700 (PDT), Meglio <x.meg...@gmail.com>
> wrote:
>
>
>
> >> >> > No problem with time execution, no problem with memory limits. The
> >> >> > problem is script execution time - it do not completed even in 10
> >> >> > minutes.
>
> >> >> You had three quick responses yet seem to have picked on the one that least
> >> >> suited you, did any of the other suggestions help?
>
> >> >No. I do not know how to move forward to solve this problem.
> >> In the meantime, profiling PHP using XDebug and analysing the output
> >I have not skills to use CacheGrind and XDebug with PHP
>
> Actually, you can just run those on a windows workstation with minimal
> effort - cachegrind just reads the xdebug extension's output profile -
> I set those up last week first time in about 20 minutes. I'm sure you
> could too.
>
> >> I'll take a look at the XSL to see if it's doing unnecessary work.
>
> Well the bad news is the XSLT you provided is doing at least twice the
> work it needs to - the are two instances of the call to process
> <xsl:for-each select="./Category"> at the start of the stylesheet.
>
> The next things I noticed are
>
> (1) you're using XML default output - setting <xsl:output type="text"
> /> would be more appopriate for generating a SQL script for
> consumption by the DB server.
>
> (2) You're using xsl:for-each, which is slower than using
> xsl:apply-templates in pretty much all XSL processors I can think of
>
> (3) You're unnecessarily asking the processor to traverse to the
> current parent node (which you're already in) by using the syntax like
> <xsl:value-of select="./Gravity" /> - try -
> <xsl:value-of select="Gravity" /> instead
>
> (4) You're doing a lot of tail recursion to quote strings, and the
> actual quoting can still be bypassed in several ways (which is a
> security risk if you don't trust the source provider)
>
> As Joe noted, PHP5 supports XMLTextReader. In the document above,
> you're not doing anything other than a forward read - there's no need
> to invoke XSL to transform to flat text, as other tools exist.
>
> Similarly, although SimpleXML might look attractive, it will also load
> the entire document into memory, which isn't what we want here.
>
> I spent a while to run through this and generated the following
> script, which is close to your requirements (obviously run locally
> checking SQL field order and that no unexpected blank entries are
> made). It outputs the SQL lines as fast as possible, and discards used
> data by resetting sitenodes array on each pass.
>
> Probably you'd want to use this as a starting point to create a class
> rather than just a bunch of recursive functions.
>
> Note we're using PDO to generate the quoted string (enable that
> extension, which you should be using anyway), and only quoting the
> listed strings in your original XSL (adding Id field, which you forgot
> to quote in the original XSL - which is a string)
>
> After a few optimisations, I ran this on your data in 170 seconds -
> including building the SQL output script (14.5MB). I did that on a
> Celeron 400 Mhz with 160MB of memory, PHP5, MySQL5 and Apache 2.
>
> It should be a very small part of a minute on a modern machine ;-))
>
> HTH
> Cheers - Neil
>
> ================================================
>
> <?php
>
> error_reporting(E_ALL);
> ini_set('max_execution_time', 3600);
> $start = microtime(true);
>
> define('CONFIG_OUTPUT_SQL', 'D:\\htdocs\\XML\\sqlout.sql');
> define('CONFIG_SQL_USER', 'your_db_username');
> define('CONFIG_SQL_PASS', 'your_db_password');
> define('CONFIG_SQL_CONN',
> 'mysql:host=127.0.0.1;port=3306;dbname=your_default_db');
> define('CONFIG_INPUT_XML',
> 'D:\\htdocs\\XML\\marketplace_feed_v1.xml');
>
> $conn = new PDO(CONFIG_SQL_CONN, CONFIG_SQL_USER, CONFIG_SQL_PASS);
>
> $fp = fopen(CONFIG_OUTPUT_SQL, 'w+');
>
> $reader = new XMLReader;
> $reader->open(CONFIG_INPUT_XML);
>
> while ($reader->read()) {
>         if ($reader->nodeType == XMLReader::ELEMENT && $reader->name
> == 'Category') {
>                 parseCategoryNodes();
>         }
>
> }
>
> $reader->close();
> $conn->close();
> fclose($fp);
>
> function parseCategoryNodes() {
>
>         global $reader, $conn, $fp;
>
>         $sQuery = "INSERT INTO marketplace_tmp (category, subcategory,
> id, title, description, popularityrank, hasrecurringproducts, gravity,
> earnedpersale, percentpersale, totalearningspersale, totalrebillamt,
> referred, commission) VALUES ";
>
>         while ($reader->read()) {
>
>                 if ($reader->nodeType == XMLReader::ELEMENT) {
>                         switch ($reader->name) {
>                                 case 'Name' :
>                                         $reader->read();
>                                         if ($reader->nodeType ==
> XMLReader::TEXT) {
>                                                 $current_category =
> $conn->quote($reader->value);
>                                         }
>                                         break;
>
>                                 case 'Site' :
>                                         $sitenodes = parseSiteNodes();
>
>                                         $query = $sQuery. "(";
>                                         $query .= $current_category.",
> ";
>                                         $query .= $current_category.",
> ";
>                                         $query .= join(", ",
> $sitenodes);
>                                         $query .= ")\r\n";
>                                         fwrite($fp, $query);
>
>                         }       //      End switch ($reader->name)
>                 }
>         }       //      End while ($reader->read())
>
>         fclose($fp);
>
> }
>
> function parseSiteNodes() {
>
>         global $reader, $conn;
>         $sitenodes = array();
>
>         while ($reader->read()) {
> //      Check if we reached the end of the 'Site' container node,
> return the resultset
>                 if ($reader->nodeType == XMLReader::END_ELEMENT &&
> $reader->name == 'Site') {
>                         return $sitenodes;
>                 }
>
>                 if ($reader->nodeType == XMLReader::ELEMENT) {
> //      DB field names happen to be lower case here (but needn't be
> for MySQL)
>                         $nodename = strtolower($reader->name);
> //      Read next text node which follows the field name element
>                         $reader->read();
>
>                         if ($reader->nodeType == XMLReader::TEXT ||
> $reader->nodeType == XMLReader::CDATA) {
>
> //      Process site nodes with category info into quoted DB strings
> depending on driver
>                                 if ($nodename == 'id' || $nodename ==
> 'title' || $nodename == 'description') {
>                                         $sitenodes[$nodename] =
> $conn->quote($reader->value);
>                                 } else {
>                                         $sitenodes[$nodename] =
> ($reader->value == '' ? 0 : $reader->value);
>                                 }
>                         }
>                 }
>         }
>
> }
>
> $end = microtime(true);
>
> print("Duration : ".(($end - $start) * 1000)." msec");
> ?>
>
> ================================================
> ------------------------------------------------
> Digital Media MVP : 2004-2008http://mvp.support.microsoft.com/mvpfaqs

Hi, Neil.
Thank you again for you comments and advices.

I improved your PHP script (you do not take into account that there
are subcategories inside categories) and I embedded it to my
automation and it works good and fast now (up to 1 minute).

Thank you very much for you help! I can donate some encouragement if
you have PayPal or Google Checkout account ;) You helped me a lot
really!!

Anton


transparent
Print
Mail
Like It
Disclaimer
.

These Archives are provided for informational purposes only and have been generated directly from the Altova mailing list archive system and are comprised of the lists set forth on www.altova.com/list/index.html. Therefore, Altova does not warrant or guarantee the accuracy, reliability, completeness, usefulness, non-infringement of intellectual property rights, or quality of any content on the Altova Mailing List Archive(s), regardless of who originates that content. You expressly understand and agree that you bear all risks associated with using or relying on that content. Altova will not be liable or responsible in any way for any content posted including, but not limited to, any errors or omissions in content, or for any losses or damage of any kind incurred as a result of the use of or reliance on any content. This disclaimer and limitation on liability is in addition to the disclaimers and limitations contained in the Website Terms of Use and elsewhere on the site.

.
.

transparent

transparent