Home. 
.

transparent

transparent

transparent

Altova Mailing List Archives


Re: PHP 5, XSL transformations of big files

From: "Neil Smith [MVP Digital Media]" <neil@------.--->
To: NULL
Date: 5/5/2008 10:07:00 PM

On Sun, 4 May 2008 13:12:49 -0700 (PDT), Meglio <x.meglio@g...>
wrote:

>>
>> >> > No problem with time execution, no problem with memory limits. The
>> >> > problem is script execution time - it do not completed even in 10
>> >> > minutes.
>>
>> >> You had three quick responses yet seem to have picked on the one that least
>> >> suited you, did any of the other suggestions help?
>>
>> >No. I do not know how to move forward to solve this problem.


>> In the meantime, profiling PHP using XDebug and analysing the output
>I have not skills to use CacheGrind and XDebug with PHP

Actually, you can just run those on a windows workstation with minimal
effort - cachegrind just reads the xdebug extension's output profile -
I set those up last week first time in about 20 minutes. I'm sure you
could too.


>> I'll take a look at the XSL to see if it's doing unnecessary work.


Well the bad news is the XSLT you provided is doing at least twice the
work it needs to - the are two instances of the call to process
<xsl:for-each select="./Category"> at the start of the stylesheet.

The next things I noticed are 

(1) you're using XML default output - setting <xsl:output type="text"
/> would be more appopriate for generating a SQL script for
consumption by the DB server.

(2) You're using xsl:for-each, which is slower than using
xsl:apply-templates in pretty much all XSL processors I can think of

(3) You're unnecessarily asking the processor to traverse to the
current parent node (which you're already in) by using the syntax like
<xsl:value-of select="./Gravity" /> - try - 
<xsl:value-of select="Gravity" /> instead

(4) You're doing a lot of tail recursion to quote strings, and the
actual quoting can still be bypassed in several ways (which is a
security risk if you don't trust the source provider)


As Joe noted, PHP5 supports XMLTextReader. In the document above,
you're not doing anything other than a forward read - there's no need
to invoke XSL to transform to flat text, as other tools exist.

Similarly, although SimpleXML might look attractive, it will also load
the entire document into memory, which isn't what we want here.

I spent a while to run through this and generated the following
script, which is close to your requirements (obviously run locally
checking SQL field order and that no unexpected blank entries are
made). It outputs the SQL lines as fast as possible, and discards used
data by resetting sitenodes array on each pass. 

Probably you'd want to use this as a starting point to create a class
rather than just a bunch of recursive functions.

Note we're using PDO to generate the quoted string (enable that
extension, which you should be using anyway), and only quoting the
listed strings in your original XSL (adding Id field, which you forgot
to quote in the original XSL - which is a string)

After a few optimisations, I ran this on your data in 170 seconds -
including building the SQL output script (14.5MB). I did that on a
Celeron 400 Mhz with 160MB of memory, PHP5, MySQL5 and Apache 2. 

It should be a very small part of a minute on a modern machine ;-))



HTH
Cheers - Neil


================================================


<?php

error_reporting(E_ALL);
ini_set('max_execution_time', 3600);
$start = microtime(true);


define('CONFIG_OUTPUT_SQL', 'D:\\htdocs\\XML\\sqlout.sql');
define('CONFIG_SQL_USER', 'your_db_username');
define('CONFIG_SQL_PASS', 'your_db_password');
define('CONFIG_SQL_CONN',
'mysql:host=127.0.0.1;port=3306;dbname=your_default_db');
define('CONFIG_INPUT_XML',
'D:\\htdocs\\XML\\marketplace_feed_v1.xml');

$conn = new PDO(CONFIG_SQL_CONN, CONFIG_SQL_USER, CONFIG_SQL_PASS);

$fp = fopen(CONFIG_OUTPUT_SQL, 'w+');

$reader = new XMLReader;
$reader->open(CONFIG_INPUT_XML);

while ($reader->read()) {
	if ($reader->nodeType == XMLReader::ELEMENT && $reader->name
== 'Category') {
		parseCategoryNodes();
	}
}


$reader->close();
$conn->close();
fclose($fp);


function parseCategoryNodes() {

	global $reader, $conn, $fp;

	$sQuery = "INSERT INTO marketplace_tmp (category, subcategory,
id, title, description, popularityrank, hasrecurringproducts, gravity,
earnedpersale, percentpersale, totalearningspersale, totalrebillamt,
referred, commission) VALUES ";

	while ($reader->read()) {

		if ($reader->nodeType == XMLReader::ELEMENT) {
			switch ($reader->name) {
				case 'Name' :
					$reader->read();
					if ($reader->nodeType ==
XMLReader::TEXT) {
						$current_category =
$conn->quote($reader->value);
					}
					break;

				case 'Site' :
					$sitenodes = parseSiteNodes();

					$query = $sQuery. "(";
					$query .= $current_category.",
";
					$query .= $current_category.",
";
					$query .= join(", ",
$sitenodes);
					$query .= ")\r\n";
					fwrite($fp, $query);

			}	//	End switch ($reader->name)
		}
	}	//	End while ($reader->read())

	fclose($fp);
}


function parseSiteNodes() {

	global $reader, $conn;
	$sitenodes = array();

	while ($reader->read()) {
//	Check if we reached the end of the 'Site' container node,
return the resultset
		if ($reader->nodeType == XMLReader::END_ELEMENT &&
$reader->name == 'Site') {
			return $sitenodes;
		}

		if ($reader->nodeType == XMLReader::ELEMENT) {
//	DB field names happen to be lower case here (but needn't be
for MySQL) 
			$nodename = strtolower($reader->name);
//	Read next text node which follows the field name element
			$reader->read();

			if ($reader->nodeType == XMLReader::TEXT ||
$reader->nodeType == XMLReader::CDATA) {

//	Process site nodes with category info into quoted DB strings
depending on driver
				if ($nodename == 'id' || $nodename ==
'title' || $nodename == 'description') {
					$sitenodes[$nodename] =
$conn->quote($reader->value);
				} else {
					$sitenodes[$nodename] =
($reader->value == '' ? 0 : $reader->value);
				}
			}
		}
	}
}


$end = microtime(true);

print("Duration : ".(($end - $start) * 1000)." msec");
?>


================================================
------------------------------------------------
Digital Media MVP : 2004-2008
http://mvp.support.microsoft.com/mvpfaqs


transparent
Print
Mail
Like It
Disclaimer
.

These Archives are provided for informational purposes only and have been generated directly from the Altova mailing list archive system and are comprised of the lists set forth on www.altova.com/list/index.html. Therefore, Altova does not warrant or guarantee the accuracy, reliability, completeness, usefulness, non-infringement of intellectual property rights, or quality of any content on the Altova Mailing List Archive(s), regardless of who originates that content. You expressly understand and agree that you bear all risks associated with using or relying on that content. Altova will not be liable or responsible in any way for any content posted including, but not limited to, any errors or omissions in content, or for any losses or damage of any kind incurred as a result of the use of or reliance on any content. This disclaimer and limitation on liability is in addition to the disclaimers and limitations contained in the Website Terms of Use and elsewhere on the site.

.
.

transparent

transparent