Altova Mailing List Archives


Re: Converting poorly formed HTML into well-formed XML

From: "Steve Muench" <smuench@------------->
Date: 9/26/2000 9:56:00 AM
| Does XSLT have the facilities to directly 
| read in the poorly formed HTML?

No, XSLT has no built-in facility for this. An XSLT processor
expects well-formed XML as its input.

I'd recommend JTidy, Andy Quick's excellent open-source Java
implementation of Dave Raggett's HTML "Tidy" utility.

http://www3.sympatico.ca/ac.quick/jtidy.html

It can expose a DOM API to the "tidied-up" (that is, well-formed)
XML tree for any ill-formed HTML document. You can then pass
the DOM Document into your XSLT engine for transformation.
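
As a rough sketch of that second step (handing a DOM Document to an
XSLT engine), here is one way it might look with the standard JAXP
javax.xml.transform API. The class name DomScrape, the stylesheet, and
the sample table data are all made up for illustration. To keep the
sketch free of external jars, the Document is parsed from an
already-well-formed string. With JTidy on the classpath, the Document
would instead come from JTidy's DOM parsing call (believed to be
Tidy.parseDOM, but check the JTidy docs for the exact signature).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomScrape {

    // Hypothetical stylesheet: turns each table row into a <quote> element.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<quotes>"
      + "<xsl:for-each select='//tr'>"
      + "<quote symbol='{td[1]}' price='{td[2]}'/>"
      + "</xsl:for-each>"
      + "</quotes>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Stand-in for the tidy step (assumption): the input here must already
    // be well-formed. In the real pipeline JTidy would build this Document
    // from the ill-formed HTML.
    static Document parseWellFormed(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xml)));
    }

    // Pass the in-memory DOM tree straight to the XSLT engine via DOMSource
    // and collect the transformation result as a string.
    static String transform(Document doc, String stylesheet) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample standing in for a tidied-up stock quote page.
        String tidied = "<html><body><table>"
            + "<tr><td>ABC</td><td>12.50</td></tr>"
            + "<tr><td>XYZ</td><td>7.25</td></tr>"
            + "</table></body></html>";
        System.out.println(transform(parseWellFormed(tidied), STYLESHEET));
    }
}
```

Using DOMSource means the tidied tree never has to be serialized and
re-parsed before the transformation runs.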

In my about-to-be-released book "Building Oracle XML Applications"
from O'Reilly, I use the JTidy library to show readers how to take
ill-formed HTML from dynamic web pages, such as stock quote services
or other online sources of information, and use XSLT to "scrape"
interesting data out of the "tidied-up" XML result.

______________________________________________________________
Steve Muench, Lead XML Evangelist & Consulting Product Manager
BC4J & XSQL Servlet Development Teams, Oracle Rep to XSL WG
Author "Building Oracle XML Applications", O'Reilly
http://www.oreilly.com/catalog/orxmlapp/


| Does XSLT have the facilities to directly read in the poorly formed HTML?
| And if so, what needs to be done?
| 
| Or,
| 
| Will designing a custom parser that builds a DOM from the poorly formed HTML,
| to then be output to an XML file or directly processed by an XSLT document,
| be the best solution?
| 
| I've already begun developing the latter (custom) solution, but thought I'd
| double check to see if there are any HTML -> XHTML converters available.
| 
| Thanks in advance for your help,
| 
| Joe Fourness
| 
| 
|  XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
| 


