Re: [xml-dev] Parsing bad HTML

To: Paul M <pjmaip@-----.--->
Date: 11/13/2008 8:50:00 PM

For parsing purposes, I have written a Java function that converts bad HTML into well-formed XML (the resulting XML is not XHTML). It can easily be modified to correct < characters between valid HTML tags.

This is an open-source project: http://sourceforge.net/projects/light-html2xml

Alain COUTHURES
<agenceXML>
Bordeaux, France
Browser-side XForms without plug-in: http://www.agencexml.com/xsltforms

Paul M wrote:
> I use tidy to clean up bad HTML docs. It does a pretty good job of
> converting HTML => strict XHTML.
>
> However, the following is a bit too much:
>
> <p>
> <sub>123</sub>4567<eight<img src="file.gif" alt="<b>hello</b>">
> </p>
>
> The problem is with 7<eight. Stray < and > seem to make tidy choke.
> What is the best method of handling this? I am leaning toward Perl and
> regexp, but am hoping to avoid this. Maybe a Java solution? Any tidy
> solutions?
>
> -thanks
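
As a rough illustration of the kind of correction discussed in this thread (escaping a < that does not open a real tag), here is a minimal Java sketch. It is not the code from light-html2xml. The class name StrayLtEscaper, the method escapeStrayLt, and the tag pattern are all assumptions made for this example. It treats anything shaped like <name attr="value" ...> as a tag and escapes every other < or > in the text. Comments, doctype declarations, CDATA sections, and script content are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrayLtEscaper {
    // Hypothetical pattern for "something shaped like a tag": optional '/',
    // a name, zero or more attributes (quoted values may contain < and >),
    // optional '/', then '>'.
    private static final Pattern TAG = Pattern.compile(
        "</?[A-Za-z][A-Za-z0-9]*"
        + "(?:\\s+[^\\s\"'<>=/]+(?:\\s*=\\s*(?:\"[^\"]*\"|'[^']*'|[^\\s\"'<>]+))?)*"
        + "\\s*/?>");

    // Copies tag-shaped runs verbatim and escapes any other '<' or '>'.
    public static String escapeStrayLt(String html) {
        StringBuilder out = new StringBuilder(html.length() + 16);
        Matcher m = TAG.matcher(html);
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                m.region(i, html.length());
                if (m.lookingAt()) {
                    // A complete tag starts here: keep it unchanged.
                    out.append(html, i, m.end());
                    i = m.end();
                    continue;
                }
                out.append("&lt;"); // stray '<' in text
            } else if (c == '>') {
                out.append("&gt;"); // stray '>' in text
            } else {
                out.append(c);
            }
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bad = "<p><sub>123</sub>4567<eight"
            + "<img src=\"file.gif\" alt=\"<b>hello</b>\"></p>";
        System.out.println(escapeStrayLt(bad));
        // prints: <p><sub>123</sub>4567&lt;eight<img src="file.gif" alt="<b>hello</b>"></p>
    }
}
```

On the example from the question, only the < before "eight" is escaped, because "<eight" is followed directly by another < and never closes as a tag. The < and > inside the quoted alt value are left untouched, so the output would still need attribute-value escaping before a strict XML parser accepts it. Text such as a<b>c is also ambiguous: this sketch would treat <b> as a tag.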
