Altova Mailing List Archives


Re: [xml-dev] [OT] bugs in JDK regex engine ?

From: Amelia A Lewis <amyzing@--------.--->
To: xml-dev@-----.---.---
Date: 2/4/2008 4:18:00 AM
On 2008-02-03 23:26:58 -0500 "Mukul Gandhi" <gandhi.mukul@g...> 
wrote:
> String str = "<root><abc x='1'>text1</abc><pqr y='1'>text2</pqr></root>";
> 
> Pattern pattern = Pattern.compile("<[^/]+>");  // anything from '<' to '>', and not having '/'
> Matcher matcher = pattern.matcher(str);
> 
> while (matcher.find()) {
>    String group = matcher.group();
>    System.out.println(group);
> }
> 
> 'str' is a String representation of an XML fragment.
> 
> I want to extract all pieces from the string (the tokens), which form
> a start tag (including attribute parts).
> 
> I am expecting output:
> <root>
> <abc x='1'>
> <pqr y='1'>

But that's not what you asked for.  You said "longest string starting 
with '<' and ending with '>' that doesn't contain '/'."

> But the output produced by the above program is:
> <root><abc x='1'>
> <pqr y='1'>

Yup.  Exactly matches the regex.  No / in either one, is there?  
Specifically, even though you think you asked for "just the start 
tag," you have <abc> nested inside <root>; there's no / anywhere 
around to prevent the regex from matching to the end of <abc>.
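
To see the greedy match in action, here is a minimal sketch that reruns the quoted program with the matching loop pulled into a helper. The class name GreedyTagDemo and the helper findAll are made up for illustration; the input string and the regex are the ones from the quoted message. The [^/]+ part grabs everything up to the first '/', and the engine then backs up to the last '>' it passed, which is the one closing <abc x='1'>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyTagDemo {
    // Hypothetical helper: collects every match of the regex, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String str = "<root><abc x='1'>text1</abc><pqr y='1'>text2</pqr></root>";
        // The first match runs from '<' of <root> to the last '>' before
        // the first '/', so it swallows the nested <abc ...> tag as well.
        for (String match : findAll("<[^/]+>", str)) {
            System.out.println(match);
        }
        // prints:
        // <root><abc x='1'>
        // <pqr y='1'>
    }
}
```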

The problem with using regular expressions to parse any grammar with 
paired tokens (XML, for example, but also most programming languages 
with paired braces of any sort, or comments in a language that permits 
comment nesting) is that regular expressions can't keep count: they 
have no way to track how deeply the paired tokens are nested.

You need something more powerful than regex.
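
One candidate for "something more powerful" is an actual XML parser. What follows is a minimal sketch, assuming the StAX API bundled with the JDK (javax.xml.stream) is an acceptable choice; the class name StartTagLister and the method startTags are invented for illustration. Note that it rebuilds each start tag from the parser's events (always with single-quoted attribute values) rather than copying the original characters, so it is a way to list start tags, not a drop-in replacement for the regex.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StartTagLister {
    // Hypothetical helper: returns one rebuilt start tag per element.
    static List<String> startTags(String xml) {
        List<String> tags = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // Only START_ELEMENT events are start tags; text,
                    // CDATA, comments and PIs arrive as other events.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        StringBuilder tag = new StringBuilder("<");
                        tag.append(reader.getLocalName());
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            tag.append(' ')
                               .append(reader.getAttributeLocalName(i))
                               .append("='")
                               .append(reader.getAttributeValue(i))
                               .append('\'');
                        }
                        tags.add(tag.append('>').toString());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return tags;
    }

    public static void main(String[] args) {
        String str = "<root><abc x='1'>text1</abc><pqr y='1'>text2</pqr></root>";
        for (String tag : startTags(str)) {
            System.out.println(tag);
        }
    }
}
```

For the string from the quoted message this should list the three start tags the original poster was expecting.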

If you're determined to find the next layer of problems associated 
with using a too-weak tool to do the job, you should find it shortly 
after making this change:

Pattern.compile("<[^/<]+>");

That prevents it from picking up a nested element tag.  Most of the 
time.
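
A small sketch of both halves of that claim: the narrowed pattern does give the three expected tags for the original string, and it still goes wrong on perfectly legal XML. The class name NarrowedTagDemo, the helper findAll, and the two extra inputs are made up for illustration; they are not from the original thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NarrowedTagDemo {
    static final String NARROWED = "<[^/<]+>";

    // Hypothetical helper: collects every match of the regex, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // The original input: excluding '<' stops each match at the next tag.
        String str = "<root><abc x='1'>text1</abc><pqr y='1'>text2</pqr></root>";
        System.out.println(findAll(NARROWED, str));
        // prints: [<root>, <abc x='1'>, <pqr y='1'>]

        // A literal '>' in text content is legal XML, and the greedy
        // match runs on to it.
        System.out.println(findAll(NARROWED, "<a>1 > 0</a>"));
        // prints: [<a>1 >]

        // A '/' inside an attribute value makes the start tag invisible.
        System.out.println(findAll(NARROWED, "<a href='x/y'>t</a>"));
        // prints: []
    }
}
```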

For giggles:

<root><?my-pi wotsit ?><abc x='1'><![CDATA[<?xml version="1.0?>
<root><abc x='1'>text1]]></abc>
</root>
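
For anyone who wants to try it, here is a minimal sketch that feeds the document above to the narrowed pattern. The class name TrickyInputDemo and the helper findAll are made up for illustration. The document has only two real start tags (<root> and <abc x='1'>); everything else that looks like one is a processing instruction or text inside the CDATA section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrickyInputDemo {
    // The document above, with the line breaks written as \n.
    static final String TRICKY =
            "<root><?my-pi wotsit ?><abc x='1'><![CDATA[<?xml version=\"1.0?>\n"
            + "<root><abc x='1'>text1]]></abc>\n"
            + "</root>";

    // Hypothetical helper: collects every match of the regex, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        for (String match : findAll("<[^/<]+>", TRICKY)) {
            System.out.println(match);
        }
        // prints six "start tags" where the document has two:
        // <root>
        // <?my-pi wotsit ?>
        // <abc x='1'>
        // <?xml version="1.0?>
        // <root>
        // <abc x='1'>text1]]>
    }
}
```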

HTH.

Amy!
-- 
Amelia A. Lewis                    amyzing {at} talsever.com
Confidence: a feeling peculiar to the stage just before full
comprehension of the problem.

