Altova Mailing List Archives


Re: [xsl] How to split an RegEx into several lines for readability?

From: Abel Braaksma <abel.online@--------->
To:
Date: 5/1/2007 5:00:00 PM
Dimitre Novatchev wrote:
> As I am an absolute RegEx beginner, please excuse me if this is a
> trivial question.

A good thing to know about regexes is that, besides being powerful, they 
can also be very dangerous, especially to the unaware: backtracking can 
cause a regex to take exponential time on non-matching strings. An 
example of such a regex is in this post: 
http://www.nabble.com/Certain-non-zero-length-non-matching-regexes-run-forever-on-Saxon-tf3065127.html#a8524868



If you are going to use regexes in a production environment, make sure 
to test them thoroughly for this behavior, or your processor may 
occasionally hang.
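
As a rough illustration of the kind of pattern to test for, here is a minimal sketch. The regex '^(a+)+$' is a textbook example with nested quantifiers, not the one from the linked post, and the parameter name "n" and the template name "main" are my own choices. Whether and how quickly the run time grows depends on your processor's regex engine, so start with a small n and raise it gradually.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <!-- Hypothetical parameter: the number of 'a' characters in the input.
       Keep it small at first and increase it step by step. -->
  <xsl:param name="n" as="xs:integer" select="10"/>

  <xsl:template name="main">
    <!-- n times 'a' followed by a single 'b', so the anchored regex
         below can never match this string -->
    <xsl:variable name="input" as="xs:string"
        select="concat(string-join(for $i in 1 to $n return 'a', ''), 'b')"/>

    <!-- Nested quantifiers: (a+)+ can divide a run of 'a's in many
         different ways, and a backtracking engine may try them all
         before giving up -->
    <xsl:value-of select="matches($input, '^(a+)+$')"/>
  </xsl:template>

</xsl:stylesheet>
```

The result is always false; the interesting part is how long it takes to get there as n grows.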





> Is there any way I can split this RegEx on separate lines and/or add
> whitespace so that it would be more readable?

You have already heard of the 'x' modifier, but there are a few things 
you should know before splitting your regex into a more readable format:



 * If you use Saxon, several bugs concerning whitespace handling have 
been fixed in the 8.8 and 8.9 releases, some of which you may consider 
significant, like this one, which is now fixed: 
http://www.nabble.com/Bug%3A-whitespace-at-beginning-of-regex-fails-the-regex-when-in-%27x%27-%28ignore-whitespace%29-mode-tf2870226.html#a8022584



 * "Ignore whitespace" is meant very literally. I.e., in XSLT regexes, 
this:  fn:matches("hello world", "hello\ sworld", "x") returns true. The 
"\ s" part in the regex is, with whitespace removed, "\s" and matches a 
space. Most other regex engines (Perl for one) treat an escaped space as 
a literal space, even in this mode.



  * The only place where you must be aware of whitespace with 'x' on is 
inside character classes, where it is not ignored: [abc ] matches 'a', 
'b', 'c' or ' '.



  * You probably don't want to do this, but this is allowed with the 
'x' modifier: "\p{ I s B a s i c L a t i n }+" and is the same as 
"\p{IsBasicLatin}+".
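
The points above can be checked with a small test stylesheet along these lines. This is only a sketch: the template name "main" and the sample strings other than the ones quoted above are my own, and older Saxon builds with the whitespace bugs mentioned earlier may answer differently.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template name="main">
    <!-- "\ s" collapses to "\s" once the whitespace is stripped -->
    <xsl:value-of select="matches('hello world', 'hello\ sworld', 'x')"/>
    <xsl:text>&#10;</xsl:text>

    <!-- whitespace inside a character class is kept, so the class
         [abc ] still matches a single space -->
    <xsl:value-of select="matches(' ', '^[abc ]$', 'x')"/>
    <xsl:text>&#10;</xsl:text>

    <!-- whitespace outside a class is stripped: the regex becomes
         '^ab$', which does not match 'a b' -->
    <xsl:value-of select="matches('a b', '^a b$', 'x')"/>
    <xsl:text>&#10;</xsl:text>

    <!-- spaces inside \p{...} are stripped as well -->
    <xsl:value-of select="matches('abc', '^\p{ I s B a s i c L a t i n }+$', 'x')"/>
  </xsl:template>

</xsl:stylesheet>
```

On a processor that behaves as described above, the four lines should read true, true, false, true.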




And a tip for making your regexes more readable: introduce comments 
inside your regexes. Other regex dialects let you do that within the 
regex syntax itself, but XSLT regexes do not. You can easily work around 
this by putting your regexes inside a variable, annotating them with XML 
comments, and always calling them with the 'x' modifier:



<xsl:variable name="myregex" as="xs:string">
   (          <!-- grab everything -->
   "          <!-- start of a q. string -->
   [^"]*      <!-- zero or more non-quotes -->
   "          <!-- end of a q. string -->
   )          <!-- closing 'grab all' -->
</xsl:variable>
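
To show how such a variable might be called, here is a sketch of a complete stylesheet. The template name "main" and the sample input string are made up for the example; the variable is the one from above, declared globally here.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <!-- The XML comments are dropped when the stylesheet is read, so only
       the regex text and its whitespace end up in the string -->
  <xsl:variable name="myregex" as="xs:string">
     (          <!-- grab everything -->
     "          <!-- start of a q. string -->
     [^"]*      <!-- zero or more non-quotes -->
     "          <!-- end of a q. string -->
     )          <!-- closing 'grab all' -->
  </xsl:variable>

  <xsl:template name="main">
    <!-- flags="x" strips the remaining whitespace from the regex;
         the sample input is hypothetical -->
    <xsl:analyze-string
        select="'say &quot;hi&quot; and &quot;bye&quot;'"
        regex="{$myregex}" flags="x">
      <xsl:matching-substring>
        <!-- group 1 is the quoted string, including its quotes -->
        <xsl:value-of select="regex-group(1)"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

The same variable can be passed to the functions, e.g. matches($input, $myregex, 'x'). Note that the variable's value starts with whitespace, which is exactly the case covered by the Saxon bug linked above, so this needs a release that has the fix.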


I use this method to some extent in a format that allows recursive and 
repetitive regexes on input by just supplying a 'parser' written in XSLT 
with a set of regexes placed in XML that are then applied to the input. 
If you have many regexes, you will find that they are easier to maintain 
when you collect them in a library and reuse them.



Cheers,
-- Abel Braaksma

