Regular expressions


MapForce can use regular expressions in the pattern parameter of the match-pattern and tokenize-regexp functions to find specific strings. Some regular expression functionality is also available when you need to filter the nodes to which a node function or default should apply; see Applying Node Functions and Defaults Conditionally.

 

The regular expression syntax and semantics for XSLT and XQuery are identical to those defined in https://www.w3.org/TR/xmlschema-2/. Please note that there are slight differences in regular expression syntax between the various programming languages.

 

Terminology

input        the string that the regex works on
pattern      the regular expression
flags        optional parameter that defines how the regular expression is to be interpreted
result       the result of the function


Tokenize-regexp returns a sequence of strings. The connection to the Rows item creates one row per item in the sequence.
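
The splitting behaviour can be sketched outside MapForce. The snippet below is a minimal illustration in Python, assuming that the pattern acts as the separator between tokens; the helper name tokenize_regexp is hypothetical, and Python's re module is not the engine MapForce uses, so details may differ.

```python
import re

def tokenize_regexp(input_text, pattern):
    # Hypothetical stand-in for tokenize-regexp: split the input
    # wherever the pattern matches and return the pieces in order.
    return re.split(pattern, input_text)

# Separator: a comma with optional surrounding whitespace.
rows = tokenize_regexp("a , b,c ,  d", r"\s*,\s*")

# Each item of the sequence would become one row.
for row in rows:
    print(row)
```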

 

Regex syntax

Literals

A single literal character is the most basic regex. For example, the pattern "a" matches the first occurrence of the character "a" in the string.

 

Character classes []

This is a set of characters enclosed in square brackets.

 

Exactly one of the characters in the square brackets is matched.

 

pattern

[aeiou]

Matches a lowercase vowel.

 

pattern

[mj]ust

Matches must or just

 

Please note that the pattern is case sensitive: a lowercase "a" does not match an uppercase "A".
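
As a rough illustration of the two patterns above, the following Python sketch tests them against a few sample strings. Python's re module is an assumption made for demonstration purposes only; the sample strings are invented.

```python
import re

vowel = re.compile(r"[aeiou]")
must_or_just = re.compile(r"[mj]ust")

# Every lowercase vowel found in the sample word.
vowels_found = vowel.findall("Regular")

# fullmatch requires the whole string to match the pattern.
accepted = [w for w in ["must", "just", "Must", "dust"]
            if must_or_just.fullmatch(w)]

print(vowels_found)
print(accepted)
```

"Must" is rejected because the class contains only the lowercase letters m and j.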

 

 

Character ranges [a-z]

Creates a range between the two characters. Only one of the characters will be matched at one time.

 

pattern

[a-z]

Matches any single lowercase letter from a to z.

 

 

Negated classes [^]

Using the caret as the first character after the opening bracket negates the character class.

 

pattern

[^a-z]

Matches any character not in the character class, including newlines.
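
The difference between a range and its negation can be sketched as follows. This is an illustration in Python's re module, which is assumed here only for demonstration; the sample string is invented.

```python
import re

sample = "ab1\nC"

# Characters inside the range a-z.
in_range = re.findall(r"[a-z]", sample)

# Characters outside the range, which includes the digit,
# the newline and the uppercase letter.
out_of_range = re.findall(r"[^a-z]", sample)

print(in_range)
print(out_of_range)
```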

 

Meta characters "."

The dot meta character matches any single character (except for newline).

 

pattern

.

Matches any single character.
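
A short sketch of the dot in use, written in Python as an assumed stand-in for the regex engine. The pattern c.t and the sample text are invented for this example.

```python
import re

# "c", then any single character except a newline, then "t".
text = "cat cot c\nt cut"
dot_matches = re.findall(r"c.t", text)

print(dot_matches)
```

The sequence "c", newline, "t" is skipped because the dot does not match a newline by default.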

 

Quantifiers ? + * {}

Quantifiers define how often a regex component must repeat within the input string, for a match to occur.

 


?        zero or one             the preceding string/chunk is optional
+        one or more             the preceding string/chunk may match one or more times
*        zero or more            the preceding string/chunk may match zero or more times
{}       min/max repetitions     the number of repetitions the preceding string/chunk has to match



e.g. mo{1,3} matches mo, moo, mooo.
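
The four quantifiers can be compared side by side. The sketch below uses Python's re module as an assumed stand-in; the pattern colou?r and the candidate strings are invented for illustration.

```python
import re

candidates = ["m", "mo", "moo", "mooo", "moooo"]

def accepted_by(pattern, strings):
    # Keep only the strings that the pattern matches in full.
    return [s for s in strings if re.fullmatch(pattern, s)]

bounded = accepted_by(r"mo{1,3}", candidates)   # one to three "o"
one_or_more = accepted_by(r"mo+", candidates)   # at least one "o"
zero_or_more = accepted_by(r"mo*", candidates)  # any number of "o"
optional = accepted_by(r"colou?r", ["color", "colour", "colouur"])

print(bounded, one_or_more, zero_or_more, optional)
```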

 

 

()        subpatterns           parentheses are used to group parts of a regex together.

|         alternation/or        allows the testing of subexpressions from left to right.

(horse|make) sense matches "horse sense" or "make sense".
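
The effect of the parentheses is easiest to see by comparing the grouped pattern with an ungrouped one. The following is a sketch in Python's re module, assumed here for illustration only; the ungrouped pattern and the input "common sense" are not from the original text.

```python
import re

grouped = re.compile(r"(horse|make) sense")
# Without parentheses the alternatives are "horse" and "make sense".
ungrouped = re.compile(r"horse|make sense")

inputs = ["horse sense", "make sense", "common sense", "horse"]

grouped_hits = [s for s in inputs if grouped.fullmatch(s)]
ungrouped_hits = [s for s in inputs if ungrouped.fullmatch(s)]

print(grouped_hits)
print(ungrouped_hits)
```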

 

Flags

These are optional parameters that define how the regular expression is to be interpreted. Each option is switched on by including its letter in the flags value. Letters may be in any order and can be repeated.

 

s

If present, the matching process will operate in the "dot-all" mode.

 

In this mode the meta character "." matches any character whatsoever, including newline. If the input string contains "hello" and "world" on two different lines, the regular expression "hello.*world" will only match if the s flag is set.
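
The dot-all behaviour can be sketched with the pattern hello.*world. In the Python illustration below, re.DOTALL is assumed to play the role of the s flag; Python is used only as a stand-in and is not the engine MapForce uses.

```python
import re

text = "hello\nworld"

# Default mode: the dot stops at the newline, so there is no match.
without_s = re.search(r"hello.*world", text)

# Dot-all mode: the dot also matches the newline.
with_s = re.search(r"hello.*world", text, flags=re.DOTALL)

print(without_s is None)
print(with_s is not None)
```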

 

m

If present, the matching process operates in multi-line mode.

 

In multi-line mode the caret ^ matches at the start of any line, i.e. at the start of the entire string and at the position immediately after a newline character.

The dollar character $ matches at the end of any line, i.e. at the end of the entire string and at the position immediately before a newline character.

 

Newline is the character #x0A.
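
A minimal sketch of multi-line mode, assuming Python's re.MULTILINE corresponds to the m flag. The patterns and the sample text are invented for illustration.

```python
import re

text = "first line\nsecond line"

# Default mode: ^ and $ anchor only at the start and end of the string.
starts_default = re.findall(r"^\w+", text)
ends_default = re.findall(r"\w+$", text)

# Multi-line mode: ^ and $ also anchor around each newline.
starts_multi = re.findall(r"^\w+", text, flags=re.MULTILINE)
ends_multi = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(starts_default, starts_multi)
print(ends_default, ends_multi)
```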

 

i

If present, the matching process operates in case-insensitive mode.

The regular expression [a-z] plus the i flag would then match all letters a-z and A-Z.
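
The effect on [a-z] can be sketched as follows, assuming Python's re.IGNORECASE corresponds to the i flag. The sample string is invented.

```python
import re

sample = "aBc1"

# Case-sensitive: only the lowercase letters match.
sensitive = re.findall(r"[a-z]", sample)

# Case-insensitive: the uppercase letter matches as well.
insensitive = re.findall(r"[a-z]", sample, flags=re.IGNORECASE)

print(sensitive)
print(insensitive)
```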


x

If present, whitespace characters are removed from the regular expression prior to the matching process. Whitespace characters are #x09, #x0A, #x0D and #x20.

 

Exception: Whitespace characters within character class expressions are not removed, e.g. [#x20].
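
The whitespace rule and its exception can be sketched in Python, assuming re.VERBOSE behaves like the x flag for these cases. This is only an approximation: Python's verbose mode additionally treats an unescaped # as the start of a comment. The patterns are invented for illustration.

```python
import re

# Spaces between pattern elements are ignored in verbose mode.
spaced = re.compile(r"[mj] u s t", flags=re.VERBOSE)

# A space inside a character class is kept and must be matched.
class_space = re.compile(r"a[ ]b", flags=re.VERBOSE)

spaced_ok = spaced.fullmatch("must") is not None
class_needs_space = class_space.fullmatch("a b") is not None
class_rejects_ab = class_space.fullmatch("ab") is None

print(spaced_ok, class_needs_space, class_rejects_ab)
```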

 

Note: When generating code, the advanced features of the regex syntax might differ slightly between the various languages. Please see the specific regex documentation for your language.

© 2019 Altova GmbH