XPath/XQuery tokenize function

Summary

Returns a sequence of strings constructed by splitting the input wherever a separator is found; the separator is any substring that matches a given regular expression.

Signatures

fn:tokenize(
  $input as xs:string?
) as xs:string*

fn:tokenize(
  $input as xs:string?,
  $pattern as xs:string
) as xs:string*

fn:tokenize(
  $input as xs:string?,
  $pattern as xs:string,
  $flags as xs:string
) as xs:string*

Properties

This function is deterministic, context-independent, and focus-independent.

Rules

The one-argument form of this function splits the supplied string at whitespace boundaries. More specifically, calling fn:tokenize($input) is equivalent to calling fn:tokenize(fn:normalize-space($input), ' '), where the second argument is a single space character (x20).
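
The equivalence above can be modelled outside XPath. The following is a minimal Python sketch, not part of the specification: the names normalize_space and tokenize_one_arg are hypothetical, None stands in for the empty sequence, and whitespace is assumed to be exactly the four characters x20, x09, x0A and x0D.

```python
import re

# Whitespace as assumed for this sketch: space, tab, newline, carriage return.
_XML_WS = re.compile(r"[\x20\x09\x0A\x0D]+")

def normalize_space(value):
    """Rough model of fn:normalize-space: trim, then collapse whitespace runs."""
    if value is None:  # None stands in for the empty sequence
        return ""
    return " ".join(part for part in _XML_WS.split(value) if part)

def tokenize_one_arg(value):
    """Rough model of fn:tokenize($input): normalize, then split on one space."""
    normalized = normalize_space(value)
    if normalized == "":
        return []  # zero-length input yields the empty sequence
    return normalized.split(" ")

print(tokenize_one_arg(" red green blue "))  # ['red', 'green', 'blue']
```

Because normalization removes leading and trailing whitespace before the split, this model never produces zero-length tokens.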

Calling the two-argument form of this function (omitting the $flags argument) has the same effect as calling the three-argument form with the $flags argument set to a zero-length string. Flags are defined in .

The following rules apply to the three-argument form of the function:

  • The $flags argument is interpreted in the same way as for the fn:matches function.

  • If $input is the empty sequence, or if $input is the zero-length string, the function returns the empty sequence.

  • The function breaks the $input string into a sequence of strings, treating any substring that matches $pattern as a separator, and returns that sequence. The separators themselves are not returned.

  • Except with the one-argument form of the function, if a separator occurs at the start of the $input string, the result sequence will start with a zero-length string. Similarly, zero-length strings will also occur in the result sequence if a separator occurs at the end of the $input string, or if two adjacent substrings match the supplied $pattern.

  • If two alternatives within the supplied $pattern both match at the same position in the $input string, then the first alternative is the one chosen. For example:

    fn:tokenize("abracadabra", "(ab)|(a)") returns ("", "r", "c", "d", "r", "")
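
The rules for the three-argument form can be approximated with the following Python sketch. It is an illustration under stated assumptions, not a conforming implementation: tokenize_model is a hypothetical name, Python's re syntax stands in for the XPath regular expression dialect (the two differ in places), only the flags i, s and m are handled, and the order in which the pattern check and the empty-input check are made is a choice of the sketch.

```python
import re

# Assumed mapping from XPath-style flags to Python `re` flags (i, s, m only).
_FLAG_MAP = {"i": re.IGNORECASE, "s": re.DOTALL, "m": re.MULTILINE}

def tokenize_model(value, pattern, flags=""):
    """Approximate fn:tokenize($input, $pattern, $flags) using Python's re."""
    re_flags = 0
    for flag in flags:
        if flag not in _FLAG_MAP:
            raise ValueError(f"flag {flag!r} is not handled by this sketch")
        re_flags |= _FLAG_MAP[flag]
    compiled = re.compile(pattern, re_flags)

    # A pattern that matches the zero-length string is an error.
    if compiled.search("") is not None:
        raise ValueError("pattern matches a zero-length string")

    # None stands in for the empty sequence.
    if value is None or value == "":
        return []

    tokens = []
    position = 0
    for match in compiled.finditer(value):
        # Text between the previous separator and this one (may be empty).
        tokens.append(value[position:match.start()])
        position = match.end()
    tokens.append(value[position:])  # text after the last separator
    return tokens

print(tokenize_model("abracadabra", "(ab)|(a)"))
# ['', 'r', 'c', 'd', 'r', '']
```

The loop uses finditer rather than re.split because re.split would also return the text captured by the parenthesized groups, whereas separators must not appear in the result.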

Examples

The expression fn:tokenize(" red green blue ") returns ("red", "green", "blue").

The expression fn:tokenize("The cat sat on the mat", "\s+") returns ("The", "cat", "sat", "on", "the", "mat").

The expression fn:tokenize(" red green blue ", "\s+") returns ("", "red", "green", "blue", "").

The expression fn:tokenize("1, 15, 24, 50", ",\s*") returns ("1", "15", "24", "50").

The expression fn:tokenize("1,15,,24,50,", ",") returns ("1", "15", "", "24", "50", "").

The expression fn:tokenize("abba", ".?") raises a dynamic error, because the pattern .? matches a zero-length string.

The expression fn:tokenize("Some unparsed <br> HTML <BR> text", "\s*<br>\s*", "i") returns ("Some unparsed", "HTML", "text").

Error Conditions

A dynamic error is raised if the value of $pattern is invalid according to the rules described in section .

A dynamic error is raised if the value of $flags is invalid according to the rules described in section .

A dynamic error is raised if the supplied $pattern matches a zero-length string, that is, if fn:matches("", $pattern, $flags) returns true.
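
The zero-length condition can be checked ahead of time. The sketch below is a Python approximation of the test fn:matches("", $pattern, $flags) with no flags set; matches_zero_length is a hypothetical helper, and Python's re is assumed to agree with the XPath dialect for these simple patterns.

```python
import re

def matches_zero_length(pattern):
    """True if the pattern can match the zero-length string (Python re semantics)."""
    return re.compile(pattern).search("") is not None

# Report which candidate separator patterns would be rejected.
for candidate in [",", r",\s*", ".?", "a*"]:
    status = "rejected" if matches_zero_length(candidate) else "accepted"
    print(repr(candidate), status)
```

Patterns built only from optional parts, such as .? or a*, are rejected, while a pattern with at least one mandatory character, such as ,\s*, is accepted.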

Notes

If the input string is not zero length, and no separators are found in the input string, the result of the function is a single string identical to the input string.

The one-argument form of the function has a similar effect to the two-argument form with \s+ as the separator pattern, except that the one-argument form strips leading and trailing whitespace, whereas the two-argument form delivers an extra zero-length token if leading or trailing whitespace is present.

The function returns no information about the separators that were found in the string. If this information is required, the fn:analyze-string function can be used instead.
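
To illustrate what retaining the separators looks like, the Python sketch below returns both the matched and the unmatched segments. It does not reproduce the result format of fn:analyze-string; split_with_separators is a hypothetical name, empty unmatched segments are skipped as a choice of the sketch, and the pattern is assumed not to match a zero-length string.

```python
import re

def split_with_separators(value, pattern):
    """Return (kind, text) pairs covering the whole input, in order."""
    compiled = re.compile(pattern)
    segments = []
    position = 0
    for match in compiled.finditer(value):
        if match.start() > position:
            # Text between separators; empty stretches are skipped here.
            segments.append(("non-match", value[position:match.start()]))
        segments.append(("match", match.group(0)))
        position = match.end()
    if position < len(value):
        segments.append(("non-match", value[position:]))
    return segments

for kind, text in split_with_separators("1, 15,24", r",\s*"):
    print(kind, repr(text))
```

Concatenating the text of all segments reproduces the original input, which is the information a plain list of tokens cannot provide.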

The separator used by the one-argument form of the function is any sequence of tab (x09), newline (x0A), carriage return (x0D) or space (x20) characters. This is the same as the separator recognized by list-valued attributes as defined in XSD. It is not the same as the separator recognized by list-valued attributes in HTML5, which also treats form-feed (x0C) as whitespace. If it is necessary to treat form-feed as a separator, an explicit separator pattern should be used.
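
The difference between the two separator sets can be shown with a short Python sketch. The names XSD_SEPARATOR, HTML5_SEPARATOR and split_list are hypothetical, and split_list models list parsing that ignores leading and trailing separators; fn:tokenize with an explicit pattern would instead deliver zero-length tokens at the edges.

```python
import re

# Separator recognized by the one-argument form: tab, newline, CR, space.
XSD_SEPARATOR = r"[\x20\x09\x0A\x0D]+"
# Assumed explicit pattern that additionally treats form-feed (x0C) as a separator.
HTML5_SEPARATOR = r"[\x20\x09\x0A\x0C\x0D]+"

def split_list(value, separator):
    """Split a list-valued string, dropping empty tokens at the edges."""
    return [token for token in re.split(separator, value) if token]

sample = "nav\x0cmenu item"
print(split_list(sample, XSD_SEPARATOR))    # ['nav\x0cmenu', 'item']
print(split_list(sample, HTML5_SEPARATOR))  # ['nav', 'menu', 'item']
```

The explicit character classes are deliberate: Python's \s also matches form-feed and other characters, so it would not model the four-character separator described above.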