XPath/XQuery normalize-unicode function

Summary

Returns the value of $arg after applying Unicode normalization.

Signatures

fn:normalize-unicode(
    $arg as xs:string?
) as xs:string

fn:normalize-unicode(
    $arg as xs:string?,
    $normalizationForm as xs:string
) as xs:string

Properties

This function is deterministic, context-independent, and focus-independent.

Rules

If the value of $arg is the empty sequence, the function returns the zero-length string.

If the single-argument version of the function is used, the result is the same as calling the two-argument version with $normalizationForm set to the string "NFC".

Otherwise, the function returns the value of $arg normalized according to the rules of the normalization form identified by the value of $normalizationForm.

The effective value of $normalizationForm is the value of the expression fn:upper-case(fn:normalize-space($normalizationForm)).

  • If the effective value of $normalizationForm is NFC, then the function returns the value of $arg converted to Unicode Normalization Form C (NFC).

  • If the effective value of $normalizationForm is NFD, then the function returns the value of $arg converted to Unicode Normalization Form D (NFD).

  • If the effective value of $normalizationForm is NFKC, then the function returns the value of $arg converted to Unicode Normalization Form KC (NFKC).

  • If the effective value of $normalizationForm is NFKD, then the function returns the value of $arg converted to Unicode Normalization Form KD (NFKD).

  • If the effective value of $normalizationForm is FULLY-NORMALIZED, then the function returns the value of $arg converted to fully normalized form.

  • If the effective value of $normalizationForm is the zero-length string, no normalization is performed and $arg is returned.
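The rules above can be sketched in Python as a rough emulation, not as any implementation's actual code. The names normalize_unicode and effective_form are hypothetical. The sketch assumes that Python's unicodedata.normalize is an acceptable stand-in for the four standard forms, that str.upper() approximates fn:upper-case, and that None models the empty sequence. This illustrative implementation does not support FULLY-NORMALIZED, so it reports that form as unsupported.

```python
import re
import unicodedata
from typing import Optional

# Forms delegated to Python's unicodedata module (an assumption of this sketch).
_UNICODEDATA_FORMS = {"NFC", "NFD", "NFKC", "NFKD"}


def effective_form(normalization_form: str) -> str:
    """Approximate fn:upper-case(fn:normalize-space($normalizationForm))."""
    # normalize-space: collapse runs of XML whitespace (space, tab, CR, LF)
    # to one space, then drop the leading and trailing space.
    collapsed = re.sub(r"[ \t\r\n]+", " ", normalization_form).strip(" ")
    return collapsed.upper()


def normalize_unicode(arg: Optional[str], normalization_form: str = "NFC") -> str:
    """Hypothetical emulation of fn:normalize-unicode; None models the empty sequence."""
    if arg is None:
        return ""  # empty sequence yields the zero-length string
    form = effective_form(normalization_form)
    if form == "":
        return arg  # zero-length form: no normalization is performed
    if form in _UNICODEDATA_FORMS:
        return unicodedata.normalize(form, arg)
    # Any other form is unsupported by this sketch.
    raise ValueError(f"unsupported normalization form: {form!r}")


if __name__ == "__main__":
    print(normalize_unicode("e\u0301") == "\u00e9")           # default form is NFC
    print(normalize_unicode("\u00e9", " nfd ") == "e\u0301")  # case and spaces are ignored
```

Defaulting the second parameter to "NFC" mirrors the rule that the single-argument call behaves like the two-argument call with $normalizationForm set to "NFC".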

Normalization forms NFC, NFD, NFKC, and NFKD, and the algorithms to be used for converting a string to each of these forms, are defined in .

The motivation for normalization form FULLY-NORMALIZED is explained in . However, as that specification did not progress beyond working draft status, the normative specification is as follows:

  • A string is fully-normalized if (a) it is in normalization form NFC as defined in , and (b) it does not start with a composing character.

  • A composing character is a character that is one or both of the following:

    the second character in the canonical decomposition mapping of some character that is not listed in the Composition Exclusion Table defined in ;

    of non-zero canonical combining class (as defined in ).

  • A string is converted to FULLY-NORMALIZED form as follows:

    if the first character in the string is a composing character, prepend a single space (x20);

    convert the resulting string to normalization form NFC.
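The conversion to FULLY-NORMALIZED form described above can be sketched in Python as follows. This is an approximation under stated assumptions, not a normative algorithm. The function names are hypothetical. The test "not listed in the Composition Exclusion Table" is approximated by checking whether NFC recomposes a two-character canonical decomposition back to the original character. Hangul syllables, whose decomposition is algorithmic and not exposed by unicodedata.decomposition, are not considered.

```python
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=1)
def _second_chars_of_composites() -> frozenset:
    """Collect second characters of two-character canonical decompositions.

    Approximation: a character counts as "not excluded from composition"
    when NFC recomposes its decomposition back to the character itself.
    """
    seconds = set()
    for cp in range(0x110000):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        # Skip characters with no mapping or a compatibility mapping (<tag> ...).
        if not mapping or mapping.startswith("<"):
            continue
        parts = [chr(int(p, 16)) for p in mapping.split()]
        if len(parts) != 2:
            continue  # singleton decompositions have no second character
        if unicodedata.normalize("NFC", "".join(parts)) == ch:
            seconds.add(parts[1])
    return frozenset(seconds)


def is_composing(ch: str) -> bool:
    """True if ch has a non-zero combining class or is a recorded second character."""
    return unicodedata.combining(ch) != 0 or ch in _second_chars_of_composites()


def to_fully_normalized(s: str) -> str:
    """Prepend a space if s starts with a composing character, then apply NFC."""
    if s and is_composing(s[0]):
        s = "\u0020" + s
    return unicodedata.normalize("NFC", s)


def is_fully_normalized(s: str) -> bool:
    """Check conditions (a) NFC and (b) no leading composing character."""
    in_nfc = unicodedata.normalize("NFC", s) == s
    return in_nfc and not (s and is_composing(s[0]))
```

The lookup set is built once and cached because it requires a scan of the whole code point range. A production implementation would more likely ship a precomputed table.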

Conforming implementations must support normalization form "NFC" and may support normalization forms "NFD", "NFKC", "NFKD", and "FULLY-NORMALIZED". They may also support other normalization forms with implementation-defined semantics.

It is implementation-defined which version of Unicode (and therefore, of the normalization algorithms and their underlying data) is supported by the implementation. See for details of the stability policy regarding changes to the normalization rules in future versions of Unicode. If the input string contains codepoints that are unassigned in the relevant version of Unicode, or for which no normalization rules are defined, the fn:normalize-unicode function leaves such codepoints unchanged. If the implementation supports the requested normalization form then it must be able to handle every input string without raising an error.
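As an illustration of the version dependence described above, the following Python sketch reports the Unicode version behind one particular normalization library and checks that a code point unassigned in that version passes through every form unchanged. It assumes that U+0378 is unassigned in the Unicode data bundled with the Python build in use. The helper name is hypothetical.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def unchanged_under_all_forms(text: str) -> bool:
    """Return True if every normalization form leaves text as it is."""
    return all(unicodedata.normalize(form, text) == text for form in FORMS)


if __name__ == "__main__":
    # The Unicode version whose data tables this Python build uses.
    print("Unicode data version:", unicodedata.unidata_version)

    unassigned = "\u0378"  # assumed unassigned (general category Cn)
    print("category:", unicodedata.category(unassigned))
    print("left unchanged:", unchanged_under_all_forms("a" + unassigned + "b"))
```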

Error Conditions

A dynamic error is raised if the effective value of the $normalizationForm argument is not one of the values supported by the implementation.