XPath/XQuery format-integer function

Summary

Formats an integer according to a given picture string, using the conventions of a given natural language if specified.

Signatures

fn:format-integer(
    $value as xs:integer?,
    $picture as xs:string
) as xs:string

fn:format-integer(
    $value as xs:integer?,
    $picture as xs:string,
    $lang as xs:string?
) as xs:string

Properties

The two-argument form of this function is deterministic, context-dependent, and focus-independent. It depends on default-language.
The three-argument form of this function is deterministic, context-independent, and focus-independent.

Rules

If $value is an empty sequence, the function returns a zero-length string.

In all other cases, the $picture argument describes the format in which $value is output.

The rules that follow describe how non-negative numbers are output. If the value of $value is negative, the rules below are applied to the absolute value of $value, and a minus sign is prepended to the result.

The value of $picture consists of a primary format token, optionally followed by a format modifier. The primary format token is always present and must not be zero-length. If the string contains one or more semicolons then everything that precedes the last semicolon is taken as the primary format token and everything that follows is taken as the format modifier; if the string contains no semicolon then the entire picture is taken as the primary format token, and the format modifier is taken to be absent (which is equivalent to supplying a zero-length string).
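The splitting rule can be sketched in Python as follows. This is a minimal illustration, not part of the specification: the function name `split_picture` is hypothetical, and a `ValueError` stands in for the dynamic error.

```python
def split_picture(picture):
    """Split a picture string into (primary format token, format modifier)."""
    # rpartition splits at the LAST semicolon, as the rule requires.
    head, sep, tail = picture.rpartition(";")
    if not sep:
        # No semicolon: the whole picture is the primary format token and
        # the modifier is absent (equivalent to a zero-length string).
        primary, modifier = picture, ""
    else:
        primary, modifier = head, tail
    if primary == "":
        raise ValueError("the primary format token must not be zero-length")
    return primary, modifier
```

Splitting at the last semicolon is what allows a semicolon to be used inside the primary format token, provided the picture ends with a further semicolon.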

The primary format token is classified as one of the following:

  1. A decimal-digit-pattern made up of optional-digit-signs, mandatory-digit-signs, and grouping-separator-signs.

    The optional-digit-sign is the character "#".

    A mandatory-digit-sign is a character in Unicode category Nd. All mandatory-digit-signs within the format token must be from the same digit family, where a digit family is a sequence of ten consecutive characters in Unicode category Nd, having digit values 0 through 9. Within the format token, these digits are interchangeable: a three-digit number may thus be indicated equivalently by 000, 001, or 999.

    A grouping-separator-sign is a non-alphanumeric character, that is, a character whose Unicode category is other than Nd, Nl, No, Lu, Ll, Lt, Lm or Lo.

    If the primary format token contains at least one Unicode digit then it is taken as a decimal-digit-pattern, and in this case it must match the regular expression ^((\p{Nd}|#|[^\p{N}\p{L}])+?)$. If it contains a digit but does not match this pattern, a dynamic error is raised.

    If a semicolon is to be used as a grouping separator, then the primary format token as a whole must be followed by another semicolon, to ensure that the grouping separator is not mistaken as a separator between the primary format token and the format modifier.

    There must be at least one mandatory-digit-sign. There may be zero or more optional-digit-signs, and (if present) these must precede all mandatory-digit-signs. There may be zero or more grouping-separator-signs. A grouping-separator-sign must not appear at the start or end of the decimal-digit-pattern, nor adjacent to another grouping-separator-sign.
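The syntactic constraints on a decimal-digit-pattern can be checked along the following lines. This is a sketch under stated assumptions: the function name `check_decimal_digit_pattern` is hypothetical, a `ValueError` stands in for the dynamic error, and Python's `unicodedata` module (whose Unicode version depends on the Python build) is used to classify characters because the standard `re` module does not support `\p{...}` classes.

```python
import unicodedata

def check_decimal_digit_pattern(token):
    """Raise ValueError if token breaks the decimal-digit-pattern rules."""
    kinds = []            # one of "optional", "mandatory", "separator" per character
    family_zeros = set()  # code point of the zero digit of each family seen
    for ch in token:
        category = unicodedata.category(ch)
        if ch == "#":
            kinds.append("optional")
        elif category == "Nd":
            kinds.append("mandatory")
            # Digits of a family are consecutive, so subtracting the digit
            # value gives the code point of that family's zero.
            family_zeros.add(ord(ch) - unicodedata.decimal(ch))
        elif category[0] not in "NL":
            kinds.append("separator")  # neither a number nor a letter
        else:
            raise ValueError("character %r is not allowed here" % ch)
    if "mandatory" not in kinds:
        # The caller is expected to pass only tokens that contain a digit.
        raise ValueError("no mandatory-digit-sign: not a decimal-digit-pattern")
    if len(family_zeros) > 1:
        raise ValueError("mandatory-digit-signs come from different digit families")
    digit_signs = [k for k in kinds if k != "separator"]
    if "optional" in digit_signs[digit_signs.index("mandatory"):]:
        raise ValueError("optional-digit-signs must precede all mandatory-digit-signs")
    if kinds[0] == "separator" or kinds[-1] == "separator":
        raise ValueError("grouping separator at the start or end of the pattern")
    if any(a == b == "separator" for a, b in zip(kinds, kinds[1:])):
        raise ValueError("adjacent grouping separators")
```

For example, `#,##0` and `0'000` pass this check, whereas `0#`, `,000` and `0,,0` are rejected.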

    The corresponding output format is a decimal number, using this digit family, with at least as many digits as there are mandatory-digit-signs in the format token. Thus, a format token 1 generates the sequence 0 1 2 ... 10 11 12 ..., and a format token 01 (or equivalently, 00 or 99) generates the sequence 00 01 02 ... 09 10 11 12 ... 99 100 101. A format token of ١ (Arabic-Indic digit one) generates the sequence ١ then ٢ then ٣ ...

    The grouping-separator-signs are handled as follows:

    The position of grouping separators within the format token, counting backwards from the last digit, indicates the position of grouping separators to appear within the formatted number, and the character used as the grouping-separator-sign within the format token indicates the character to be used as the corresponding grouping separator in the formatted number.

    More specifically, the position of a grouping separator is the number of optional-digit-signs and mandatory-digit-signs appearing between the grouping separator and the right-hand end of the primary format token.

    Grouping separators are defined to be regular if the following conditions apply:

      • There is at least one grouping separator.

      • Every grouping separator is the same character (call it C).

      • There is a positive integer G (the grouping size) such that:

          • The position of every grouping separator is an integer multiple of G, and

          • Every positive integer multiple of G that is less than the number of optional-digit-signs and mandatory-digit-signs in the primary format token is the position of a grouping separator.

    The grouping separator template is a (possibly infinite) set of (position, character) pairs.

    If grouping separators are regular, then the grouping separator template contains one pair of the form (n×G, C) for every positive integer n where G is the grouping size and C is the grouping character.

    Otherwise (when grouping separators are not regular), the grouping separator template contains one pair of the form (P, C) for every grouping separator found in the primary format token, where C is the grouping separator character and P is its position.

    If there are no grouping separators, then the grouping separator template is an empty set.
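The derivation of the grouping separator template can be sketched as follows. The names `analyse_grouping` and `template_char` are hypothetical, and the token is assumed to be a valid decimal-digit-pattern already. Because the regular template is infinite, it is represented here by the grouping size G rather than by an explicit set of pairs.

```python
import unicodedata

def analyse_grouping(token):
    """Return (positions, grouping_size) for a valid decimal-digit-pattern.

    positions maps each separator position to its character; grouping_size
    is G when the separators are regular, otherwise None.
    """
    positions = {}
    digit_signs = 0
    for ch in reversed(token):
        if ch == "#" or unicodedata.category(ch) == "Nd":
            digit_signs += 1
        else:
            # Position = number of digit signs to the right of the separator.
            positions[digit_signs] = ch
    if not positions or len(set(positions.values())) != 1:
        return positions, None
    # If a grouping size exists it must equal the smallest position.
    g = min(positions)
    required = set(range(g, digit_signs, g))  # multiples of G below the digit count
    return positions, (g if set(positions) == required else None)

def template_char(positions, grouping_size, p):
    """Return the separator the template holds at position p (p >= 1), or None."""
    if grouping_size is not None:
        return next(iter(positions.values())) if p % grouping_size == 0 else None
    return positions.get(p)
```

With the token `#,##0` the separators are regular with G = 3, so the template also holds a comma at positions 6, 9, and so on. With `#,##,##0` the positions are 3 and 5, which are not regular, so only those two pairs are in the template.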

    The number is formatted as follows:

    Let S₁ be the result of formatting the supplied number in decimal notation as if by casting it to xs:string.

    Let S₂ be the result of padding S₁ on the left with as many leading zeroes as are needed to ensure that it contains at least as many digits as the number of mandatory-digit-signs in the primary format token.

    Let S₃ be the result of replacing all decimal digits (0-9) in S₂ with the corresponding digits from the selected digit family.

    Let S₄ be the result of inserting grouping separators into S₃: for every (position P, character C) pair in the grouping separator template where P is less than the number of digits in S₃, insert character C into S₃ at position P, counting from the right-hand end.

    Let S₅ be the result of converting S₄ into ordinal form, if an ordinal modifier is present, as described below.

    The result of the function is then S₅.
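The formatting steps above, excluding the ordinal conversion, can be sketched as follows. All names are hypothetical: `min_digits` is the number of mandatory-digit-signs, `zero` is the zero digit of the selected digit family, `positions` maps separator positions to characters, and `grouping_size` is G when the separators are regular. The sketch assumes these values have already been derived from a valid primary format token.

```python
def format_decimal(value, min_digits, zero="0", positions=None, grouping_size=None):
    """Format an integer against an already-analysed decimal-digit-pattern."""
    positions = positions or {}

    def separator_at(p):
        if grouping_size is not None:
            # Regular template: the same character at every multiple of G.
            return next(iter(positions.values())) if p % grouping_size == 0 else None
        return positions.get(p)

    # Step 1: decimal notation of the absolute value.
    s1 = str(abs(value))
    # Step 2: pad on the left to the number of mandatory-digit-signs.
    s2 = s1.rjust(min_digits, "0")
    # Step 3: map ASCII digits onto the selected digit family.
    s3 = "".join(chr(ord(zero) + int(d)) for d in s2)
    # Step 4: insert separators, working from the right-hand end. Only
    # positions smaller than the digit count are visited, so a separator
    # is emitted only when there is a digit to its left.
    pieces = []
    for k, digit in enumerate(reversed(s3)):  # k digits lie to the right
        if k > 0:
            sep = separator_at(k)
            if sep is not None:
                pieces.append(sep)
        pieces.append(digit)
    s4 = "".join(reversed(pieces))
    # Negative values: a minus sign is prepended to the result.
    return ("-" if value < 0 else "") + s4
```

Under these assumptions, `format_decimal(123, 4)` gives `0123`, and `format_decimal(1234, 1, "0", {3: ";"}, 3)` gives `1;234`, matching the corresponding results for the pictures `0000` and `#;##0;`.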

  2. The format token A, which generates the sequence A B C ... Z AA AB AC....

  3. The format token a, which generates the sequence a b c ... z aa ab ac....

  4. The format token i, which generates the sequence i ii iii iv v vi vii viii ix x ....

  5. The format token I, which generates the sequence I II III IV V VI VII VIII IX X ....

  6. The format token w, which generates numbers written as lower-case words, for example in English, one two three four ...

  7. The format token W, which generates numbers written as upper-case words, for example in English, ONE TWO THREE FOUR ...

  8. The format token Ww, which generates numbers written as title-case words, for example in English, One Two Three Four ...

  9. Any other format token, which indicates a numbering sequence in which that token represents the number 1 (one) (but see the note below). It is implementation-defined which numbering sequences, additional to those listed above, are supported. If an implementation does not support a numbering sequence represented by the given token, it must use a format token of 1.

    In some traditional numbering sequences additional signs are added to denote that the letters should be interpreted as numbers; these are not included in the format token. An example (see also the example below) is classical Greek where a dexia keraia (x0374, ʹ) and sometimes an aristeri keraia (x0375, ͵) is added.

For all format tokens other than a decimal-digit-pattern, there may be implementation-defined lower and upper bounds on the range of numbers that can be formatted using this format token; indeed, for some numbering sequences there may be intrinsic limits. For example, the format token ① (circled digit one, ①) has a range imposed by the Unicode character repertoire — zero to 20 in Unicode versions prior to 3.2, or zero to 50 in subsequent versions. For the numbering sequences described above any upper bound imposed by the implementation must not be less than 1000 (one thousand) and any lower bound must not be greater than 1. Numbers that fall outside this range must be formatted using the format token 1.

The above expansions of numbering sequences for format tokens such as a and i are indicative but not prescriptive. There are various conventions in use for how alphabetic sequences continue when the alphabet is exhausted, and differing conventions for how roman numerals are written (for example, IV versus IIII as the representation of the number 4). Sometimes alphabetic sequences are used that omit letters such as i and o. This specification does not prescribe the detail of any sequence other than those sequences consisting entirely of decimal digits.
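As an illustration only, the following sketch shows one common convention for each of the alphabetic and roman sequences. The function names are hypothetical, the continuation `z aa ab` and the subtractive roman form (`IV` rather than `IIII`) are choices that this specification does not prescribe, and the upper bound of 3999 for roman numerals is an assumed implementation limit. Values outside the supported range fall back to the format token 1.

```python
def alphabetic(n, first="a"):
    """Format n in the sequence a b ... z aa ab ... (bijective base 26)."""
    if n < 1:
        return str(n)  # outside the range: fall back to format token 1
    letters = []
    while n > 0:
        n, r = divmod(n - 1, 26)
        letters.append(chr(ord(first) + r))
    return "".join(reversed(letters))

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
         (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
         (5, "V"), (4, "IV"), (1, "I")]

def roman(n, upper=True):
    """Format n as a roman numeral using the subtractive convention."""
    if not 1 <= n <= 3999:
        return str(n)  # outside the assumed range: fall back to format token 1
    out = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        out.append(symbol * count)
    text = "".join(out)
    return text if upper else text.lower()
```

With these conventions, `alphabetic(7)` gives `g` and `roman(57)` gives `LVII`.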

Many numbering sequences are language-sensitive. This applies especially to the sequence selected by the tokens w, W and Ww. It also applies to other sequences: for example, different languages using the Cyrillic alphabet use different sequences of characters, each starting with the letter #x410 (Cyrillic capital letter A). In such cases, the $lang argument specifies which language's conventions are to be used. If the argument is specified, the value should be either an empty sequence or a value that would be valid for the xml:lang attribute (see the XML 1.0 Recommendation). Note that this permits the identification of sublanguages based on country codes (from ISO 3166-1) as well as identification of dialects and regions within a country.

The set of languages for which numbering is supported is implementation-defined. If the $lang argument is absent, or is set to an empty sequence, or is invalid, or is not a language supported by the implementation, then the number is formatted using the default language from the dynamic context.
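The language fallback can be sketched as follows. The names are hypothetical: `supported` stands for the implementation-defined set of languages and `default_language` for the default language from the dynamic context. An absent argument and an empty sequence are both modelled as `None`, and an invalid value is treated simply as one that is not in the supported set. Whether a value such as `en-GB` should match a supported `en` is not addressed here.

```python
def effective_language(lang, supported, default_language):
    """Choose the language whose numbering conventions will be used."""
    # Absent, empty, invalid and unsupported values all take the same path:
    # the default language from the dynamic context is used and no error
    # is raised.
    if lang is None or lang not in supported:
        return default_language
    return lang
```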

The format modifier must be a string that matches the regular expression ^([co](\(.+\))?)?[at]?$. That is, if it is present it must consist of one or both of the following, in order:

  • either c or o, optionally followed by a sequence of characters enclosed between parentheses, to indicate cardinal or ordinal numbering respectively, the default being cardinal numbering

  • either a or t, to indicate alphabetic or traditional numbering respectively, the default being implementation-defined.
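The structure of the format modifier can be illustrated with Python's `re` module. The function name `parse_modifier` and the keys of the returned dictionary are hypothetical, and a `ValueError` stands in for the dynamic error. The pattern is the one given above with named groups added, and `fullmatch` replaces the `^` and `$` anchors.

```python
import re

# Same grammar as ^([co](\(.+\))?)?[at]?$, with named groups for readability.
MODIFIER = re.compile(r"(?:(?P<kind>[co])(?:\((?P<variant>.+)\))?)?(?P<letters>[at])?")

def parse_modifier(modifier):
    """Split a format modifier into its components."""
    m = MODIFIER.fullmatch(modifier)
    if m is None:
        raise ValueError("invalid format modifier: %r" % modifier)
    return {
        "ordinal": m["kind"] == "o",  # cardinal is the default
        "variant": m["variant"],      # text between the parentheses, or None
        "letters": m["letters"],      # "a", "t", or None (implementation-defined default)
    }
```

An empty modifier is valid and selects cardinal numbering. A modifier such as `to` is rejected because the components are out of order.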

If the o modifier is present, this indicates a request to output ordinal numbers rather than cardinal numbers. For example, in English, when used with the format token 1, this outputs the sequence 1st 2nd 3rd 4th ..., and when used with the format token w outputs the sequence first second third fourth ....

The string of characters between the parentheses, if present, is used to select between other possible variations of cardinal or ordinal numbering sequences. The interpretation of this string is implementation-defined. No error occurs if the implementation does not define any interpretation for the defined string.

It is implementation-defined what combinations of values of the format token, the language, and the cardinal/ordinal modifier are supported. If ordinal numbering is not supported for the combination of the format token, the language, and the string appearing in parentheses, the request is ignored and cardinal numbers are generated instead.
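The fallback from ordinal to cardinal numbering can be sketched for the format token 1 as follows. This is an illustration under stated assumptions: the function names are hypothetical, only English ordinals are implemented, and every other language is treated as unsupported for ordinal numbering.

```python
def english_ordinal(n):
    """Append the English ordinal suffix to a decimal number."""
    last_two = abs(n) % 100
    if last_two in (11, 12, 13):
        suffix = "th"  # 11th, 12th, 13th, 111th, ...
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(last_two % 10, "th")
    return "%d%s" % (n, suffix)

def format_token_1(n, lang, ordinal):
    """Format n with the token 1, honouring an ordinal request if possible."""
    if ordinal and lang == "en":
        return english_ordinal(n)
    # Ordinal numbering not supported for this combination: the request is
    # ignored and the cardinal number is produced instead, without error.
    return str(n)
```

For example, `format_token_1(21, "en", True)` gives `21st`, while the same request with an unsupported language gives `21`.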

The use of the a or t modifier disambiguates between numbering sequences that use letters. In many languages there are two commonly used numbering sequences that use letters. One numbering sequence assigns numeric values to letters in alphabetic sequence, and the other assigns numeric values to each letter in some other manner traditional in that language. In English, these would correspond to the numbering sequences specified by the format tokens a and i. In some languages, the first member of each sequence is the same, and so the format token alone would be ambiguous. In the absence of the a or t modifier, the default is implementation-defined.

Examples

The expression format-integer(123, '0000') returns "0123".

The expression format-integer(123, 'w') might return "one hundred and twenty-three".

Ordinal numbering in Italian: The specification "1;o(-º)" with $lang equal to "it", if supported, should produce the sequence:

1º 2º 3º 4º ...

The specification "Ww;o" with $lang equal to "it", if supported, should produce the sequence:

Primo Secondo Terzo Quarto Quinto ...

The expression format-integer(21, '1;o', 'en') returns "21st".

The expression format-integer(14, 'Ww;o(-e)', 'de') might return "Vierzehnte".

The expression format-integer(7, 'a') returns "g".

The expression format-integer(57, 'I') returns "LVII".

The expression format-integer(1234, '#;##0;') returns "1;234".

Error Conditions

A dynamic error is raised if the format token is invalid, that is, if it violates any mandatory rules (indicated by an emphasized must or required keyword in the above rules). For example, the error is raised if the primary format token contains a digit but does not match the required regular expression.

Notes

Note the careful distinction between conditions that are errors and conditions where fallback occurs. The principle is that an error in the syntax of the format picture will be reported by all processors, while a construct that is recognized by some implementations but not others will never result in an error, but will instead cause a fallback representation of the integer to be used.

The following notes apply when a decimal-digit-pattern is used:

  • If grouping-separator-signs appear at regular intervals within the format token, then the sequence is extrapolated to the left, so grouping separators will be used in the formatted number at every multiple of the grouping size. For example, if the format token is 0'000 then the number one million will be formatted as 1'000'000, while the number fifteen will be formatted as 0'015.

  • The only purpose of optional-digit-signs is to mark the position of grouping-separator-signs. For example, if the format token is #'##0 then the number one million will be formatted as 1'000'000, while the number fifteen will be formatted as 15. A grouping separator is included in the formatted number only if there is a digit to its left, which will only be the case if either (a) the number is large enough to require that digit, or (b) the number of mandatory-digit-signs in the format token requires insignificant leading zeros to be present.

  • Grouping separators are not designed for effects such as formatting a US telephone number as (365)123-9876. In general they are not suitable for such purposes because (a) only single characters are allowed, and (b) they cannot appear at the beginning or end of the number.

  • Numbers will never be truncated. Given the decimal-digit-pattern 01, the number three hundred will be output as 300, despite the absence of any optional-digit-sign.

The following notes apply when ordinal numbering is selected using the o modifier.

In some languages, the form of numbers (especially ordinal numbers) varies depending on the grammatical context: they may have different genders and may decline with the noun that they qualify. In such cases the string appearing in parentheses after the letter c or o may be used to indicate the variation of the cardinal or ordinal number required.

The way in which the variation is indicated will depend on the conventions of the language. For inflected languages that vary the ending of the word, the approach recommended in the previous version of this specification was to indicate the required ending, preceded by a hyphen: for example in German, appropriate values might be o(-e), o(-er), o(-es), o(-en).

Another approach, which might usefully be adopted by an implementation based on the open-source ICU localization library, or any other library making use of the Unicode Common Locale Data Repository, is to allow the value in parentheses to be the name of a registered numbering rule set for the language in question, conventionally prefixed with a percent sign: for example, o(%spellout-ordinal-masculine), or c(%spellout-cardinal-year).