Altova Mailing List Archives


Re: [xsl] Tokenize question: tokenize on words, spaces and

From: Brandon Ibach <brandon.ibach@---------------.--->
To: xsl-list <xsl-list@-----.------------.--->
Date: 3/17/2011 4:20:00 AM
The main trick here seems to be simply constructing an appropriate
character class for each type of token and then matching sequences of
one or more of each.

The following does just that, though it also tosses in a twist to
handle words with embedded dashes, so that the dash won't break the
word into three separate tokens.  Note that the word pattern also keeps
a trailing dash attached to the word (and would glue "well--maybe"
into one token), so further adjustments along those lines may be
needed, depending on your requirements.

The use of Unicode character categories for the character classes
should ensure that this works for most languages, I think, though
non-English languages aren't my strong suit, so I make no guarantees.
:)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:f="urn:stylesheet-func" exclude-result-prefixes="xs f">
    <xsl:output method="text"/>
    <xsl:param name="s" select="'Oh, what a fun-filled day!'"/>
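    <!-- Returns every word, punctuation run and separator run in $string, in order. -->
    <xsl:function name="f:tokens" as="xs:string*">
        <xsl:param name="string"/>
        <!-- Three alternatives: a word (which may contain dashes), a run of
             punctuation/other characters, or a run of separators.
             The regex is wrapped in {'...'} so that the attribute value
             template does not treat the braces in \p{P} etc. as delimiters. -->
        <xsl:analyze-string select="$string"
                            regex="{'\w[-\w]*|[\p{P}\p{C}]+|\p{Z}+'}">
            <xsl:matching-substring>
                <xsl:sequence select="."/>
            </xsl:matching-substring>
        </xsl:analyze-string>
    </xsl:function>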
    <xsl:template match="/">
        <xsl:text>('</xsl:text>
        <xsl:value-of select="f:tokens($s)" separator="', '"/>
        <xsl:text>')</xsl:text>
    </xsl:template>
</xsl:stylesheet>
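
As one example of the kind of adjustment mentioned above, here is a rough
sketch of a stricter word pattern, \w+([-']\w+)*, which keeps a dash or
apostrophe only when it sits between word characters.  That way "don't"
stays whole, while a trailing dash or a "--" falls through to the
punctuation class.  The names f:tokens-strict and $token-regex and the
sample string are made up for illustration, and I haven't tried this on
every processor, so treat it as a starting point only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:f="urn:stylesheet-func" exclude-result-prefixes="xs f">
    <xsl:output method="text"/>
    <!-- Hypothetical sample input; apostrophes are doubled inside the XPath string literal -->
    <xsl:param name="s" select="'Don''t stop -- it''s a fun-filled day!'"/>
    <!-- Assumed variant of the regex: a word is \w+ followed by zero or more
         groups of (one dash or apostrophe, then \w+) -->
    <xsl:variable name="token-regex" as="xs:string">\w+([-']\w+)*|[\p{P}\p{C}]+|\p{Z}+</xsl:variable>
    <xsl:function name="f:tokens-strict" as="xs:string*">
        <xsl:param name="string"/>
        <xsl:analyze-string select="$string" regex="{$token-regex}">
            <xsl:matching-substring>
                <xsl:sequence select="."/>
            </xsl:matching-substring>
        </xsl:analyze-string>
    </xsl:function>
    <xsl:template match="/">
        <!-- One token per line, in brackets so that whitespace tokens stay visible -->
        <xsl:for-each select="f:tokens-strict($s)">
            <xsl:value-of select="concat('[', ., ']&#10;')"/>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

For the sample string this should print [Don't], [stop], [--], [it's],
[a], [fun-filled], [day] and [!], with a [ ] token wherever the input
has a space.  Holding the regex in a variable also sidesteps the {'...'}
wrapper, since braces in a variable's value are not re-parsed by the
attribute value template.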

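On the Unicode categories: in the XPath/XML Schema regex dialect, \w is
defined as every character outside \p{P}, \p{Z} and \p{C}, so the three
alternatives in the regex should between them cover any character.  A
small sketch to see which class a given character lands in; the sample
characters are my own picks, not anything from the original question.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <!-- Illustrative samples: e-acute, a CJK ideograph, inverted question
             mark, hyphen, no-break space, tab -->
        <xsl:for-each select="('&#xE9;', '&#x65E5;', '&#xBF;', '-', '&#xA0;', '&#x9;')">
            <!-- Print the decimal code point, then the first class that matches -->
            <xsl:value-of select="string-to-codepoints(.)"/>
            <xsl:text>: </xsl:text>
            <xsl:choose>
                <xsl:when test="matches(., '^\w$')">word</xsl:when>
                <xsl:when test="matches(., '^[\p{P}\p{C}]$')">punctuation/other</xsl:when>
                <xsl:when test="matches(., '^\p{Z}$')">separator</xsl:when>
                <xsl:otherwise>unmatched</xsl:otherwise>
            </xsl:choose>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

The accented letter and the ideograph should come out as "word", the
inverted question mark and the hyphen as "punctuation/other", and the
no-break space as "separator".  The tab should also come out as
"punctuation/other": tabs and newlines are control characters (category
Cc), not \p{Z}, so the regex groups them with punctuation rather than
with spaces.  That may or may not matter for your input.
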
-Brandon :)


On Wed, Mar 16, 2011 at 8:33 PM, Martin Holmes <mholmes@u...> wrote:
> Hi there,
>
> This is really a question for XPath regex gurus:
>
> I need to tokenize a string of text such that words, punctuation and spaces
> are split. So from this:
>
> Oh, what a great day!
>
> I need to get:
>
> ('Oh', ',', ' ', 'what', ' ', 'a', ' ', 'great', ' ', 'day', '!')
>
> I've been hacking away at this for a while, but regexps aren't my strong
> suit. Can anyone help?
>
> Cheers,
> Martin
>
>
> --~------------------------------------------------------------------
> XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
> To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
> or e-mail: <mailto:xsl-list-unsubscribe@l...>
> --~--
>
>
