Altova Mailing List Archives


Re: [xsl] regex, shortest match

From: David Carlisle <davidc@--------->
To:
Date: 8/1/2008 8:43:00 AM
> I'm looking to parse sentences out of paras.

To be more exact, you are trying to parse a sentence with a regular
expression, which would cause you to fail a logic course, as natural
language must be the canonical example of a non-regular language :-)

> "((.+).)

"." is a metacharacter matching any character, so that is a sequence of
one or more characters followed by one more character, i.e. it matches any
sequence of 2 or more characters.
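A quick way to see this is to run the pattern against a short string. The sketch below uses Python's `re` module as a stand-in for the XPath regex dialect (the two agree on this pattern as far as I know); the name `first_match` is made up for the illustration.

```python
import re

# Pattern from the question, without the stray leading quote.
QUESTION_PATTERN = re.compile(r"((.+).)")

def first_match(text):
    """Return the text of the first match, or None if nothing matches."""
    m = QUESTION_PATTERN.search(text)
    return m.group(0) if m else None

# The greedy .+ swallows everything, so the "sentence" is the whole string.
print(first_match("One. Two. Three."))  # One. Two. Three.
print(first_match("ab"))                # ab
print(first_match("a"))                 # None (fewer than 2 characters)
```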




You need to define a sentence. If a sentence cannot contain a ".", but
always ends with a ".", then you could use [^.]*\.

but then

it cost $2.00.

is two sentences.
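The split can be reproduced directly. This is a small sketch in Python's `re` (assumed here to behave like the XPath dialect for this simple pattern); `naive_sentences` is a hypothetical helper name.

```python
import re

# "Anything that is not a full stop, then a full stop."
NAIVE = re.compile(r"[^.]*\.")

def naive_sentences(text):
    """Return every non-overlapping match of the naive pattern."""
    return NAIVE.findall(text)

# The decimal point is treated as a sentence end.
print(naive_sentences("it cost $2.00."))  # ['it cost $2.', '00.']
```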



So perhaps a sentence is terminated by "." followed by whitespace or
end of string:

 ([^.]|\.[^ \n\r\t])*\.(\s+|$)
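To check that this definition keeps the price example together, here is a minimal sketch using Python's `re` as a stand-in for the XPath regex engine. The groups are written as non-capturing `(?:...)`, which does not change what is matched, and `split_sentences` is a made-up name.

```python
import re

# A sentence: any run of non-dots, or dots NOT followed by whitespace,
# ended by a dot that IS followed by whitespace or end of string.
SENTENCE = re.compile(r"(?:[^.]|\.[^ \n\r\t])*\.(?:\s+|$)")

def split_sentences(text):
    """Return each match with surrounding whitespace trimmed."""
    return [m.group(0).strip() for m in SENTENCE.finditer(text)]

print(split_sentences("It cost $2.00. That was cheap."))
# ['It cost $2.00.', 'That was cheap.']
```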




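<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- plain text output: one line per sentence found -->
<xsl:output method="text"/>

<!-- split each para into sentences: text ending in "." followed by whitespace or end of string -->
<xsl:template match="para">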

new para
<xsl:analyze-string select="." regex="([^.]|\.[^ \n\r\t])*\.(\s+|$)">
<xsl:matching-substring>
 sentence: <xsl:value-of select="normalize-space(.)"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
 oops:  <xsl:value-of select="normalize-space(.)"/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>





 saxon9 para.xml para.xsl



new para

 sentence: It is sometimes desired to have a specific heading which should not be numbered.
 sentence: This corresponds to unnumbered list headers in lists (see sections 4.3).
 sentence: To facilitate this, an optional attribute text:is-list-header can be used.
 sentence: If true, the given header will not be numbered, even if an explicit list-style is given.


new para

 sentence: A text:style-name attribute references a paragraph style, while a text:cond-style-name attribute references a conditional-style, that is, a style that contains conditions and maps to other styles (see section 14.1.1).
 sentence: If a conditional style is applied to a paragraph, the text:style-name attribute contains the name of the style that was the result of the conditional style evaluation, while the conditional style name itself is the value of the text:cond-style-name attribute.
 sentence: This XML structure simplifies [XSLT] transformations because XSLT only has to acknowledge the conditional style if the formatting attributes are relevant.
 sentence: The referenced style can be a common style or an automatic style.


new para

 sentence: A text:class-names attribute takes a whitespace separated list of paragraph style names.
 sentence: The referenced styles are applied in the order they are contained in the list.
 sentence: If both, text:style-name and text:class-names are present, the style referenced by the text:style-name attribute is as the first style in the list in text:class-names.
 sentence: If a conditional style is specified together with a style:class-names attribute, but without the text:style-name attribute, then the first style in the style list is used as the value of the missing text:style-name attribute.


new para

 sentence: A page sequence element <text:page-sequence> specifies a sequence of master pages that are instantiated in exactly the same order as they are referenced in the page sequence.
 sentence: If a text document contains a page sequence, it will consist of exactly as many pages as specified.
 sentence: Documents with page sequences do not have a main text flow consisting of headings and paragraphs as is the case for documents that do not contain a page sequence.
 sentence: Text content is included within text boxes for documents with page sequences.
 sentence: The only other content that is permitted are drawing objects.
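For readers who want to experiment outside an XSLT processor, the matching/non-matching split that xsl:analyze-string performs can be approximated as below. This is only a sketch under assumptions: Python's `re` stands in for the XPath regex dialect, `" ".join(s.split())` stands in for normalize-space(), and the function name `analyze_string` is invented for the example.

```python
import re

SENTENCE = re.compile(r"(?:[^.]|\.[^ \n\r\t])*\.(?:\s+|$)")

def normalize_space(s):
    """Rough equivalent of XPath normalize-space()."""
    return " ".join(s.split())

def analyze_string(text):
    """Label matched stretches 'sentence' and the gaps between them 'oops'."""
    pieces = []
    pos = 0
    for m in SENTENCE.finditer(text):
        if m.start() > pos:  # text skipped before this match
            pieces.append(("oops", normalize_space(text[pos:m.start()])))
        pieces.append(("sentence", normalize_space(m.group(0))))
        pos = m.end()
    if pos < len(text):  # trailing text with no terminating "."
        pieces.append(("oops", normalize_space(text[pos:])))
    return pieces

print(analyze_string("First one. Second without stop"))
# [('sentence', 'First one.'), ('oops', 'Second without stop')]
```

Text that never reaches a terminating "." ends up in the "oops" branch, just as it would in the non-matching-substring of the stylesheet.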




But this would of course still fail if the sentence were to contain
". " coming from "D. P. Carlisle" or "dr. " or ...

If you try to parse natural language with a single regular expression,
it will _always_ fail. But you can cover more or less arbitrarily
complicated subsets of the language by making the regexp
correspondingly more complicated (and slow).
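As an illustration of that trade-off, the sketch below first shows the pattern breaking on initials, then adds one extra alternative that lets a single capital letter followed by ". " continue the sentence. Everything here is an assumption for demonstration: Python's `re` is used instead of the XPath dialect (which has no `\b`), and the names `BASIC`, `WITH_INITIALS` and `split_with` are invented.

```python
import re

BASIC = re.compile(r"(?:[^.]|\.[^ \n\r\t])*\.(?:\s+|$)")

# Extra alternative: a lone capital letter plus ". " does not end the sentence.
WITH_INITIALS = re.compile(r"(?:\b[A-Z]\.\s|[^.]|\.[^ \n\r\t])*\.(?:\s+|$)")

def split_with(pattern, text):
    """Return each match of the pattern with whitespace trimmed."""
    return [m.group(0).strip() for m in pattern.finditer(text)]

name = "Ask D. P. Carlisle about it."
print(split_with(BASIC, name))          # ['Ask D.', 'P.', 'Carlisle about it.']
print(split_with(WITH_INITIALS, name))  # ['Ask D. P. Carlisle about it.']

# The more complicated pattern now fails somewhere else instead.
plan = "See Plan B. Next one."
print(split_with(WITH_INITIALS, plan))  # ['See Plan B. Next one.']
```

The fix for initials merges two real sentences whenever the first one happens to end in a single capital letter, which is the kind of failure that remains however far the pattern is extended.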


David


