Altova MapForce 2025 Enterprise Edition

The Text Capture object enables you to extract text from a page of a PDF document. When you create a text capture, it appears in the model tree in the Schema pane and in the Output pane. You can optionally wrap a text capture in an XML element by giving the capture a name, which helps you organize the tree in the Output pane into a meaningful structure (see the code listing below). The default name of a text capture is Capture. For information about how to add objects to the model tree, see Insert an Object.

 

<Invoice>
   <Header>GARDENING SERVICES INVOICE</Header>
   <BillTo>Oswald Grim
Darkwood St. 17
Boston, MA 02128
+1-617-8767675</BillTo>
   <InvoiceNo>4560123</InvoiceNo>
   <Date>2023-09-05</Date>
   <...>
</Invoice>

 

When you click a text capture in the model tree of the Schema pane, the capture is immediately highlighted in the PDF View pane (screenshot below), which makes it easy to locate the capture on the page. The highlighted area has a text label that matches the capture's name shown in the model tree and in the Output pane. You can also click elements or their values in the Output pane to see which objects they refer to on the page of your PDF document. For details, see Step 2 of the tutorial.

[Screenshot: a text capture highlighted in the PDF View pane]

 

Properties

You can configure the following properties of the Text Capture object in the Properties pane: Output XML Name, Region, and Algorithm (see below).

 

Output XML Name

The Output XML Name property represents the name of the Text Capture object. If this property contains a value, the value appears as an element in the XML tree in the Output pane. If the object has no name, no additional element appears in the Output pane.

 

Region

The Region property refers to the area of the page that a text capture occupies. If no value is specified, the Region property refers to the whole page. The screenshot below shows the definition of a region as well as the locations of the left, right, top, and bottom parts of a text capture called Col1.

[Screenshot: region definition of the Col1 text capture, showing its left, right, top, and bottom boundaries]

Algorithm

The Algorithm section contains properties that control how text is extracted from the page. The algorithm is visual: it does not depend on the internal structure of a PDF document but on the order of the document's visual elements. The available properties are described below.

 

Baseline Tolerance

The Baseline Tolerance property specifies a distance tolerance for text baselines, which helps you deal with cases of minor text misalignment.
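
As an illustration only, the idea of a baseline tolerance can be sketched in Python. This is not PDFExtractor's implementation: the function name `group_by_baseline`, the `(baseline_y, text)` input format, and the rule that a fragment joins the current line when its baseline differs by at most the tolerance are all assumptions made for this sketch.

```python
from typing import List, Optional, Tuple


def group_by_baseline(
    fragments: List[Tuple[float, str]], tolerance: float
) -> List[List[str]]:
    """Group (baseline_y, text) fragments, given in reading order, into lines.

    Hypothetical rule: a fragment belongs to the current line when its
    baseline is within `tolerance` of the baseline that started the line.
    """
    lines: List[List[str]] = []
    line_y: Optional[float] = None
    for y, text in fragments:
        if line_y is not None and abs(y - line_y) <= tolerance:
            lines[-1].append(text)   # slightly misaligned, same line
        else:
            lines.append([text])     # start a new line
            line_y = y
    return lines


fragments = [(100.0, "Invoice"), (100.4, "No."), (100.2, "4560123"), (115.0, "Date")]
print(group_by_baseline(fragments, tolerance=0.5))
```

With a tolerance of 0, the same slightly misaligned fragments would each start a line of their own, which is the kind of problem a tolerance is meant to absorb.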

 

Paragraph Spacing

The Paragraph Spacing property specifies the baseline-to-baseline distance at which a new paragraph is assumed to begin. This property enables you to configure whether two consecutive lines are treated as parts of the same paragraph.

 

If the two consecutive lines belong to the same paragraph, no line break is inserted, and a space is inserted to join the lines, unless the Insert Space option is disabled (see Separate Words below). Setting this property to a high number enables you to treat the entire text as a single paragraph. The result will contain a line break for every paragraph, while non-paragraph line breaks in the PDF file will be converted to a single space by default.
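
The joining behavior described above can be sketched in Python. This is a simplified model, not PDFExtractor's code: the function `join_lines`, its `(baseline_y, text)` input, and the choice to start a new paragraph when the distance is greater than or equal to the threshold are assumptions for illustration. The `insert_space` flag stands in for the Insert Space and Glue Together options.

```python
from typing import List, Tuple


def join_lines(
    lines: List[Tuple[float, str]],
    paragraph_spacing: float,
    insert_space: bool = True,
) -> str:
    """Join (baseline_y, text) lines, given top to bottom, into paragraphs."""
    if not lines:
        return ""
    joiner = " " if insert_space else ""   # Insert Space vs. Glue Together
    result = lines[0][1]
    for (prev_y, _), (cur_y, text) in zip(lines, lines[1:]):
        if abs(cur_y - prev_y) >= paragraph_spacing:
            result += "\n" + text          # new paragraph: keep a line break
        else:
            result += joiner + text        # same paragraph: join the lines
    return result


address = [(100.0, "Oswald Grim"), (112.0, "Darkwood St. 17"), (140.0, "Boston, MA 02128")]
print(join_lines(address, paragraph_spacing=20.0))
print(join_lines(address, paragraph_spacing=1000.0))  # everything is one paragraph
```

Passing `insert_space=False` joins a URL that wraps across two lines without introducing a space at the break.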

 

Baseline Angle

The Baseline Angle property is used to extract vertical or angled text. The default value is 0°.

 

Angle Deviation

The Angle Deviation property specifies the extent to which the baseline of characters on the page can deviate from the Baseline Angle. The default is set to 180°, which means that every character will be taken into consideration. If you want to extract angled text, you can use this property to discard any non-angled text in the vicinity.
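
The relationship between Baseline Angle and Angle Deviation can be illustrated with a short Python sketch. The helper names and the per-character angle input are hypothetical, and the comparison shown is an assumed reading of the description above rather than the product's actual logic.

```python
def angular_difference(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def keep_character(
    char_angle: float,
    baseline_angle: float = 0.0,
    angle_deviation: float = 180.0,
) -> bool:
    """Return True if the character's baseline is within the allowed deviation."""
    return angular_difference(char_angle, baseline_angle) <= angle_deviation


# (text, baseline angle in degrees) pairs; values are made up for the example.
chars = [("H", 0.0), ("V", 90.0), ("W", 88.0)]

# Defaults: a deviation of 180 degrees accepts every character.
print("".join(c for c, angle in chars if keep_character(angle)))

# Baseline angle 90 with a deviation of 5: only near-vertical text remains.
print("".join(c for c, angle in chars if keep_character(angle, 90.0, 5.0)))
```

Because two directions can never differ by more than 180°, the default deviation excludes nothing, which matches the statement that every character is taken into consideration.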

 

Separate Words

The Separate Words property specifies whether spaces are preserved or removed. If you select the Insert Space option, all the existing spaces will be preserved. If you select Glue Together, all the spaces will be removed. The Glue Together option is particularly useful for URLs that have line breaks.

 

Whitespace Mode

The Whitespace Mode property helps handle cases in which the embedded fonts do not specify the width of a space character. It can also handle cases where the spacing between characters is overridden by typesetting. The property has the following values: Default, Relative Value, and Absolute Value (details below).

 

Default

The default option assumes that the width of a horizontal space is the width detected for the font.

 

Relative Value

The Relative Value option takes a width that is relative to the font character cell width. Typical values are between 0.2 and 0.5, which is what most proportional fonts use, or 1.0 for a monospaced font. Use this option if PDFExtractor does not extract whitespace correctly by itself.

 

Absolute Value

The absolute mode helps solve problems with fonts of different sizes used side-by-side. The absolute mode takes width in absolute units (e.g., points, millimeters). The recommended value in this mode would be roughly 20%-50% of the font size used. For example, for a 10pt font size, the recommended value would be between 2pt and 5pt.
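
The recommended range is simple arithmetic, shown here as a small Python helper. The function name `recommended_absolute_width` is hypothetical and the helper only restates the 20%-50% guideline; it is not part of MapForce.

```python
from typing import Tuple


def recommended_absolute_width(font_size_pt: float) -> Tuple[float, float]:
    """Return the (low, high) range, in points, covering 20% to 50% of the font size."""
    low = font_size_pt * 20 / 100
    high = font_size_pt * 50 / 100
    return (low, high)


print(recommended_absolute_width(10))  # (2.0, 5.0)
```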

 

 

© 2018-2024 Altova GmbH