Altova MapForce 2024 Enterprise Edition

The Split object (illustrated below) cuts a particular part of a page into pieces. The Split object can discard a fixed number of initial and/or final snippets of a region and supports different means of locating split positions. For details, see the Properties subsection below.

 

For information about how to add objects to the model tree, see Insert an Object.

PDFEX_SplitObject

Properties in the Properties pane

You can configure the following properties of the Split object:

 

 

Example 1: Find lines or edges

This example shows how to configure the Find lines or edges method. The goals of this example are as follows:

 

To extract data from the table

To exclude the top part of the page (which contains the header, company, client, and invoice details), the header row of the table, and the bottom part of the page from processing

 

To achieve the goals, we have configured the Split object in the following way:

 

The Skip Initial property has been set to 2.

The Skip Final property has been set to 1.

The Method has been set to Find lines or edges.

No value has been set for the Region; therefore, the whole page is treated as the region.

 

The algorithm has identified the first edge where the header row starts and the second edge where the header row ends. With Skip Initial set to 2, the first two pieces are discarded. Therefore, the upper part of the document, together with the header row of the table, has been excluded from processing (grayed-out top part in the screenshot below).

 

The Skip Final value (1) has caused the algorithm to exclude the Subtotal, Sales Tax, and Total cells (grayed-out bottom part in the screenshot below), because the first edge from the bottom of the region has been identified on the line where the Fence repair row ends. The rest of the table will be split into rows.

PDFEX_SkipInitial2
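The effect of Skip Initial and Skip Final in this example can be pictured with a small Python sketch. It is not MapForce code: the function name split_region, the coordinate values, and the assumption that the region is cut at every detected edge before any pieces are discarded are all illustrative.

```python
def split_region(top, bottom, edges, skip_initial=0, skip_final=0):
    """Cut the span [top, bottom] at each edge, then drop leading/trailing pieces.

    Hypothetical helper for illustration only; coordinates are in points,
    measured from the top of the page.
    """
    # Keep only edges strictly inside the region, in top-to-bottom order.
    cuts = sorted(e for e in edges if top < e < bottom)
    bounds = [top, *cuts, bottom]
    # Each pair of neighbouring boundaries is one piece of the region.
    pieces = list(zip(bounds[:-1], bounds[1:]))
    # Discard the first `skip_initial` and the last `skip_final` pieces.
    end = len(pieces) - skip_final
    return pieces[skip_initial:end]


# Made-up edge positions: header row starts at 300 and ends at 320,
# data rows end at 340, 360 and 380; the page spans 0..800.
rows = split_region(0, 800, [300, 320, 340, 360, 380],
                    skip_initial=2, skip_final=1)
```

With these made-up values, the two pieces above the first data row and the single piece below the last data row are dropped, which leaves the three data rows.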

 

Example 2: Find objects

This example shows how to configure the Find objects method. The goal of this example is to extract table data from the sample invoice illustrated below.

PDFEX_BookInvoice

The table shown in the screenshot above does not contain regular grid lines, which makes it difficult to identify correct split positions. In addition, the cells in the second column (No) and the cells in the third column (Description) overlap. To split the table into rows correctly, we have selected the Find objects method and configured it as follows:

 

The Background Color and Tolerance properties have default values (#FFF and 10%, respectively).

The Minimum Extent property has been set to 4pt, which helps eliminate objects smaller than this value.

Since there are no gaps that can be filled in, the Fill Gaps property has its default value (0pt).

The Edge to Find property has been set to Start, which means the objects will be split in locations where they start.

By trial and error, we have identified the ideal value of the Displace property, which is -3pt. This value has caused the split positions to move slightly upwards, which will prevent the data from being truncated.

No post-processing options have been defined.
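The interplay of the Minimum Extent, Fill Gaps, Edge to Find, and Displace settings listed above can be approximated with the following Python sketch. It is a simplified model, not the actual MapForce algorithm: the function name find_object_splits, the representation of objects as (start, end) extents along the split direction, the order in which the settings are applied, and the sample coordinates are all assumptions made for illustration.

```python
def find_object_splits(objects, min_extent=0.0, fill_gaps=0.0,
                       edge="start", displace=0.0):
    """Return split positions derived from object extents (in points).

    Hypothetical model: `objects` is a list of (start, end) extents of
    non-background content along the split direction.
    """
    # Assumed step 1: merge objects whose gap is no larger than fill_gaps.
    merged = []
    for start, end in sorted(objects):
        if merged and start - merged[-1][1] <= fill_gaps:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Assumed step 2: ignore objects smaller than the minimum extent.
    kept = [(s, e) for s, e in merged if e - s >= min_extent]
    # Pick the requested edge of each object and shift it by `displace`.
    index = 0 if edge == "start" else 1
    return [obj[index] + displace for obj in kept]


# Three made-up text rows plus a 1pt speck between the first two rows.
extents = [(100, 110), (115, 116), (120, 130), (140, 150)]
splits = find_object_splits(extents, min_extent=4, fill_gaps=0,
                            edge="start", displace=-3)
```

In this model the speck is dropped because it is smaller than 4pt, and each remaining split position lands 3pt above the start of its row, which mirrors how the negative Displace value keeps the row content from being truncated.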

 

Search region

Since there are no consistent lines along which the table could be split into rows, we use the Search region to identify reliable split positions, which will then be applied to the whole Region. The screenshot below shows that the Region contains all the rows of the table (light yellow area). The Region represents an area that we want to split. However, the Search region (bright yellow rectangle below) covers only the first column of the table, in which detecting objects works more reliably than in other parts of the table.

PDFEX_BookInvoiceSearch

If no Search region is used, the splitter will identify the split positions shown below, which will lead to incorrect results in the output.

PDFEX_BookInvoiceNoSearch
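The difference between the Region and the Search region can be sketched in Python as follows. This is an illustrative model only: the function name rows_from_search_region, the (x0, y0, x1, y1) box format, the rule that an object must lie entirely inside the Search region to count, and all coordinates (including the extra object in the Description column) are assumptions, not taken from the product.

```python
def rows_from_search_region(region, search, boxes):
    """Split `region` into rows using only objects found inside `search`.

    Hypothetical helper; all rectangles are (x0, y0, x1, y1) in points,
    with y growing downwards.
    """
    rx0, ry0, rx1, ry1 = region
    sx0, sy0, sx1, sy1 = search
    # Only objects lying entirely inside the search region are considered.
    starts = sorted({
        y0 for x0, y0, x1, y1 in boxes
        if sx0 <= x0 and x1 <= sx1 and sy0 <= y0 and y1 <= sy1
    })
    # Split positions are then applied across the full width of the region.
    inner = [y for y in starts if ry0 < y < ry1]
    bounds = [ry0, *inner, ry1]
    return [(rx0, top, rx1, bottom)
            for top, bottom in zip(bounds[:-1], bounds[1:])]


region = (50, 100, 550, 160)        # all table rows
first_column = (50, 100, 90, 160)   # narrow search region
boxes = [
    (55, 100, 85, 110), (55, 120, 85, 130), (55, 140, 85, 150),  # column 1
    (150, 100, 400, 110), (150, 112, 400, 118),                  # description
    (150, 120, 400, 130), (150, 140, 400, 150),
]
rows = rows_from_search_region(region, first_column, boxes)
rows_without_search = rows_from_search_region(region, region, boxes)
```

In this made-up layout, restricting detection to the first column yields three rows, whereas searching the whole region picks up the extra object in the Description column and produces a spurious fourth row.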

 

© 2018-2024 Altova GmbH