Though PDF is a ubiquitous data format in business today, data contained in PDFs is not readily available for mapping to other systems. PDFs are typically designed for human-readable content with variable formatting and layouts, making structured data extraction extremely challenging. They may contain text, images, tables, and other elements, and the data is not organized in a machine-readable format. Typical PDF data extraction tools may not provide accurate results, especially for PDFs with complex layouts. That's where the MapForce PDF Extractor comes in.
The MapForce data mapping tool now includes the MapForce PDF Extractor, an easy-to-use utility that allows you to quickly define the structure of a PDF document and extract data from it. Then, that PDF data can be accessed for further transformation and conversion to other formats such as XML, JSON, databases, Excel, and so on, in MapForce. It is the ultimate tool for enabling PDF data integration and ETL projects.
Using visual tools in the MapForce PDF Extractor, you can define the structure of a PDF document and efficiently extract its data. PDF Extractor is a highly flexible tool that allows you to extract only portions of text instead of the whole document, mix and match pieces of information from different pages of the same PDF file, split tables into rows, and arrange data into groups.
The intuitive, straightforward design of the MapForce PDF Extractor makes it easy to define PDF document structure visually, using point-and-click and drag-and-drop functionality. At last, the vast volumes of data previously locked in PDFs are available for mapping to other formats.
When you load a sample PDF to create a template and define data extraction rules, the PDF is displayed next to a schema pane. The schema pane displays a tree structure that represents how the data will be extracted. The MapForce PDF Extractor includes a powerful suggestion engine that automatically identifies common document elements and attempts to detect their structure.
For instance, the suggestion engine will identify tables that exist in the document, which you can then opt to extract automatically. A split operator in the schema pane helps you define how to correctly divide the table into separate rows. For example, the suggestion engine can look for edges or lines to create the split, or it can split based on a fixed distance, and you can preview the result in the PDF view pane. At the same time, the suggestion engine captures columns and header text. Clicking on any object in the schema tree highlights the corresponding structure and data capture rules as they apply in the PDF document view.
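To make the fixed-distance split more concrete, here is a minimal conceptual sketch in Python. It is not MapForce code and does not reflect how PDF Extractor works internally; the `Fragment` class, the `split_rows_fixed` function, and the sample coordinates are all hypothetical and exist only to illustrate the idea of grouping text fragments into rows by vertical position.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with its position inside a table region (hypothetical model)."""
    text: str
    x: float  # distance from the left edge of the table region
    y: float  # distance from the top edge of the table region


def split_rows_fixed(fragments, row_height):
    """Group fragments into rows, assuming every row has the same fixed height."""
    rows = {}
    for fragment in fragments:
        # Fragments whose y falls in the same row_height band share a row.
        index = int(fragment.y // row_height)
        rows.setdefault(index, []).append(fragment)
    # Order rows top to bottom and cells left to right.
    return [
        [f.text for f in sorted(rows[i], key=lambda f: f.x)]
        for i in sorted(rows)
    ]


# Illustrative sample data, not taken from any real PDF.
fragments = [
    Fragment("Widget", 10, 2), Fragment("4", 120, 3),
    Fragment("Gadget", 10, 22), Fragment("7", 120, 21),
]
table = split_rows_fixed(fragments, row_height=20)
```

A split based on edges or lines would follow the same pattern, except that the row boundaries would come from the detected line positions rather than from a constant height.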
After the tabular data is extracted, you can adjust the extraction rules as necessary to exclude some fragments, adjust anchor assignments, define table boundaries, and so on. This can be accomplished using visual tools and helpful pull-down menus. You can preview the results of data extraction in the output tree to check for accuracy.
Other document elements can be captured and added to your template manually. To define rules for extracting data manually, simply select an area in the PDF to extract by capturing it in a rectangle. Then, select Text Capture from the right-click context menu. PDF Extractor adds the capture as an element in the document tree, and you can drag and drop it to the desired position in the tree.
As you work, the MapForce PDF Extractor builds an XML document representing the structure of your PDF template with sample data from the working PDF document in the output window. This helps you understand and perfect the results of the extraction that will become a template to use in MapForce.
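As a rough illustration of what an XML representation of extracted table data can look like, the following Python sketch builds a small XML document from rows of text. The element names (`Document`, `Table`, `Row`), the `rows_to_xml` function, and the sample values are assumptions made for this example; the actual structure produced by PDF Extractor depends on the schema tree you define in your template.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(header, rows):
    """Serialize table rows as XML, using header names as child element names."""
    root = ET.Element("Document")          # hypothetical root element name
    table = ET.SubElement(root, "Table")   # hypothetical container for the rows
    for row in rows:
        row_element = ET.SubElement(table, "Row")
        for name, value in zip(header, row):
            ET.SubElement(row_element, name).text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative sample data, not taken from any real PDF.
xml_text = rows_to_xml(
    ["Item", "Quantity"],
    [["Widget", "4"], ["Gadget", "7"]],
)
print(xml_text)
```

Checking a preview like this against the source PDF is a quick way to spot a misplaced column or a row that was split in the wrong place.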
Once you save your template in the MapForce PDF Extractor, you are ready to insert it as a source data component in a MapForce data mapping project. Common PDF conversion requirements include:
Of course, MapForce can also mix and match multiple source and target data formats, chain data mapping projects, and more. A rich library of data processing functions and a visual function builder make it easy to filter and process data before writing it to the destination(s).
With the PDF Extractor, MapForce finally makes critical business data previously locked in PDFs available for data mapping, data integration, and ETL processes.
“Altova MapForce provides excellent mapping capabilities that we can seamlessly embed within our core products. The extensible nature of the product means it covers all of our solution requirements.”