Rare edge cases can derail loosely coupled data mapping applications. This is especially true when you are consuming large datasets available over the Internet and have little or no influence over the source data. In this article we describe a debugging technique that lets developers working on data mapping and transformation projects quickly identify and accommodate unexpected data in a stream from a remote source.

The Problem

Last summer we wrote a series of blog posts describing how to work with the Groupon API to retrieve a subset of offers in all Groupon cities and format the list for a web browser or mobile device. We concluded with a command line to run a MapForce data mapping that calls the Groupon API over 150 times (once for each Groupon city), then filters the data to extract deals sold on the Internet instead of a physical location, and formats the results in HTML using StyleVision. Every morning we run the command line in a batch file that saves the HTML output on a local server so our colleagues can check it out with any Web browser to find interesting offers from all over the country.

The mapping ran fine for more than two months until one day it failed with this error message: “Source-value “” of type dateTime could not be converted into target-type dateTime.”

The specific explanation is that somewhere in the mapping where we expected a dateTime, we received an empty value. On a more abstract level, the error suggests a potential defect in the logic of our mapping strategy. Every time we call the Groupon API we receive a well-formed XML data stream enclosed in a <response> element, but the API specs do not include an XML Schema defining the data that may be returned. When we developed our mapping we needed to analyze the raw data and select the output we wanted, so our first step was to call the API to capture all the Groupon deals for one large metro area.
We assumed we would get a large enough data sample to include every possible option in the API response. After our mapping ran successfully for two months, the API finally delivered a rare edge case that did not fit the pattern we expected.

Debugging Tools

MapForce provides debugging help. We can run our data mapping using the MapForce built-in execution engine to see more details in the Messages window. The lines labeled Related location are hyperlinked back to components in the mapping where the error occurred. Clicking on the result error takes us to a format-dateTime function. We can either click the “” error or trace the value connector to identify the input element to the format-dateTime function. Either way, we locate the element that triggered the error.

The suspect element resides in the input component that captures all the data returned by our calls to the Groupon API before any filtering or conversion takes place. When we designed the mapping, the endAt element in our sample data always reported the ending date and time for each Groupon offer, but for some reason we must have received an empty value in this field. If the error had occurred while running the mapping against a local input file, we could simply examine the file contents, but in this case the data came from multiple URLs and was only held temporarily until it was mapped to the output component.

Fortunately, we can apply a trick to easily modify the mapping and preserve all data received from the Groupon API. We simply copy the input component and paste a duplicate into the mapping. We can connect the response element from the original to the duplicate, which simultaneously maps all the child elements between the components. Our original input component is now connected to two output components. We can select which output component will be generated by the MapForce built-in execution engine by clicking the eye icon at the top right corner of any output component.
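Outside MapForce, the same idea (keep an unmodified copy of every response before any filtering or conversion) can be sketched in a few lines. This is only an illustration in Python, not the mapping itself: the `archive_raw` helper, the file naming, and the sample payload are all hypothetical.

```python
import tempfile
from pathlib import Path


def archive_raw(payload: bytes, city: str, archive_dir: Path) -> Path:
    """Save one unmodified API response so it can be inspected later."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / f"{city}.xml"
    target.write_bytes(payload)  # no parsing, no filtering: keep the bytes as received
    return target


# Hypothetical payload standing in for one city's response.
sample = b"<response><deals><deal><endAt></endAt></deal></deals></response>"
saved = archive_raw(sample, "example-city", Path(tempfile.mkdtemp()) / "raw")
```

Because the archived copy is written before anything is interpreted, a value that later breaks a conversion is still on disk to examine.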
The new output component simply saves a copy of everything in the input component. When we examine the raw data using XMLSpy, sure enough we find an empty element where we expected a date and time:

The Solution

Now that we know an offer might have no specific end time, we can plan for that possibility in the mapping. In the revised treatment of the endAt element, we do an if-test before the original format-dateTime function and provide an alternate output when the endAt element is empty.

We had to work fast because all Groupon data is time sensitive. The edge case would eventually expire and disappear from the data stream. This experience showed us how important it is to have powerful debugging tools and to use them creatively, even after you think a data mapping project is running successfully!

Altova MapForce is available in a free trial – the next edge case you solve could be your own.

Editor’s Note: Our original series on mapping data from the Groupon API ran in three parts you can see by clicking the links here: Part 1 of Processing the Groupon API with Altova MapForce describes how to create dynamic input by collecting data from multiple URLs. Processing the Groupon API with MapForce – Part 2 describes how we filtered data from the API and defined the output to extract only the most interesting details. Processing the Groupon API – Part 3 describes formatting the output as a single HTML document optimized for desktop and mobile devices, and reviews ways to automate repeat execution.
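The if-test described under The Solution has a direct equivalent in ordinary code. The following Python sketch is an analogy rather than the MapForce mapping: the `format_end_at` function, the fallback text, and the assumed timestamp layout (`YYYY-MM-DDTHH:MM:SSZ`) are hypothetical, and only the endAt element name comes from the article.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def format_end_at(deal: ET.Element, fallback: str = "No end time listed") -> str:
    """Format a deal's endAt value, or return a fallback when it is empty."""
    raw = (deal.findtext("endAt") or "").strip()
    if not raw:
        # The edge case: endAt is present but empty (or missing entirely).
        return fallback
    # Assumed timestamp layout; adjust to whatever the real feed delivers.
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.strftime("%Y-%m-%d %H:%M")


# Hypothetical sample elements for illustration.
deal_ok = ET.fromstring("<deal><endAt>2011-11-01T04:59:59Z</endAt></deal>")
deal_empty = ET.fromstring("<deal><endAt></endAt></deal>")
```

Testing for the empty value first means the conversion only ever runs on input it can handle.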
Tags: charts, data mapping, FlexText, MapForce, XML charts
In this article we use stats from NFL.com and ESPN.com to show how easy it can be to process and analyze online data in new ways – even when it uses different metrics and is only available in textual format.

We have seen in previous blog posts how easy it is to gather data from the Internet that is widely available in XML formats. But what about interesting data that is available online but not in an XML format, or data that is buried in legacy data processing systems and only available in textual report format?

One such example involves quarterback ratings. The NFL has used a Passer Rating that rates quarterbacks solely based on a passer’s completions, attempts, passing yards, touchdowns, and interceptions. ESPN introduced a new rating system this year called the Total QBR (Quarterback Rating). The Total QBR incorporates more data, including an expected points average and a clutch play index, that ESPN claims gives a more accurate measure of a quarterback’s performance. Let’s compare the rankings that these systems produce to see if we can garner some useful information.

For this example we’ll be using the data importing and analysis tools of the Altova MissionKit to compare the ratings. If you want to try this out yourself, the MissionKit is available to download for a 30 day free trial from the Altova web site. You can access the files used in this example here.

The first thing we need is the raw data to analyze. Let’s use the entire 2010 season as a data source. We can get the table with Passer Ratings from NFL.com and then copy and paste it as a new text file. We can access a similar table of Total Quarterback Ratings from the ESPN web site and create a second text file. We now have two text files with tables of data in different orders. The next step is to combine the tables into one file and generate charts. First, we need a schema file for the destination of the data.
In XMLSpy, we can create an XSD file quickly, and graphically, to contain a series of QB nodes with child nodes of first and last name, team, passer rating and rank, and total QBR and rank. Now, in MapForce, we open the text documents and use FlexText to parse the text and change it into a list of categories. We then build a mapping file in MapForce to map the data from the text files to the destination XML file. Built-in functions make it easy to extract the first and last names from the Player string, and a value-map will change the team abbreviation to a string (ARI is changed to Arizona Cardinals, ATL to Atlanta Falcons, etc.). We set the Priority Context in the test of our filters to make sure we get the correct set of data for each unique quarterback. Once we execute the mapping, we can save the resulting XML data file and use it as the source file in StyleVision to design a stylesheet. In this stylesheet, we create a table of the top ten ranked passers and charts showing the Passer Rating and the Total QBR graphically. Now that we have a visual representation of the rankings of the two rating systems, we can examine their differences and try to see which works better. For example, Peyton Manning was tenth in passer rating, but was second in Total QBR. This can be explained by the Total QBR taking clutch points into account and knowing that Peyton Manning had a few late game comebacks in the 2010 season. Since we now have a collection of files (the XSD file built in XMLSpy, the FlexText and mapping files from MapForce, and the stylesheet design created in StyleVision), we can update the text data files easily to analyze new sets of quarterback data. Later in the season, we can update the text tables with 2011 data, and allow the data to flow through the mappings and into the stylesheet to update the charts and see the rankings for the current season. 
This example focuses on numbers from the NFL, but this method can easily be adapted to other data sets and data sources that are accessed as text files as well as in other formats. You can learn more about how to use the products in the Altova MissionKit by taking our free online training courses.
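For readers who think in code, the name-splitting and value-map steps described above can be sketched in Python. This is an illustration under assumptions, not the MapForce mapping: it assumes the Player string looks like "First Last", the function names are hypothetical, and the lookup table holds only the two abbreviations named in the article.

```python
# Only the two abbreviations mentioned in the article; a real table lists every team.
TEAM_NAMES = {
    "ARI": "Arizona Cardinals",
    "ATL": "Atlanta Falcons",
}


def split_player(player: str) -> tuple:
    """Split a 'First Last' player string into (first, last)."""
    first, _, last = player.strip().partition(" ")
    return first, last


def team_name(abbrev: str) -> str:
    """Value-map: expand a team abbreviation, passing unknown codes through."""
    return TEAM_NAMES.get(abbrev.strip().upper(), abbrev)


def merge_ranks(passer_rank: dict, qbr_rank: dict) -> list:
    """Join the two rankings on player name, keeping players found in both."""
    merged = []
    for player in sorted(passer_rank.keys() & qbr_rank.keys()):
        first, last = split_player(player)
        merged.append({
            "first": first,
            "last": last,
            "passer_rank": passer_rank[player],
            "qbr_rank": qbr_rank[player],
        })
    return merged


# Ranks taken from the example in the article.
merged = merge_ranks({"Peyton Manning": 10}, {"Peyton Manning": 2})
```

Joining on the player name plays the role that the filters play in the mapping: each quarterback ends up with exactly one combined record.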
Tags: data mapping, database charts, database reports, DatabaseSpy, MapForce, StyleVision
Anyone who manages paid keyword search knows it is hard work! You can look at vast reports of raw statistics and quickly get lost in trivia. At Altova we designed a better way to analyze and manage the performance data for our Google Adwords campaigns. We can creatively query the numbers to:

· Quickly aggregate results for subcategories of campaigns, for instance by product, geographical region, or any other grouping
· Easily identify trends over time

The chart below illustrates these advantages by collecting data for a single Altova product – SemanticWorks – from multiple campaigns over six individual months.

Starting Out

Like many keyword advertisers, we were viewing statistics in Adwords, downloading CSV files, then spending hours massaging and manipulating the data in spreadsheets to identify and format the information we required. We wanted more immediate and in-depth reporting of keyword performance while retaining full control of the process and managing everything internally. SQL queries of a database of keyword statistics offer a powerful and flexible alternative. In the remainder of this post we explain how the database design, data mapping, and reporting features of the Altova MissionKit can be applied to create an architecture to efficiently track paid keyword performance.

Database Design

Our choices were to implement a keywords database on an existing database platform already running in the company, an express edition of a commercial database, or an open-source database, since the Altova MissionKit works with SQL Server®, MySQL®, Oracle®, IBM DB2®, PostgreSQL®, Sybase®, and Microsoft® Access®. We chose SQL Server for our database platform. We connected with DatabaseSpy and used the graphical database Design Editor to create the table shown below. Most columns correspond to fields in a keywords report.
In order to store multiple rows for each individual keyword – one row for every month of statistics – the table also includes columns for the month and year.

Populating the Table

The Google Adwords online interface lets users create reports of keyword statistics for specific date ranges and download them as CSV files. We downloaded a separate CSV file containing our performance data for each month. We used MapForce to map values from the CSV files to columns in the database table and insert the month and year data for each row. The string functions at the bottom center of the mapping diagram remove percent signs and commas from fields we want to treat as numerical data. By doing this in the mapping, we don’t have to massage the columns of data in the CSV files before importing them. Since the CSV files for each month all have the same structure, the mapping needs only minor revisions to import each new month’s data: update the constants at the top that define the starting row id, month, and year.

MapForce processes the mapping with its built-in execution engine, reading the CSV input and generating SQL INSERT statements for each row of data. MapForce then allows users to execute the entire generated SQL script by clicking a toolbar icon or from a selection in the Output menu:

Querying the Database

Back in DatabaseSpy, we can query the database from the SQL Editor window. This query reports the top ten performing keywords for SemanticWorks in October 2011. For data privacy, some fields in the Results chart are hidden. To get additional interesting results, the SQL statement can be easily modified. For instance, the ORDER BY line can sort for highest cost, most clicks, or any other characteristic. The WHERE clause combines data from multiple campaigns. The LIKE operator treats the percent signs around SemanticWorks as wildcard characters to match any campaign with SemanticWorks anywhere in its name.
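The clean-up, load, and query steps above can be tried end to end in a small sketch. This is not our production setup: it uses Python with an in-memory SQLite database instead of MapForce and SQL Server, and the table layout, column names, campaign names, and sample figures are all made up for illustration. SQLite also limits rows with LIMIT, where SQL Server would use TOP.

```python
import csv
import io
import sqlite3

# Hypothetical CSV excerpt; campaigns, keywords, and numbers are invented.
SAMPLE_CSV = '''Campaign,Keyword,Clicks,Impressions,CTR
US SemanticWorks,rdf editor,"1,200","48,000",2.50%
EU SemanticWorks,owl editor,300,"10,000",3.00%
US XMLSpy,xml editor,"5,000","90,000",5.56%
'''


def to_number(text: str) -> float:
    """Remove percent signs and thousands separators, then parse as a number."""
    return float(text.replace("%", "").replace(",", ""))


def load_month(conn, csv_text: str, month: int, year: int) -> None:
    """Insert one month's report, adding month and year to every row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS keywords (campaign TEXT, keyword TEXT, "
        "clicks REAL, impressions REAL, ctr REAL, month INTEGER, year INTEGER)"
    )
    for row in csv.DictReader(io.StringIO(csv_text)):
        conn.execute(
            "INSERT INTO keywords VALUES (?, ?, ?, ?, ?, ?, ?)",
            (row["Campaign"], row["Keyword"], to_number(row["Clicks"]),
             to_number(row["Impressions"]), to_number(row["CTR"]), month, year),
        )


def top_keywords(conn, product: str, month: int, year: int, limit: int = 10):
    """Top keywords by clicks for every campaign whose name contains product."""
    cursor = conn.execute(
        "SELECT keyword, clicks FROM keywords "
        "WHERE campaign LIKE ? AND month = ? AND year = ? "
        "ORDER BY clicks DESC LIMIT ?",
        (f"%{product}%", month, year, limit),  # % on both sides: match anywhere
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
load_month(conn, SAMPLE_CSV, 10, 2011)
top = top_keywords(conn, "SemanticWorks", 10, 2011)
```

Changing the ORDER BY column or the LIKE pattern in `top_keywords` corresponds to the query variations discussed above.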
Other queries could add a geographic identifier such as US or EU, or match on an entirely different column such as adgroup. Of course, all these options depend on a consistent and predictable campaign and adgroup naming system. We created a DatabaseSpy Project to collect all our favorite SQL queries for sharing and convenient reuse. Here is the query we used to generate the chart right in DatabaseSpy that appears at the top of this post: This query goes beyond simple SQL reporting to perform calculations on a subset of the data and format the results.

Database Reports

We designed reports for the executive team using Altova StyleVision, based on the queries and charts we had already designed in DatabaseSpy. We simply copied our queries from the DatabaseSpy SQL Editor window and added them as sources in the StyleVision Design Overview window. Saving our report design in a StyleVision SPS stylesheet makes it easy to regenerate an updated version every month. Here is the HTML output for a SemanticWorks Keyword Trends report based on the query above, displayed in the StyleVision Preview window:

If you follow the conventional wisdom for building your own paid keyword campaigns, you will develop segmented campaigns with many small, highly specialized ad groups, and you may also find yourself overwhelmed by the data in Adwords reports. If you’d like to try managing your own keywords the way we describe here, a fully functional trial of the Altova MissionKit is available.
Tags: cloud, diff merge tool, DiffDog, WebDav
Techy folks generally have a good diff tool they rely on to compare and sync files and directories. But what happens when, as more and more info is bound for the cloud, your data lives on servers accessed via URL? There are myriad applications today that live on servers accessed via HTTP – but let’s take a look at a common example: SVN. Subversion (SVN) repositories include WebDAV as a commonly used server option. WebDAV is a natural protocol for SVN because it is concerned with hierarchy, structured metadata, and versions. Since WebDAV is an extension of HTTP, it gives any HTTP-aware client easy access to basic information about files and folders, and that includes DiffDog – Altova’s diff/merge tool for files, directories, and databases. However, DiffDog knows a few tricks that set it apart from the other breeds.
Diff/Merge via WebDAV
SVN clients typically support command line differencing; however, a text-only representation of the changes in even one file can be hard to read and use. When you want to compare the trunk against a tagged version, the problem is magnified. There are several visual differencing tools available that can help with analyzing version changes in SVN. They have varying degrees of compatibility with how SVN works. Some tools are well integrated with the SVN command line. DiffDog includes all the common comparison options for a tool that is tightly integrated with SVN clients. Where it excels is its ability to talk to SVN servers.

Accessing an SVN repository with DiffDog using WebDAV is simple. The easiest starting point is to open Directory Comparison View and paste in the URLs of the folders you want to compare. In this case we’re comparing SVN branches on Projectlocker.com. The two sets of files open, and DiffDog provides a color-coded, browsable view of the differences between the two directories. Clicking on either one of a pair of files opens a detailed file comparison. DiffDog’s ability to distinguish between superficial changes to XML and meaningful changes is key in this situation – most development trees have some amount of XML in them. DiffDog also supports comparing Word docs and databases – so all bases are covered.

Of course, the folders you compare do not both have to be WebDAV SVN folders. It is equally straightforward to compare the SVN server with a local directory. DiffDog’s ability to access servers via HTTP (or FTP) opens a world of possibilities: comparing a local directory with a Google Docs directory, diffing a local Web server against files hosted on Amazon CloudFront, or even just synching photos between your local drive and your chosen backup service. If you’d like to try DiffDog, it’s available for a 30-day trial over on the Altova Web site.
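To give a feel for what any HTTP-aware client can learn from a WebDAV server, here is a minimal Python sketch. It is not how DiffDog works internally, and it compares names only, not file contents. It builds a WebDAV PROPFIND request without sending it, then parses a multistatus response to list a folder's children. The helper names, folder paths, and sample responses are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote, urlparse
from urllib.request import Request


def propfind_request(url: str) -> Request:
    """Build (but do not send) a PROPFIND request for a folder and its children."""
    return Request(url, method="PROPFIND", headers={"Depth": "1"})


def list_children(multistatus_xml: str, folder_path: str) -> set:
    """Return the names listed in a multistatus body, relative to folder_path."""
    base = folder_path.rstrip("/") + "/"
    children = set()
    for href in ET.fromstring(multistatus_xml).iter("{DAV:}href"):
        path = unquote(urlparse(href.text or "").path)
        if path.startswith(base) and path != base:  # skip the folder itself
            children.add(path[len(base):])
    return children


def compare(left: set, right: set) -> dict:
    """Name-level comparison of two folder listings."""
    return {
        "only_left": sorted(left - right),
        "only_right": sorted(right - left),
        "both": sorted(left & right),
    }


def sample_response(folder: str, names: list) -> str:
    """Hypothetical multistatus body standing in for a real server reply."""
    items = "".join(
        f"<D:response><D:href>{folder}{name}</D:href></D:response>"
        for name in [""] + names
    )
    return f'<D:multistatus xmlns:D="DAV:">{items}</D:multistatus>'


trunk = list_children(
    sample_response("/svn/demo/trunk/", ["a.xml", "b.txt"]), "/svn/demo/trunk/")
branch = list_children(
    sample_response("/svn/demo/branches/x/", ["a.xml", "c.txt"]), "/svn/demo/branches/x/")
result = compare(trunk, branch)
```

A visual tool adds the part this sketch leaves out: fetching each pair of files and showing exactly what changed inside them.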
Tags: API, data integration, MapForce, Twitter, XMLSpy
We found some interesting data when we dug below the surface of the iPhone 4S vs. Galaxy Nexus debate using the Twitter Search API.

In today’s world there is a vast quantity of data available online that can be used for research, market analysis, and competitive intelligence. While “Big Data” can be a problem for those who produce it, store it, and compile it, it is highly beneficial for those of us who are looking for answers.

Some of that data is fortunately available to be queried online, and, in particular, there is a vast quantity of data on social media interactions out there.

In this article we will explore how to use the Twitter Search API from MapForce, Altova’s data mapping/conversion/integration tool, to aggregate data on recent user submissions (“tweets”) on two highly popular topics – the Apple “iPhone 4S” vs. the “Galaxy Nexus” as the latest hot Android phone – and extract some statistical data about the users engaged in those discussions. One of the benefits of this abundance of data available to us today is that we can query it in interesting ways and extract new meaning from it. While there are undoubtedly many existing services that already provide trends over Twitter topics (e.g., Trendistic), those services only offer very simple trends and do not allow us to query any deeper.

But all of the underlying data is up for grabs if you are just willing to learn a tiny bit about web service APIs and how to use them to extract XML data for further processing. As a starting point, let’s use the Twitter Search API to query the stream of recent tweets for the last 100 postings that are about the “Galaxy Nexus”. The Usage Guidelines for Twitter Search tell us that using both words in a query will result in the use of the default operator, which is AND, so we are going to search for posts that contain “Galaxy AND Nexus”. So let’s try that and request the most recent 100 items:
Tags: MACPA, MapForce, XBRL, XMLSpy
XBRL is mandated for most public companies. So why are private organizations and non-profits jumping on the bandwagon? This case study examines a real-world success story. We were really excited when the folks at MACPA told us about their success working with XBRL. They set out to discover if XBRL could be used successfully (without a huge upfront investment) by small businesses and NPOs and ended up confirming not only that, but realizing benefits to their internal financial processes, as well.
Toward Ubiquitous XBRL
With close to 10,000 members, the Maryland Association of Certified Public Accountants (MACPA) is often looked to for their expertise on issues relevant to the field of accounting. The US Securities and Exchange Commission’s (SEC) mandate that public companies submit financial data in XBRL is one of those issues. Despite the potential of XBRL for reducing costs and increasing efficiency, many organizations are concerned about the time and expense that will be required to convert all of their financial data into XBRL, a process that can be further complicated when financial data is housed in multiple systems. MACPA set out to prove that these obstacles are easily surmountable: with the right tools, it’s possible to bring XBRL transformation in-house to not only comply with mandates, but realize greater efficiencies and transparency in various scenarios. In the process they discovered that tagging data in XBRL is valuable to private entities and non-profits as well as public companies facing a mandate. They took advantage of widely available XBRL software tools including the Altova MissionKit, which interfaces with multiple relational databases for XBRL mapping, tagging, and reporting. In the end, the project turned MACPA’s financial data into a force for driving efficiencies and accountability. Once their internal accounting data was mapped to XBRL, they were able to automate burdensome data collection, transformation, and analysis tasks to gain more insight into their financial data. For instance, MACPA used their XBRL data to populate their financial Key Performance Indicator (KPI) system, significantly reducing the amount of time and effort required to prepare the KPI documentation. This in turn enables them to run the system at more frequent intervals. They are also now able to automate previously onerous tax filing tasks by mapping the association’s financial data in XBRL to the 990 tax return. 
(With almost 1.5 million exempt organizations in the US filing hundreds of thousands of Form 990s each year, the efficiency gained by using XBRL could be significant.)
“Ubiquitous XBRL could do for accounting/taxation what barcodes did for retail.” – Skip Falatko, MACPA Director of Finance and Administration
This project not only enabled MACPA to learn about XBRL and advise their members, but also to automate and enhance the way they dealt with their own financial data. And utilizing affordable tools like the Altova MissionKit confirmed that handling XBRL in-house is the way to go.
“Why outsource tagging [your data in XBRL]? If you tag it in house, then you own the data and can use it in myriad different ways as a productivity tool.” – Tom Hood, MACPA CEO and Executive Director
Check out the complete case study to learn how MACPA brought XBRL transformation in-house to effect changes in efficiency and transparency. If you’re an accounting or technical professional who needs to learn more about XBRL, Altova offers free, self-paced online training and an educational XBRL whitepaper.