Click a case study title below to jump to the study.
- Medtronic: Data conversion after acquisitions on five continents
- GTT: Web scraping data on every U.S. town and city
- Merrill Corp: Getting 25,000+ unformatted contacts into a database
- Various clients: Deduping customer records
Medtronic: Data conversion after acquisitions on five continents
Following the acquisition of medical device manufacturers and sellers on five continents, the voluminous data of those companies had to be loaded into Medtronic's ERP system, SAP. Besides restructuring the acquired companies' varied data structures to conform to Medtronic's, the data also had to follow Medtronic's SKU and product-naming conventions. Foreign words had to be translated into English. Imperial units of measure had to be converted to metric. And, as with all data conversion projects, missing, incorrect, and duplicated information had to be identified and corrected.
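As a small illustration of the kind of transformation involved, a unit-normalization step might look like the sketch below (Python here purely for illustration; the conversion table and field handling are my assumptions, not Medtronic's actual schema or tooling).

```python
# Hypothetical sketch: normalize imperial measurements to metric
# during a data-conversion pass. Factors are standard; the unit
# codes and pass-through behavior are illustrative assumptions.

CONVERSIONS = {
    "in": ("cm", 2.54),        # inches -> centimeters
    "lb": ("kg", 0.453592),    # pounds -> kilograms
    "fl_oz": ("ml", 29.5735),  # fluid ounces -> milliliters
}

def to_metric(value, unit):
    """Return (converted_value, metric_unit); pass metric units through."""
    if unit in CONVERSIONS:
        metric_unit, factor = CONVERSIONS[unit]
        return round(value * factor, 4), metric_unit
    return value, unit  # already metric (or unknown): leave unchanged
```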
Medtronic organized the project as a multi-phased, multi-staged series of hygiene, conversion/load testing, and data transformation tasks. Transformations were to be made as a series of transitions through intermediate table structures. Quality checks and corrections were to be made before and after each transition.
The project had two managers, three business analysts, and almost a dozen developers. As Data Cleansing Analyst, I was the only quality assurance person on the project. I reported to the two project managers. My responsibilities can be summarized as:
- For every standard, composing a test script with versions for the different phases and stages as necessary.
- Running test scripts and verifying their success.
- Reporting and tracking errors found.
- Assisting in research to create cross-reference (XREF) tables to perform remediation.
- Assisting in identifying the same parts and products sold by the different sellers using different names.
As the sole person in my role, I needed automation to work efficiently and to reduce chances for mistakes.
The project was inherently complex, so creating a system of organization was critical. The application I would create needed to:
- Record every data standard, giving each a unique ID.
- Organize standards by their corresponding phase and stage.
- Associate each standard with the script that tested for it.
- Selectively combine scripts to test multiple standards at once.
- Track each error found by standard, script, table, field, and date.
- Report errors with all information to locate and correct them.
- Reduce opportunities for error when I selected and ran scripts and reported their results.
I built QUATRA (Quality Tracking app), an MS Access/Oracle/VBA application. Meeting the needs described above involved:
- Combining standard-specific scripts into a master script based on click-to-select on the UI. The standard-specific scripts were written as text files and stored in a strict file folder structure on the network.
- Recording discovered errors in an error table stored on Oracle. Linking to this table from QUATRA, I was able to automate my reports, creating Excel files attached to auto-generated emails. The body of the email gave summary statistics of the test results. The date of each report delivery was automatically logged in my app.
- Automatically logging the date on which each master script for a specific test was generated, assigning that date-time group (DTG) to the script, and saving it.
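The master-script assembly and DTG stamping can be sketched as follows. The original was VBA driven from the MS Access UI; this Python version, with its assumed one-file-per-standard naming scheme, is illustrative only.

```python
import datetime
from pathlib import Path

def build_master_script(script_dir, selected_ids, out_dir):
    """Concatenate the selected standard-specific SQL files into one
    master script whose name carries a date-time group (DTG).
    Assumes files are named <standard_id>.sql -- an illustrative
    layout, not the project's actual folder structure."""
    dtg = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    parts = []
    for std_id in selected_ids:
        sql = Path(script_dir, f"{std_id}.sql").read_text()
        parts.append(f"-- Standard {std_id}\n{sql.strip()}\n")
    master = Path(out_dir, f"master_{dtg}.sql")
    master.write_text("\n".join(parts))
    return master  # the DTG in the filename doubles as the log entry
```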
Other important features included:
- On the app’s screen, standards for which I had not yet created a script appeared in red as a reminder. The first screen of my app listed all such work still to be done across all phases and stages.
- For each standard, the app displayed the dates of first and latest test runs along with the error count of the latest test.
When designing QUATRA, I made two major procedural decisions.
Major decision #1: How to execute my scripts
Because ease of use can also make it easy to make mistakes, I decided not to execute scripts from the app itself. I could have done this with MS Access pass-through queries to Oracle. Instead, I decided to manually copy and paste a new master script (automatically opened on the screen after creation) into Oracle, where I had to manually select and log into the appropriate environment of Development, Test, or Production. This manual task made me very conscious of the environment I was about to use.
Major decision #2: How to report errors to facilitate their correction
The second procedural decision I made was to record only one error per row in my error table rather than recording all the errors of a given row of source data in a single record. The principal fields in my error table were:
- Standard_ID (the unique ID of the standard violated)
- Standard_Short_Description (what the field should have been like)
- The table and field in which the error was found
…where the Standard_ID would indirectly provide the details on the phase and stage in question.
With one line for each error found, an error report could be very long. However, such a report could be easily filtered by table name and field so work could be distributed within a data correction team. In this way, one person could focus on correcting all errors in just one field in one table before starting on something else.
As mentioned earlier, the alternative to my error table design would have been to report in one row the entire contents of any record with errors along with details on all the errors found in that record. There could be a dozen or more fields from the source record itself with several dozen more fields to record details on each error. This would require different error table structures to accommodate the many different error types that might be encountered during each phase/stage. The complexity would mean that the error table itself could be a source of error.
I decided against this full-record approach for two reasons. The first I mentioned: the need to create (and test) many complex error table structures. The other reason was that I knew from first-hand experience that a person is far more efficient and accurate when concentrating on correcting one type of thing at a time rather than multiple types of things in multiple places. I had the opportunity to test this anew when one of my business analysts insisted on the full-record approach for her team's area of responsibility. Her team ended up taking longer to complete corrections than any other team and had a much higher rate of omitted corrections. When that business analyst was later reassigned to other responsibilities, I was able to deliver only the more effective single-error version of my error reports.
Development time pays off
Designing and building QUATRA took me several weeks. Since this was at the start of the project when managers and developers were still planning, I had the time to do this. The investment paid off handsomely. I had enough time to do all my work well and even to assist on extra projects (such as experimenting on how to update South Korean addresses after a change in the postal code system). Most importantly, I avoided all the mistakes I might otherwise have made, such as omitting tests, reporting the wrong results, or forgetting to compose a new script in time. QUATRA had my back.
GTT: Web scraping data on every U.S. town and city
For sales and marketing purposes, I was asked to get the population and land area of every city in the U.S. with a population of 5,000 or more. Although the U.S. Census Bureau offers that information on its website, I discovered its data was not reliable for my purposes.
The problem with U.S. Census Bureau data for political entities below the state level is that those entities are defined for the purposes of conducting the census and often do not correspond to political borders.
Consider the incorporated town of Waterford, Connecticut. It is a single political entity with no hamlets, villages, or smaller government entities within it. Despite this, the U.S. Census Bureau records two entries for the town: "Waterford" and "Waterford CDP" (Census Designated Place). Adding the two numbers to get the town's actual population would be simple if we were interested in only this one town, but when dealing with almost 30,000 towns and cities across the U.S., it is a big problem. It became an even bigger problem when I saw that Census Designated Place names often bear little resemblance to the geographic names of the political entities they belong to. The easy way out would be to treat CDPs as towns, but that was out of the question because my client sold to governments.
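If CDP names always matched their town names, folding the two entries together would be trivial, as this sketch shows (Python for illustration; the population figures are made up). The very reason this shortcut failed at scale is that real CDP names often do not match this neatly.

```python
def combine_cdp_rows(rows):
    """Fold 'X CDP' populations into town 'X' where both exist.
    rows: dict mapping place name -> population. Illustrative only:
    real CDP names frequently differ from the town name, which is
    why this naive merge was not usable across 30,000 places."""
    combined = {}
    for name, pop in rows.items():
        base = name[:-4] if name.endswith(" CDP") else name
        combined[base] = combined.get(base, 0) + pop
    return combined
```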
Similarly, the U.S. Census Bureau often ignored political subunits of towns and cities. In some states, a town, township, or city can have political subunits with purchasing power of their own. In New York State, for example, these were hamlets and villages, which were required to provide their own fire protection (a fact that interested my client). Due to the nature of my client's marketing, they wanted these subunits reported as towns in their own right, with their own data, in addition to the parent entity and its data. Detailed searches for these political subunits on the Census website would require me to know which larger political units had such subunits in the first place (which I did not). In any event, as mentioned, the Census often did not report on these subunits.
To make up for the shortcomings of the Census data, I had to find an additional source to deal with the "census designated place" red herring and the lack of coverage of political subunits. This additional source needed to let me specify a state and then drill down to the lowest political entities, however many there might be, including unincorporated areas.
My solution was to web scrape Wikipedia. By state, I could compile a database of U.S. county subunits and independent cities down to their lowest level. Later, where the Census Bureau reported two "towns" of the same name (one an actual town, one a CDP), I would use the information from Wikipedia, which reported data for the actual political entity.
While many would question the trustworthiness of crowd-sourced Wikipedia, the esoteric nature of my subject made deliberate errors unlikely. As spot-checking and post-search comparisons would reveal, Wikipedia did provide reliable data.
At the bottom of this case study is an example of the structured data Wikipedia provided for web scraping the population, land area, and county information of a town --- a town as a political entity and not a Census Designated Place of smaller area and population.
The process was an iterative one. For each U.S. state and territory, I web scraped their Wikipedia "List Of Cities (and Counties)" and related pages to get the links to the pages on the individual entities. I then web scraped those pages for the population and land area data. I also recorded county information to account for multiple towns of the same name in the same state. Where there were gaps in my results due to variations in template layout for an article, I composed new scraping commands, sometimes having to hand-copy information in the end. In the rare instance where a Wikipedia page reported on a CDP in addition to the political entity, my scraping code omitted the CDP.
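The per-town extraction step can be sketched as below. The actual work used web scraping tooling against live Wikipedia pages with several pattern variants; this Python version runs against one simplified, hypothetical infobox snippet, and the HTML layout shown is an assumption about one common template form, not the only one encountered.

```python
import re

# Simplified, hypothetical fragment of a Wikipedia town infobox.
# Real infobox templates vary by state and era, which is why the
# project needed repeated passes with different patterns.
INFOBOX = """
<tr><th>County</th><td>New London</td></tr>
<tr><th>Total</th><td>14.9 sq mi (38.6 km<sup>2</sup>)</td></tr>
<tr><th>Population</th><td>19,571</td></tr>
"""

def scrape_infobox(html):
    """Extract county, land area, and population from one infobox."""
    county = re.search(r"<th>County</th><td>([^<]+)</td>", html)
    area = re.search(r"<th>Total</th><td>([\d.]+)\s*sq mi", html)
    pop = re.search(r"<th>Population</th><td>([\d,]+)</td>", html)
    return {
        "county": county.group(1) if county else None,
        "land_area_sq_mi": float(area.group(1)) if area else None,
        "population": int(pop.group(1).replace(",", "")) if pop else None,
    }
```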
One-to-one comparisons of the end result with Census Bureau data I downloaded earlier found no discrepancies that could not be explained. Explanations consisted of older census data being in the Wikipedia article or the Census data having a town split into a CDP and another administrative division.
I used the Wikipedia data to supplement that of the Census Bureau. When the Census reported a higher population than the corresponding Wikipedia data, I used the Census data because I knew it was the more recent (and the CDP scenario would not account for a higher population being reported for a town). However, when Wikipedia reported a higher number for a town than the Census, I used Wikipedia because I knew it was free of CDPs and the Census data was not. In no instance did I find a Wikipedia-reported population alarmingly larger than what the Census Bureau reported.
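The reconciliation rule above amounts to taking the larger of the two figures, since each source wins exactly when it reports the higher number. A one-line sketch (Python for illustration; the actual work was done in my database tooling):

```python
def choose_population(census_pop, wiki_pop):
    """Reconciliation rule from the text: the Census figure wins when
    it is higher (it is more recent, and a CDP split could not inflate
    it); the Wikipedia figure wins when IT is higher (the Census figure
    may cover only the CDP slice of the town)."""
    return max(census_pop, wiki_pop)
```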
The process took more than a week, but it produced a database of 29,671 "towns" across the 50 states and U.S. territories. This included unincorporated areas, counties with no subdivisions, and incorporated subdivisions of towns.
When completed, I compared the results with government entities already recorded in my client's NetSuite database. I appended new customers and selectively merged existing client records with the standardized place names and other data from my research.
With my new, substantiated respect for Wikipedia and being grateful for its vital help, I became a financial contributor to the Wikimedia Foundation after this project.
Merrill Corp: Getting 25,000+ unformatted contacts into a database
My client purchased a membership directory of over 25,000 corporate executives in the U.S. and Canada, and with it the rights to copy and electronically store the information for its own use. However, the data was not delimited. Worse still, the layout of the information varied. Contact information appeared in non-standard ways, with phone number and email switching places from entry to entry. There could be more than one phone or email. "Area of expertise", if it was recorded at all, would be in an unformatted paragraph that might also contain education history, organizations, and other biographical data --- with no standard order. Sometimes there was no paragraph break between one company's entry and the next.
Parsing this by hand would take weeks, but I did not intend to parse by hand. Despite the inconsistencies, the data offered reliable patterns:
- Company names were always in capital letters.
- The line with city, state and ZIP code was identifiable by the distinctive comma, two-letter state/province, and the tell-tale U.S. or Canadian postal code.
- Email addresses and phones had a distinct, consistent format.
- Biographical categories were prefaced with keywords like "Education:" and "Organizations:" with colons at the end.
My solution would need to be an iterative approach that first identified what was certain. Then I could infer and test for what was before and after that point of certainty with smaller and smaller levels of granularity.
My concept was to insert the equivalent of XML tags around each piece of information as I identified it, starting with a general <company> and </company> to delineate the blocks pertaining to a single company, and then work my way through to slightly more specific tags like <contact info> and <bio>. Finally, I could parse these intermediate clusters into their very specific information types: <address1>, <address2>, <city-state-zip>, <email>, <education>, etc.
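The first, most-certain tagging pass might look like the following sketch. The original was VBA running inside MS Word; this Python version, with simplified patterns and a made-up sample entry, only illustrates the "certainty first" idea.

```python
import re

# Heuristics from the case study, simplified: company names are
# all-caps lines; city/state/ZIP lines end in a two-letter state or
# province code plus a U.S. ZIP or Canadian postal code; emails have
# a distinctive format.
COMPANY = re.compile(r"^[A-Z][A-Z .,&'-]+$")
CITYLINE = re.compile(r",\s*[A-Z]{2}\s+(\d{5}(-\d{4})?"
                      r"|[A-Z]\d[A-Z] ?\d[A-Z]\d)$")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def tag_lines(lines):
    """Wrap each line we can identify with certainty in an XML-like tag;
    leave uncertain lines untouched for a later, finer-grained pass."""
    out = []
    for line in lines:
        stripped = line.strip()
        if COMPANY.match(stripped):
            out.append(f"<company>{stripped}</company>")
        elif CITYLINE.search(stripped):
            out.append(f"<city-state-zip>{stripped}</city-state-zip>")
        elif EMAIL.search(stripped):
            out.append(f"<email>{stripped}</email>")
        else:
            out.append(line)
    return out
```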
I would do all the parsing within MS Word so I could take advantage of the paragraph breaks (where they existed) to help me parse. Using VBA within MS Word, I created and tested three scripts, each corresponding to the increasingly granular taggings/parsings described in the previous paragraph. Each script included "certainty tests" before it would apply tags around a block of text identified as possibly being of a certain type.
Once all my scripts ran and tags were applied (almost entirely correctly, it would turn out), I used web scraping software to extract each piece of information based on its opening and closing tags, putting each piece into the appropriate field of an MS Access table. In other words, I used a reverse-XML approach.
The advantage of the web scraping software was that it did not matter what order the tags were in, such as whether email was always after phone number (which sometimes it was not). Everything within a given tag pair was extracted and put into the appropriate database table field. Later, when a given company had multiple phone numbers and emails, I used VBA to parse the extra information from its current field to an "Other" field.
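The order-independent, reverse-XML extraction can be sketched like this (Python for illustration; the actual extraction used web scraping software feeding an MS Access table, and the tag and field names here are just examples):

```python
import re

def extract_fields(block, tags):
    """Pull the contents of each tag pair from one <company> block,
    regardless of the order in which the tags appear. Repeated tags
    (extra phones, emails) are collected as lists so overflow can be
    moved to an 'Other' field in a later step."""
    record = {}
    for tag in tags:
        record[tag] = re.findall(rf"<{tag}>(.*?)</{tag}>", block, re.DOTALL)
    return record
```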
Out of the more than 25,000 contacts, about 80 had to be manually parsed. I knew which they were in the original document because the <company name> extraction was always correct even if the rest of the extraction failed. Once the new data was complete and formatted in MS Access, I selectively appended it to my division's marketing database in Oracle.
Various clients: Deduping customer records
At several of my clients' companies, the customer database was plagued with a high number of duplicate customer records. Over time, sales details and other information would get divided among the duplicates, giving an incomplete picture of customers and their activity. Several obstacles stand in the way of fixing this:
- Ideally, duplicates should be prevented at the point of data entry. This is not a feature of major CRM products.
- Software designed specifically for de-duping is costly and still requires a conscientious person to decide when and how to merge records. Address standardization services can also be costly and not cost-effective when the database is too small for volume discounts.
- De-duping is complicated by all the variations of how an address and name can be recorded (in addition to misspellings).
- Unique identifiers, such as SSNs, are often not available, especially for companies. Phones and email domains are useful if available, but there can be more than one email domain for a customer.
My solution for any of my clients would therefore need to:
- Be free or inexpensive.
- Provide "common denominators" among records.
- Prioritize and organize what a human reviewer must do.
My work to prepare for human review includes creating a copy of the customer table and adding special "de-dupe" fields that hold portions of information from the fields-of-record. For example, I might have one de-dupe field containing the company name without vowels and another containing the phone number without area code and non-numeric characters. I populate these fields using VBA. I then use VBA to compare every customer record to every other customer record using these de-dupe fields and the fields-of-record --- sometimes alone and sometimes in combinations --- and so identify possible dupes. As my VBA finds matches between two records, a weighted value based on the field(s) matched is added to a cumulative confidence score for those two records being possible dupes of each other. The record IDs of the two records and their final confidence score are recorded in a table.
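The matching pass can be sketched as follows. The real implementation was VBA against a copied customer table; the de-dupe fields, weights, and record layout below are illustrative assumptions, not the exact values used for any client.

```python
import re
from itertools import combinations

# Example weights for matches on each de-dupe field; real weights
# were tuned per client.
WEIGHTS = {"name_novowel": 40, "phone_core": 35, "zip": 15}

def strip_vowels(name):
    """De-dupe field: company name lowercased, vowels and spaces removed."""
    return re.sub(r"[aeiou\s]", "", name.lower())

def phone_core(phone):
    """De-dupe field: phone digits only, area code dropped."""
    return re.sub(r"\D", "", phone)[-7:]

def score_pairs(records):
    """Compare every record to every other record; accumulate a
    weighted confidence score per pair. Returns {(id_a, id_b): score}
    for pairs with at least one field match."""
    scores = {}
    for a, b in combinations(records, 2):
        s = 0
        if strip_vowels(a["name"]) == strip_vowels(b["name"]):
            s += WEIGHTS["name_novowel"]
        if phone_core(a["phone"]) == phone_core(b["phone"]):
            s += WEIGHTS["phone_core"]
        if a["zip"] == b["zip"]:
            s += WEIGHTS["zip"]
        if s:
            scores[(a["id"], b["id"])] = s
    return scores
```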
Toward the end of the VBA process, all possible related dupes are grouped together. To do this, I use an algorithm to select a "best" record that would be the surviving record among any set of dupes. I then associate that potential survivor with all its suspected matches above a certain confidence score I have chosen. I also associate with each designated survivor record all the matches of its matches above the cut-off score. Think "six degrees of Kevin Bacon", but using only one degree in this case. As a result, my later human eyeball review will review any record only once and only for its most likely group of matches.
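The one-degree grouping ("six degrees of Kevin Bacon", with one degree) can be sketched like this. Survivor selection is a separate step; this Python illustration only shows how each record lands in at most one review group, under my assumed pair-score input format.

```python
def group_dupes(pairs, cutoff):
    """Group suspected dupes for review: each group is a seed record,
    its direct matches above the cutoff, and the matches of those
    matches (one extra degree, per the case study). Each record is
    placed in at most one group so it is reviewed only once.
    pairs: {(id_a, id_b): confidence_score}."""
    neighbors = {}
    for (a, b), score in pairs.items():
        if score >= cutoff:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    groups, seen = [], set()
    for seed in sorted(neighbors):
        if seed in seen:
            continue
        group = {seed} | (neighbors[seed] - seen)
        for match in list(group - {seed}):        # one more degree out
            group |= neighbors.get(match, set()) - seen
        groups.append(sorted(group))
        seen |= group
    return groups
```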
Based on the number of possible dupes within ranges of confidence scores, I let the client choose an additional, higher cut-off confidence score below which possible dupes would not be considered. Deciding on that cut-off is an iterative process of looking at examples of suspected dupes at various possible cut-off points. If I am doing the final dupe review, the cut-off point will often be based as much on my billed time as on the utility of a particular confidence score cut-off.
I then provide an interface for the human reviewer (often me) to compare the possible survivor record to its possible duplicates. The reviewer can overrule what the VBA has decided and unselect a record from any group of possible dupes. The reviewer then copies any top-level customer information worth preserving from the to-be-purged records to the survivor record. The transfer of activity data is done in the next step.
At the end of reviewing all records (filtered by the previously agreed-upon confidence score cut-off), the result is a table of SURVIVOR records and corresponding PURGE IDs. If the CRM system cannot use such a table, merges can be done with update queries on the CRM database's foreign tables, replacing each PURGE ID with the record ID of its SURVIVOR record.
My solution has evolved over time as online resources have become available and unavailable. One resource no longer freely available is the U.S. Postal Service's address lookup feature: you are now limited to only a few free lookups each day, and its API still forces you to search for addresses one at a time.
A recently available resource comes from the U.S. Census Bureau. Its website lets you upload a file of up to 1,000 addresses, each with an ID number of your choice (no names), to compare against its database of residential and business addresses. The file it returns tells you whether it could find the address in its records, what that address-of-record is, and how close the match was (EXACT, NON-EXACT, TIE, or NO MATCH). Of course, this is only useful for U.S. addresses, but obtaining a standardized address for any U.S. address (when possible) is an excellent way to match two or more customer records confidently. The returned file also lets me add geocoding information (e.g., Census block and Census tract) to my deduped surviving records. See the slideshow below to learn how this Census Bureau feature works.
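Preparing the upload file is straightforward, as sketched below. The row layout shown (caller-chosen ID, street, city, state, ZIP, no header, no names) reflects my reading of the batch service's expected format; verify it against the Census Bureau's current documentation before uploading.

```python
import csv
import io

def make_geocoder_batch(customers):
    """Write customer addresses in the layout assumed for the Census
    Bureau's batch geocoder: one row per address with a caller-chosen
    ID, street, city, state, and ZIP -- no header row and no names."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for cust in customers:
        writer.writerow([cust["id"], cust["street"], cust["city"],
                         cust["state"], cust["zip"]])
    return buf.getvalue()
```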