WILL BEAUCHEMIN
Sound data is the foundation.

To be useful, data must be updated, archived, and purged periodically.  It must be de-duped, standardized, and neatly organized.  Of course, it must also be obtained:  extracted, scraped, or copied.  All of that is what I do in data wrangling.  My more notable projects are described on this page.

DE-DUPING
To identify duplicate records ("dupes"), I do more than look for exact matches:  I use algorithms and confidence scores to identify definite and likely dupes.  Human judgment is still needed in the end, though.  To facilitate that review, I create a simple user interface where the surviving record can be tagged and information salvaged from the records that will be purged.  My case study at the bottom of this page goes into more detail on how I provide de-duping services.


WEB SCRAPING
Web scraping is the extraction of data from a website using automation.  The most common use is for online databases, automating the searches a human visitor would otherwise need to run.  

Web scraping has gotten a bad reputation from unscrupulous practitioners who violate clear terms-of-use limitations and burden web site host servers with poorly planned scrapes.  The discipline itself can be practiced ethically.  Many web sources, notably those of the U.S. federal government, have few if any restrictions on the use of their data.   See my case studies on this page and my "Creative Problem-Solving" page to learn more.

GTT:  Web scraping data on every U.S. town and city

SITUATION

For sales and marketing purposes, I was asked to get the population and land area of every U.S. city and town of a particular population size.  The U.S. Census Bureau provides information for "Cities and Towns" by state, but as my analysis showed, that data was neither fully reliable nor complete for my project.

ANALYSIS

  • Census data by political entity sometimes reported partial data for a town, with the remainder reported for the town's "Census Designated Place" (CDP).  Example:  Waterford, Connecticut in 2010.  There is only one town in Connecticut with that name, incorporated or otherwise, but the population is divided between two versions of "Waterford" (see graphic below).  Complicating matters, CDP names and geographic definitions did not always correspond to a town, so simply combining CDP and town data for the same "town" was not a reliable option.  In other instances, the CDP was an unincorporated area that needed to be included in my results as its own entity, so I could not always exclude CDP data from consideration.

  • In some states, a town, township, or city could have political subunits with their own purchasing power; in New York State, for example, these were hamlets and villages.  Due to the nature of my client's marketing, they wanted these subunits reported in addition to the total data for their parent entity.  However, the Census only reports data for the parent entity at the "Cities and Towns" level, and detailed searches for these political subunits on the Census web site were impractical:  you would have to already know the name of the larger political unit to search for them.

For these and other reasons, I had to use another source to make up for the shortcomings of the Census data.  Specifically:

  • A source that would let me specify a state and then drill down to the lowest political entities, including unincorporated areas.

My solution was to web scrape Wikipedia.  By state, I would compile a database of U.S. county subunits and independent cities down to their lowest level.  Where the Census Bureau reported two "towns" of the same name (one an actual town, one a CDP), I would use the information from Wikipedia, which would report data for the actual political entity.

While some may debate the trustworthiness of Wikipedia, the esoteric nature of my subject made deliberate errors unlikely.  As spot-checking and post-search comparisons would reveal, Wikipedia did provide reliable data. 

Shown right is an example of the structured data Wikipedia provided for web scraping the population, land area, and county information of a town.


SOLUTION

The process was an iterative one:  I web scraped Wikipedia's "List of cities" and "List of townships" articles for each U.S. state to get the links to the individual entity articles, then I web scraped those articles for the population and land area data.  I also recorded county information to account for multiple towns of the same name in the same state.  Where there were gaps in my results due to variations in an article's template layout, I composed new scraping commands, sometimes having to hand-copy information in the end.
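
Purely as an illustration, here is a minimal sketch of that two-stage approach in Python, using the requests and BeautifulSoup libraries.  The list URL, table selectors, and infobox labels shown are assumptions for the sketch; real Wikipedia templates vary far more than this handles, which is exactly why the work was iterative.

    # Illustrative sketch only -- assumes the requests and beautifulsoup4 libraries.
    # The selectors and infobox labels below are examples; real templates vary.
    import requests
    from bs4 import BeautifulSoup

    HEADERS = {"User-Agent": "town-data-research (contact: example@example.com)"}

    def get_town_links(list_url):
        """Stage 1: collect links to individual town/city articles from a 'List of ...' page."""
        soup = BeautifulSoup(requests.get(list_url, headers=HEADERS).text, "html.parser")
        links = []
        for table in soup.select("table.wikitable"):
            for a in table.select("td a[href^='/wiki/']"):
                links.append("https://en.wikipedia.org" + a["href"])
        return links

    def get_town_facts(article_url):
        """Stage 2: pull population, land area, and county from the article's infobox."""
        soup = BeautifulSoup(requests.get(article_url, headers=HEADERS).text, "html.parser")
        facts = {}
        for row in soup.select("table.infobox tr"):
            label, value = row.find("th"), row.find("td")
            if not label or not value:
                continue
            text = label.get_text(" ", strip=True).lower()
            if "population" in text and "population" not in facts:
                facts["population"] = value.get_text(" ", strip=True)
            elif "land" in text and "land_area" not in facts:
                facts["land_area"] = value.get_text(" ", strip=True)
            elif "county" in text and "county" not in facts:
                facts["county"] = value.get_text(" ", strip=True)
        return facts

A polite scrape of this kind would also throttle its requests and cache pages locally, in keeping with the ethics note above.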

Random comparisons of Wikipedia data to U.S. Census data for towns large and small confirmed Wikipedia's reliability.  A comparison of the end result with Census data found no unexplainable discrepancies in the political entities reported, although Wikipedia did return more levels (as explained earlier).  I also found no anomalies in the Wikipedia data, and where a source was cited (which was almost always), it was the U.S. Census Bureau.

The process took days, but I created a database of 29,671 "towns" in the 50 United States, including unincorporated areas, counties with no subdivisions, and incorporated subdivisions of towns.  After data clean-up, I finalized the database using two rules:

  • When the Census reported for both a town and a "Census Designated Place" (CDP) of the same name, I used the Wikipedia figure when it reported a higher population than the Census Bureau's town data.

  • Otherwise, I used the Census Bureau's data.  (The decision rule is summarized in the sketch below.)
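
The decision rule is simple enough to express in a few lines.  A sketch, assuming each record carries the Census town population, the Census CDP population (if any), and the Wikipedia population; the field names are illustrative only:

    # Sketch of the merge rule described above; field names are illustrative.
    def choose_population(census_town_pop, census_cdp_pop, wikipedia_pop):
        """Return (value, source) for a single town."""
        if census_cdp_pop is not None and wikipedia_pop is not None:
            # Town split between a "town" and a CDP of the same name:
            # trust Wikipedia when it reports more than the Census town figure.
            if wikipedia_pop > (census_town_pop or 0):
                return wikipedia_pop, "Wikipedia"
        return census_town_pop, "U.S. Census Bureau"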

There were no fees for the sources and methods I used, but I did become a financial contributor to the Wikimedia Foundation after this project.

Merrill Corporation:  Getting 25,000+ unformatted contacts into a database

SITUATION

The client obtained a membership directory of over 25,000 corporate executives in the U.S. and Canada.  However, the data was not delimited.  Worse still, the layout of the information varied:  contact information appeared in non-standard ways.  Phone number and email varied in their order of appearance, and there could be more than one of each.  Area of expertise, if it appeared at all, would be in an unformatted paragraph that might also contain education history, organizations, and other biographical data, with no standard order.  Sometimes there was no paragraph break between one company's entry and the next.  Parsing this by hand would take weeks.

But I did not parse by hand.

ANALYSIS

  • Company names were always in capital letters.

  • The city, state, and ZIP code line was identifiable by the distinctive comma, two-letter state/province abbreviation, and tell-tale U.S. or Canadian postal code.

  • Email addresses and phone numbers have distinct formats.

  • Biographical categories were prefaced with keywords like "Education:" and "Organizations:".

My solution would need to be an iterative approach that first identified what was certain.  Then I could infer and test for what was before and after that point of certainty.  
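
To make those points of certainty concrete, here is a sketch of the kinds of patterns involved, written as Python regular expressions purely for illustration; the project itself used MS Word macros and wildcard searches (described below), and the exact patterns here are assumptions:

    # Illustrative patterns only; the project itself used MS Word macros rather than Python.
    import re

    COMPANY_LINE = re.compile(r"^[A-Z][A-Z&.,' -]+$")   # a line entirely in capital letters
    CITY_STATE_ZIP = re.compile(
        r",\s*[A-Z]{2}\s+(\d{5}(-\d{4})?|[A-Z]\d[A-Z]\s?\d[A-Z]\d)\s*$"   # U.S. ZIP or Canadian postal code
    )
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")
    BIO_KEYWORDS = re.compile(r"^(Education|Organizations|Expertise):", re.IGNORECASE)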


SOLUTION

My concept was to insert the equivalent of XML tags around each piece of information as I identified it, starting with a general <company> and </company> to delineate the blocks pertaining to a single company, and then working my way through more specific tags (e.g., <contact info>, <bio>) to the final, most specific tags (e.g., <city-state-zip>, <education>).

  • I would do all the parsing within MS Word so I could take advantage of the paragraph breaks to help me parse.

  • I would use MS Word macros to identify the start and end of each company's entry (which might have no people or several recorded within it).

  • Within each company block, I would identify the biographic paragraph by any of its keywords:  anything above it and below a company name was contact info.

  • I would do certainty tests and insert the final tags (e.g., <name first>, <email>).

  • Finally, I would use web scraping software to extract each piece of information by its tag and put it into an MS Access database.

The advantage of the web scraping software was that it did not matter what order the tags were in, such as whether email always came after phone number:  everything within the <company> and </company> tags was extracted.  Where two phone numbers or two emails ended up in one field in the database results, they were then split into separate fields using VBA.
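
The extraction step can be pictured with a small sketch.  Assuming text already tagged as described above, something like the following (in Python here, rather than the web scraping software actually used) pulls each field out of a company block regardless of the order in which the tags appear; the tag list is illustrative:

    # Sketch only: the real extraction used web scraping software against the tagged output.
    import re

    COMPANY_BLOCK = re.compile(r"<company>(.*?)</company>", re.DOTALL)
    FIELDS = ["name first", "email", "phone", "city-state-zip", "education"]

    def extract_blocks(tagged_text):
        rows = []
        for block in COMPANY_BLOCK.findall(tagged_text):
            row = {}
            for field in FIELDS:
                # Order within the block does not matter; capture every occurrence.
                pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(field)), re.DOTALL)
                row[field] = "; ".join(m.strip() for m in pattern.findall(block))
            rows.append(row)
        return rows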

Out of the more than 25,000 contacts, about 80 had to be manually parsed.  I knew which they were in the original document because the <company> tag extraction was always correct even if the rest of the extraction failed.

Various clients:  De-duping customer records

SITUATION

When entering information into a CRM system, users often create a new customer record rather than check for an existing one.  Over time, sales details and other information become divided among the duplicates, giving an incomplete picture of the customer, with unknown consequences.

ANALYSIS

  • Ideally, duplicates should be prevented at the point of data entry.  This is not a feature of major CRM products.

  • Software designed specifically for de-duping is costly and still requires a conscientious person to decide when and how to merge records.  Address standardization services can also be expensive, and they are not cost-effective when the database is too small for volume discounts.

  • De-duping is complicated by all the variations in how an address and a name can be recorded (in addition to misspellings).

  • Unique identifiers, such as SSNs, are often not available, especially for companies.  Phone numbers and email domains are useful when available, but there can be more than one of each for a customer.

My solution would therefore need to meet four requirements:

  • Free or inexpensive.

  • Simple.

  • Provide "common denominators" among records.

  • Prioritize & organize what a human reviewer must do.


My solution has evolved over time as online resources have become available (and unavailable).  An example of the latter is the U.S. Postal Service's address lookup feature:  you are now limited to only a few free lookups each day, and its API still forces you to search for addresses one at a time.


SOLUTION

My work to prepare for human review includes creating "de-dupe" fields to identify possible matches and calculating a total confidence score.  Based on the number of possible dupes within ranges of confidence scores, I let the client choose a cut-off score below which possible dupes are not considered.

My special de-dupe fields include:

  • For businesses, email domain and the first seven digits of the phone number.

  • Building number PLUS first word in street name (ignoring any one- or two-letter directionals) PLUS ZIP5 (or entire Canadian postal code).

  • A de-dupe field for NAME with all vowels and punctuation removed, and the same for ADDRESS1 and ADDRESS2.  Besides trying to find exact matches using these fields, I also count the number of matching words to add to the confidence score.  NOTE:  Misspellings usually involve vowels, which is why I compare text with the vowels removed.

  • The standardized address returned by the U.S. Census Bureau.

The last point is a recent innovation made possible by the Census Bureau's bulk-upload feature for addresses:  upload a file of 1000 addresses (no names) and it will tell you whether it could find each address in its records, what the address-of-record is, and how close the match was (EXACT, NON-EXACT, TIE, or NO MATCH).  Of course, this is only useful for U.S. addresses.
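
As a sketch of how that bulk feature can be driven programmatically (in Python with the requests library), the batch endpoint and parameter names below follow the Census Geocoder's published interface, but they should be checked against the Bureau's current documentation before relying on them:

    # Sketch of a batch submission to the Census Geocoder; verify endpoint, limits,
    # and parameter names against the Bureau's current documentation.
    import requests

    URL = "https://geocoding.geo.census.gov/geocoder/locations/addressbatch"

    def geocode_batch(csv_path):
        """Upload a CSV of: unique id, street address, city, state, ZIP (no names)."""
        with open(csv_path, "rb") as f:
            response = requests.post(
                URL,
                files={"addressFile": f},
                data={"benchmark": "Public_AR_Current"},
                timeout=300,
            )
        response.raise_for_status()
        # The response is CSV text: one line per input address with the match flags
        # (Match/No_Match, Exact/Non_Exact/Tie) and the standardized address-of-record.
        return response.text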

By comparing my de-dupe fields as well as the contact information fields proper, I calculate a confidence score by which I can filter and prioritize possible matches.  I then provide an interface to the human reviewer (often me) to compare a "best record" to its possible duplicates.  
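
A simplified sketch of the kind of key-building and scoring involved (Python for illustration; the field names, weights, and thresholds here are arbitrary examples, not the ones used for any given client):

    # Illustration only: de-dupe keys and a naive confidence score. Weights are arbitrary.
    import re

    DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}

    def strip_vowels(text):
        """Uppercase, drop punctuation, then drop vowels (misspellings tend to hit vowels)."""
        text = re.sub(r"[^A-Z0-9 ]", "", text.upper())
        return re.sub(r"[AEIOU]", "", text)

    def address_key(address, postal_code):
        """Building number + first non-directional street word + ZIP5 (or Canadian postal code)."""
        words = address.upper().split()
        number = words[0] if words and words[0][0].isdigit() else ""
        street = next((w for w in words[1:] if w not in DIRECTIONALS), "")
        zip5 = postal_code.replace(" ", "").split("-")[0]
        return "{}|{}|{}".format(number, street, zip5)

    def confidence(rec_a, rec_b):
        """Very rough score: an exact address-key match plus shared words in the vowel-stripped name."""
        score = 0
        if address_key(rec_a["address1"], rec_a["zip"]) == address_key(rec_b["address1"], rec_b["zip"]):
            score += 50
        name_a = set(strip_vowels(rec_a["name"]).split())
        name_b = set(strip_vowels(rec_b["name"]).split())
        score += 10 * len(name_a & name_b)
        return score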

  • I identify the "best record":  the record with the most transactional data associated with it.

  • Using the interface, the user copies information from the duplicate records to the "best record" as needed.

  • I tag records to be merged away, creating a table of SURVIVOR/PURGE IDs.  If the CRM system cannot use such a table, merges can be done with update queries on the CRM database's foreign tables (see the sketch after this list).

  • From the Census results, I can also append Census block and Census tract IDs for use in applying demographic data later.
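
Where a client's CRM cannot consume a SURVIVOR/PURGE table directly, the re-pointing can be expressed as simple update queries.  A sketch using Python's built-in sqlite3 module against a hypothetical schema (the table and column names are placeholders, not any particular CRM's):

    # Sketch: re-point child records from purged IDs to their survivors, then remove
    # the purged customer rows. Table and column names are hypothetical placeholders.
    import sqlite3

    def apply_merges(db_path, survivor_purge_pairs):
        """survivor_purge_pairs: iterable of (survivor_id, purge_id) tuples."""
        conn = sqlite3.connect(db_path)
        with conn:
            for survivor_id, purge_id in survivor_purge_pairs:
                conn.execute(
                    "UPDATE orders SET customer_id = ? WHERE customer_id = ?",
                    (survivor_id, purge_id),
                )
                conn.execute("DELETE FROM customers WHERE customer_id = ?", (purge_id,))
        conn.close()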

As time allows, I am working on automating more and more of the de-duping process, including interaction with the Census Bureau web site.


