For sales and
marketing purposes, I was asked to get population and land
area figures for every U.S. city of a particular population. The U.S. Census Bureau provides information
for "Cities and Towns" by state, but the data was not
fully reliable and was incomplete for my project, as my analysis revealed:
- Census data by
political entity sometimes reported partial data for a town, with
the remainder reported under the town's "Census Designated
Place" (CDP). Example: Waterford, Connecticut in
2010. There is only one town in
Connecticut with that name, incorporated or otherwise, but its population is
divided between two versions of "Waterford" (see graphic
below). Complicating matters, CDP names and geographic
definitions did not always correspond to a town, so simply
combining CDP and town data for the same "town" was not a reliable option.
In other instances, the CDP is an unincorporated area that needed to be
included in my results as its own entity, so I could not always exclude CDP data.
- In some states, a
town, township, or city could have political subunits that have
purchasing power. In New York State, for example, these were
hamlets and villages. Due to the nature of my client's
marketing, they wanted these subunits reported in addition to the
total data for their parent entity. However, the Census only
reports data for their parent entity at the "Cities and
Towns" level. Detailed searches for these political subunits
on the Census web site were impractical: you would have to know the
name of the larger political unit to search for them.
For these and
other reasons, I had to use another source to make up for the
shortcomings of the Census data. Specifically:
My solution was
to web scrape Wikipedia. By state, I would compile a database of U.S.
county subunits and independent cities down to their lowest
level. Where the Census Bureau reported two "towns" of
the same name (one an actual town, one a CDP), I would use the information from
Wikipedia, which would report data for the actual political entity.
While some may
debate the trustworthiness of Wikipedia, the esoteric nature of my
subject made deliberate errors unlikely. As spot-checking and
post-search comparisons would reveal, Wikipedia did provide reliable data.
Shown right is an
example of the structured data Wikipedia provided for web scraping the
population, land area, and county information of a town.
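To illustrate what extracting those infobox fields can look like, here is a minimal sketch using only Python's standard library. The HTML below is a simplified stand-in for a real Wikipedia page, and the field labels ("County", "Land", "Population") are illustrative; actual infobox labels vary by template, which is exactly the variation described later.

```python
# Sketch: pull labeled fields out of infobox-style HTML.
# SAMPLE_INFOBOX is a simplified stand-in, not a real Wikipedia page.
from html.parser import HTMLParser

SAMPLE_INFOBOX = """
<table class="infobox">
<tr><th>County</th><td>New London</td></tr>
<tr><th>Land</th><td>32.7 sq mi</td></tr>
<tr><th>Population</th><td>19,517</td></tr>
</table>
"""

class InfoboxParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = {}       # label -> value
        self._cell = None    # "th" or "td" while inside a cell
        self._label = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._cell = tag
            self._buf = []

    def handle_data(self, data):
        if self._cell:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._cell:
            text = "".join(self._buf).strip()
            if tag == "th":
                self._label = text        # row label
            elif self._label:
                self.rows[self._label] = text   # row value
                self._label = None
            self._cell = None

parser = InfoboxParser()
parser.feed(SAMPLE_INFOBOX)
print(parser.rows)
# {'County': 'New London', 'Land': '32.7 sq mi', 'Population': '19,517'}
```

A production version would fetch live pages and normalize the label variants, but the label/value pairing shown here is the core of the scrape.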
The process was
an iterative one: I web scraped the Wikipedia "List of
Cities" and "List of Townships" articles for each U.S.
state to get the links to the individual entity articles, then I web
scraped those articles for the population and land area data. I
also recorded county information to account for multiple towns of the
same name in the same state. Where there were gaps in my results
due to variations in template layout for an article, I composed new
scraping commands, sometimes having to hand-copy information into the database.
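The first pass of that iteration can be sketched as follows: harvest per-town article links from a "List of Cities" style page so that each article can be scraped in turn. The sample HTML and the namespace filter are illustrative assumptions, not my original code; a live run would fetch each state's list article with an HTTP client.

```python
# Sketch: collect article links from a "List of Cities" style page.
# SAMPLE_LIST is an illustrative stand-in for a real list article.
from html.parser import HTMLParser

SAMPLE_LIST = """
<ul>
<li><a href="/wiki/Waterford,_Connecticut">Waterford</a></li>
<li><a href="/wiki/Groton,_Connecticut">Groton</a></li>
<li><a href="/wiki/Help:Contents">Help</a></li>
</ul>
"""

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            # Keep only article links; a ":" marks non-article
            # namespaces such as Help: or File:.
            if href.startswith("/wiki/") and ":" not in href:
                self.links.append("https://en.wikipedia.org" + href)

parser = LinkParser()
parser.feed(SAMPLE_LIST)
for url in parser.links:
    print(url)
```

Each collected URL then feeds the second pass, where the individual article's infobox is scraped for population, land area, and county.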
Random comparisons of
Wikipedia data to U.S. Census data for towns large and small
proved Wikipedia's reliability.
A comparison of the end
result with Census data found no unexplainable discrepancies in
the political entities reported, although Wikipedia did return
more levels (as explained earlier).
I also found no
anomalies in the Wikipedia data, and whenever a source was
cited, it was the U.S. Census Bureau.
The process took days, but I created a database of 29,671 "towns" in
the 50 United States. This included unincorporated areas, counties with no
subdivisions, and incorporated subdivisions of towns. After
performing data clean-up, I was able to finalize my
database of U.S. towns:
When the Census reported both a town and
a "Census Designated Place" (CDP) of the same name, I used
Wikipedia's data when it reported a higher population than the Census
Bureau's town data.
Otherwise, I used the Census Bureau's data.
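That reconciliation rule can be sketched as a small function. The function name and the use of `None` for a missing CDP record are my illustrative choices, not the actual schema:

```python
# Sketch of the town/CDP reconciliation rule: when the Census splits a
# town across a town record and a same-named CDP record, prefer the
# Wikipedia figure only if it exceeds the Census town figure.
def pick_population(census_town_pop, census_cdp_pop, wikipedia_pop):
    """Return (source, population) for one town.

    census_cdp_pop is None when no same-named CDP exists.
    """
    if census_cdp_pop is not None and wikipedia_pop is not None:
        # Town split across two Census records.
        if wikipedia_pop > census_town_pop:
            return ("wikipedia", wikipedia_pop)
    return ("census", census_town_pop)

# Waterford-style split: Wikipedia reports the whole town.
print(pick_population(14000, 5500, 19517))   # ('wikipedia', 19517)
# No CDP split: Census data stands.
print(pick_population(19517, None, 19400))   # ('census', 19517)
```

The point of the guard is that Wikipedia only overrides the Census when the split-record situation makes the Census town figure demonstrably partial.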
There were no
fees for the sources and methods I used, but I did become a financial
contributor to the Wikimedia Foundation after this project.