Using DuckDB to Map Overture GERS IDs to US Census FIPs Codes
Persistent Geospatial Idenfiers Let More People Work with Geospatial Data
Geospatial data remains too hard to work with. It’s the domain of specialists, out of reach for most people and organizations.
Which is a real shame, because every organization can benefit from geospatial intelligence.
Thankfully a wave of new tools, formats, and data – like DuckDB, Geoparquet, and the Overture Maps Foundation – are making GIS easier.
Persistent identifiers for geospatial things1 are a key enabler in this push to make the field more accessible. With persistent IDs we can join geospatial datasets with one another without the need for repeated, complicated conflation2.
Common IDs allow for simple column joins. And all organizations can do column joins.
I’m lucky enough to work on one such persistent idenfier system, Overture’s Global Entity Reference System (GERS), as part of my job at Precisely. There’s a GERS ID for every entity in Overture’s vast dataset. And since that dataset is open and freely available, anyone can join their data to GERS – provided they know how…
A GERS Demo with DuckDB & the US Census
Today we’re going to connect Overture’s buildings layer to the US Census’s FIPS codes. FIPS codes are used by the Census to identify US geographies (states, counties, and more). During my COVID data analysis work at PlaceIQ, we heavily relied on US county FIPS codes as a join key.
Today we’re going granular than counties and working with Census Block Groups (or CBGs), the smallest geography the Census provides data for. CBGs usually contains between 600 and 3,000 people.
Getting the Data
Let’s grab the CBG data at the Census Cartographic Boundary page. Make sure you’re in the “2023” tab and scroll down to the “Census Block Groups” section. We’re going to perform this job for a single state3, so download the shapefile for your state of choice and unzip in in your project directory.
Let’s visualize the data quickly to make sure it’s what we need. Fire up duckdb
(Install it if you don’t have it! We’ll need it later…) and convert that shapefile to a CSV:
D install spatial;
D load spatial;
D COPY(SELECT STATEFP, COUNTYFP, TRACTCE, BLKGRPCE, geom FROM ST_Read('cb_2023_08_bg_500k.shp')) to 'colorado_cbgs.csv';
Don’t worry about the incantation details above, we’ll cover them in a moment.
Visit Kepler.gl and drop in the resulting CSV:
Looks good!
Though we should note Kepler is making a classic FIPS mistake. FIPS codes are strings, not numbers. Their leading zeroes matter. Kepler is checking the first few records of the data and attempting to infer column types, but it’s interpretting the “08” code for the state of Colorado as an integer and reducing it to “8”. This is a common mistake in many tools (Microsoft Excel only started asking users this year about such conversions).
For the Overture data, we’re going to use the Overture Python command-line tool. Grab a bounding box for Colorado4 and run the following command from our project directory:
$ overturemaps download --bbox=-109.03,36.97,-102.02,41.01 -f geoparquet --type=building -o co-buildings.parquet
A few moments later we have 2,648,232 building polygons.
The Join
We’re going to use DuckDB and its Spatial extension to perform the join5. Here’s the query to do so:
Let’s break this down, line by line:
- Lines 1-2: We install and load the
spatial
extension. Installing only needs to be done once in your environment. So if you havespatial
installed, you can skip this line. - Line 3: The
COPY
command exports the results of our following query to an external file, in this casecolorado_buildings_gers_to_fips.csv
, which we specify on line 11. - Line 5-6: We’re want two columns for each respective identifer, GERS and FIPS. GERS is simple, but FIPS is an identifier with multiple components making up the entire blockgroup identifier. Here, we’re concatoning all of these columns into a single FIPS blockgroup ID.
- Line 7: We downloaded our Overture data as a Geoparquet file, but DuckDB doesn’t yet support this variant. As a result, we have to convert the
building_geom
information from a WKT blob (a format for storing geometry data) to a format DuckDB can parse, in this case itsGEOMETRY
type, using theST_GeomFromWKB
6 function. If we don’t do this, DuckDB won’t be able to perform the spatial analysis necessary for our join. - Line 8-9: Our datasources. We can load the Overture file directly, but we need to use the
Spatial
functionST_Read
to pull in the CBG shapefile. - Line 10: Our join. Because we’re joining two geospatial polygons, we don’t need a super complex conflation pipeline. We just need to check if a building is contained within a block group shape. There’s a little nuance here, though, since buildings on the edges of a block group shouldn’t fall into multiple block groups. So our naive methdology is to find the center point of the building shape with the
ST_Centroid
function, then check if that point is contained within a block group shape with theST_Contains
function. A more complex conflation routine might check the amount each building overlaps with a block group, allowing it to assign a building to the block group it most overlaps with, but for our purposes this is more than good enough.
In my environment, with 12 cores and 16GB of RAM, this query took about 2 minutes to run. Check it out:
We now have a CSV file mapping Census FIPS to Overture Building GERS IDs. If we wanted to enrich Overture buildings with Census data, we could repeatedly use this file as a join table.
We can share it with others so they don’t have to wrangle the data, letting them accept either FIPS-keyed or GERS-keyed data without touching a spatial function. (Heck, if you email me I’ll send you the file…)
Our exercise here is rather simple, but for more complex conflations – like joining points-of-interests – the benefits of common ID systems are even more pronounced.
Interchange formats are powerful and make fields more accessible and marketplaces bigger. The geospatial world has so much potential energy that’s about to be unlocked.
-
Geospatial entities are buildings, parcels, roads, houses, cities, counties, parks, billboards, water fountains, tennis courts, or any other thing you could pin down on a map. ↩
-
Conflation is the process of matching two similar, but often not identical, datapoints which describe the same real-world thing. A common conflation pipeline for points-of-interest will compare coordiante pairs, names, addresses, phone numbers, and other attributes to determine if two records are the same. This process is complicated, error-prone, and time-consuming. A data scientist at PlaceIQ used to lament, “You’d be amazed how many different ways people manage to spell ‘Walmart’.” ↩
-
If you do want a full national join, check your RAM. The Overture Buildings export for the entire United Space clocks in at >24 gigs – that’s ~174 million buildings! If you are in need of a national CBG join, it’s probably wiser to use a tool like BigQuery. ↩
-
Colorado is a common location for geospacial data demos…because it’s a rectangle. A standard bounding box can almost perfectly define it. ↩
-
DuckDB’s Spatial extension is an absolute gift. It wires into the holy trinity of OSS geo: GEOS, GDAL, and PROJ. And it installs all of those with a single command:
INSTALL spatial;
. If you’re relatively new to GIS or just arriving, I cannot beging to convey how groundbreaking this is. ↩ -
The
ST_
prefix is a common convention in spatial databases. It stands for “Spatial Type” and is used to denote functions that operate on spatial data. ↩