No, 80% of data isn’t spatial (and why that is a good thing)
Aug 23, 2024
Earlier this year I wrote a post addressing the ubiquitous claim in the geospatial industry that 80% of data contains a spatial component. The problem with this statement is that there is no single source for the statistic, even though it has been cited widely for years. Some have tried to track it down, but I wanted to do some digging of my own. It seems that the trail ends here:
The article by Williams lists no sources or any indication of where the number comes from. However, a little digging into GEOMAX reveals that the program was developed in 1985 by two academics at the University of Florida. John Alexander, a professor of urban and regional planning, and Paul Zwick, a research scientist, were behind the effort to digitize maps for Alachua County, so perhaps the knowledge of where the phrase originated lies with them?
However, after I shared that article I got an anonymous tip claiming that someone actually made the number up on a panel. Jack Dangermond was also on that panel and continued to use the quote from that point forward. So it seems the urban legend of this statistic lives on.
But the good news is that we can actually test this statement against real data, and that is what I did this past week. I went through several open data portals, each with a large sample of datasets and search capabilities that made it easy to evaluate the claim.
Not all spatial data is created equal
One of the other things about the 80% claim that has made me skeptical is the term "spatial component". This could mean anything from a highly detailed raster cell up to the name of a country, and everything in between. I get it: a country name is a location and it can be mapped. But for me there is a difference between what I would call actionable spatial data and non-actionable spatial data. Actionable means a location that is either already in a spatial file type or that can be turned into a granular geometry such as a point, line, or polygon (an address string is a good example of this; the name of a city is not). A rough sketch of how I apply this split to column headers follows the lists below.
To be more specific:
Actionable
- Point
- Line
- Polygon
- Raster
- Address string
- Latitude/longitude float
- Census ID string (i.e. Census Tract ID)
Non-actionable
- Country name
- City name
- State/province name
- Postal code
- Place/feature name (i.e. Mississippi River)
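To make this split concrete, here is a tiny, purely illustrative heuristic of the kind I have in mind when I look at column headers. The hint lists are my own shorthand and nothing any portal actually exposes, so treat it as a sketch rather than a real classifier.

```python
# Illustrative only: a crude way to bucket column headers into the
# "actionable" vs "non-actionable" split described above.
ACTIONABLE_HINTS = {
    "geometry", "geom", "wkt", "latitude", "longitude", "lat", "lon", "lng",
    "address", "geoid", "census_tract",
}
NON_ACTIONABLE_HINTS = {
    "country", "city", "state", "province", "postal_code", "zip", "place_name",
}

def classify_column(name: str) -> str:
    """Return 'actionable', 'non-actionable', or 'unknown' for a column header."""
    key = name.strip().lower().replace(" ", "_")
    if key in ACTIONABLE_HINTS:
        return "actionable"
    if key in NON_ACTIONABLE_HINTS:
        return "non-actionable"
    return "unknown"

print(classify_column("Latitude"))   # -> actionable
print(classify_column("country"))    # -> non-actionable
print(classify_column("site_name"))  # -> unknown (exact matches only)
```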
So armed with that approach I went out to several of the major data portals to see what I could find. Let’s start with the places with the least spatial data and work our way up.
Kaggle
Kaggle is one of the most popular websites for data science competitions, and along with that comes data that can be used for a variety of data science projects, including image classification datasets alongside tabular data. The amount of raster data on Kaggle was negligible, so I focused on the tabular data.
My first approach was to use their API to search through the various datasets, but this would require looping through all 368,945 datasets: downloading each one, skipping the non-tabular files, loading the tabular files, reading the headers, and checking which ones had a column with a location component. I got through a few hundred before I realized this would take way too long.
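For the curious, here is roughly what that brute-force loop looked like, sketched with the official kaggle Python package. The exact attribute names and paging behavior may differ slightly between package versions, and files sometimes arrive zipped, which I am glossing over here.

```python
# Sketch of the brute-force approach: page through every dataset, grab a CSV,
# and check its header for a location column. Assumes the `kaggle` package is
# installed and ~/.kaggle/kaggle.json is configured.
import pandas as pd
from kaggle.api.kaggle_api_extended import KaggleApi

SPATIAL_COLUMNS = {"latitude", "longitude", "lat", "lon", "lng", "address", "geometry"}

api = KaggleApi()
api.authenticate()

spatial, tabular = 0, 0
page = 1
while True:
    datasets = api.dataset_list(page=page)
    if not datasets:
        break
    for ds in datasets:
        files = api.dataset_list_files(ds.ref).files or []
        csvs = [f.name for f in files if f.name.endswith(".csv")]
        if not csvs:
            continue  # skip non-tabular datasets
        tabular += 1
        api.dataset_download_file(ds.ref, csvs[0], path="/tmp/kaggle")
        try:
            header = pd.read_csv(f"/tmp/kaggle/{csvs[0]}", nrows=0)  # headers only
        except Exception:
            continue  # zipped or unreadable file; skipped in this sketch
        if SPATIAL_COLUMNS & {c.lower() for c in header.columns}:
            spatial += 1
    page += 1

print(f"{spatial} of {tabular} tabular datasets have a spatial column")
```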
Using their search feature instead, I found that there are 170,553 CSV files, and about 8,117 of them have some location component, with some possible overlap (meaning one file could have both a latitude/longitude column and an address column).
Total spatial data: 4.8%
Australia Open Data
Most of the open data portals made it easy to see which files were spatial in nature, since you can filter by file extension. ZIP files are tricky since they might be Shapefiles, so I had to determine whether a portal used ZIP to tag Shapefiles or used a separate tag.
In total I found 28,640 spatial files out of 107,908 comparable datasets.
Total spatial data: 26.5%
UK Open Data
The UK Open Data site made it a bit harder to filter through the data since you can only select one data type at a time. Nonetheless, here are the results:
- 14,044: CSVs
- 2,654: Excel
- 2,801: ArcGIS Rest Services
- 4,854: GeoJSON
Counting the ArcGIS REST services and GeoJSON files as spatial, that is 7,655 out of 24,353 datasets.
Total spatial data: 31.4%
GitHub
On GitHub you can search by path extension. In this case I limited it to GeoJSON and CSV, since GitHub is a code platform rather than a data platform, even though lots of data files inevitably make their way there.
- CSV: 46.9M
- GeoJSON: 27.1M
Total spatial data: 36.6%
Data.gov
Now comes one of the trickiest ones to work with, and the one I probably spent the most time looking into. Data.gov is a U.S. website that pulls together open data from federal, state, and local governments into one single location. At first glance it seemed quite easy to figure out which datasets are spatial, because there is a single filter tag called "geospatial".
The weird thing is that it is the only tag in the "Dataset Type" category. Once you click it you get 231,187 files out of 291,073, which gets you 79.4%, so:
Total spatial data: 79.4%
Hold on just one minute. That was almost too easy. Here is one of the issues with Data.gov: there are a ton of datasets with the file extension XML, many of which lead to websites that may or may not contain spatial data. On top of that, one dataset may be listed in more than one format, which makes it easy to double count.
So what can we do? Well, after some digging I found that Data.gov has an API that, while lacking great documentation, lets us query by chaining many file types together. For example, here are all the different file formats returned by the API:
https://catalog.data.gov/api/3/action/package_search?facet.field=["res_format"]
And here is the search I used to filter by just the spatial types:
https://catalog.data.gov/api/3/action/package_search?fq=res_format:TIFF+OR+WMS+OR+ESRI REST+OR+XYZ+OR+KML+OR+GeoJSON+OR+WFS+OR+QGIS+OR+GML+OR+WCS+OR+NETCDF+NOT+XML+NOT+CSV+NOT+XLS+NOT+NULL+NOT+XLSX
This excludes CSV, Excel, and XML, but it would not exclude a dataset that lists one of those formats alongside one of the specified spatial types. This gives us:
- 291,072: Total files
- 145,761: Spatial files
Total spatial data: 50.1%
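For reference, here is a minimal sketch of how those two API calls can be strung together with requests. The fq string mirrors the URL above, and I am not handling pagination, retries, or the double-counting caveat.

```python
# Sketch of the Data.gov (CKAN) package_search queries used above.
import requests

BASE = "https://catalog.data.gov/api/3/action/package_search"

# 1. Facet on resource format to see every file type and its frequency.
formats = requests.get(BASE, params={
    "facet.field": '["res_format"]',
    "facet.limit": -1,
    "rows": 0,
}).json()["result"]["facets"]["res_format"]
print(sorted(formats, key=formats.get, reverse=True)[:10])  # most common formats

# 2. Count datasets with at least one spatial format, excluding the ambiguous
#    tabular/XML formats (same filter as the URL above).
spatial_fq = (
    "res_format:TIFF OR WMS OR ESRI REST OR XYZ OR KML OR GeoJSON OR WFS "
    "OR QGIS OR GML OR WCS OR NETCDF NOT XML NOT CSV NOT XLS NOT NULL NOT XLSX"
)
spatial = requests.get(BASE, params={"fq": spatial_fq, "rows": 0}).json()["result"]["count"]
total = requests.get(BASE, params={"rows": 0}).json()["result"]["count"]

print(f"{spatial:,} spatial of {total:,} total datasets ({spatial / total:.1%})")
```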
EU Open Data
The European Union Open Data portal made it a bit easier to search across multiple formats and select more than one option at a time, although, as with Data.gov, there could be some overlap.
- 889,197: Non-spatial files
- 1,053,716: Spatial files
Total spatial data: 54.2%
BigQuery Public Datasets
BigQuery made it easy to get the most accurate count, both because there is not a ton of data and because I could write a simple Python script to query every single table and see if there was a geometry or lat/long column in the data.
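Here is roughly what that script looked like, assuming the google-cloud-bigquery client library and a billing project of your own. The lat/long column names are just the heuristics I checked for, and walking every table is slow, but fine as a one-off.

```python
# Sketch: walk every table in the bigquery-public-data project and flag a
# dataset as spatial if any table has a GEOGRAPHY column or lat/long columns.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default credentials/project for billing
LATLON_NAMES = {"latitude", "longitude", "lat", "lon", "lng"}

spatial, non_spatial = 0, 0
for dataset in client.list_datasets(project="bigquery-public-data"):
    has_spatial = False
    for table in client.list_tables(dataset.reference):
        schema = client.get_table(table.reference).schema
        if any(
            field.field_type == "GEOGRAPHY" or field.name.lower() in LATLON_NAMES
            for field in schema
        ):
            has_spatial = True
            break
    if has_spatial:
        spatial += 1
    else:
        non_spatial += 1

print(f"{spatial} spatial datasets, {non_spatial} non-spatial datasets")
```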
- 187: Spatial datasets
- 147: Non-spatial datasets
Total spatial data: 55.9%
New Zealand Public Data
Finally, New Zealand made it easy to test this out just using their website and came out on top with the highest percentage of spatial data.
- 19,663: Spatial datasets
- 34,032: Total datasets
Total spatial data: 57.8%
Others
I am still trying to make use of the dataset that describes all the data in the Canadian Open Data Portal, and I have a request out to the Google Dataset Search team, since their search is text-only and seems to stop at around 150 results on the website. That could be the best test of all, since they claim to index 25 million open datasets.
What does this all mean?
Well, the first thing is I think we can put to rest the notion that 80% of data contains a spatial component. Given all the information I have been able to find, that is a vast oversimplification. It seems like one of those urban myths that has no basis in reality.
But the good news is that, after looking at all of these different public datasets and the information that's available online, there is still a lot of spatial data to use. Even though the total percentage is lower than claimed, the data I counted here has a truly actionable spatial component. That means it can be immediately used in advanced spatial analysis, visualized, and applied across many different use cases and technologies.
The point is that spatial data is still widely available, and very good spatial data is available in fairly large volumes. If we can take anything away from this, it is that there is absolutely a need for skills and individuals that can create real insights from this kind of data.
The challenge this poses is that we, as spatial practitioners, have to do a much better job of making use of this data and bringing it into a wider environment. The fact that only about 5% of the data on Kaggle is spatial tells me that it is not being widely used across many different fields.
So that is the challenge in front of us. If you're interested in learning about how to use spatial data and perform spatial analytics at scale, make sure to check out the Spatial Lab, which is my membership community with lots of individuals working on these types of problems and others. We're in the middle of a data engineering sprint for geospatial data, so now is a great time to sign up and join the Lab.