On Authentic Data
The emergence of the term “authentic data” highlights a new problem
I’ve been hearing the term “authentic data” more and more these days. At the GeoBuiz summit last week, it was uttered often enough to hit critical mass in my head, helping me realize its novelty.
Let’s quickly define it:
Authentic Data
Data that is collected from real-world events, interactions, or observations. Authentic data is *not* artificially generated or manipulated by LLMs or other automated mechanisms.
The rise of “authentic data” illustrates a new concern: the emergence of “generated data”, data created not from observations but from machine learning and AI models. It’s a necessary distinction – one we didn’t have to make until recently – though its importance will vary by use case.
Generated Data
Data that is artificially created, often by AI models, and used to augment authentic datasets or simulate real-world scenarios.
There is a growing wariness of “generated data” among data analysts and enterprises. Just as there are fears that AI slop will poison the internet, rendering it difficult to use for both humans and machines, there is anxiety that generated data will undermine analyses and lead to poor decisions.
Data provenance concerns are not new. When we were building PlaceIQ, prospective clients would regularly ask us to include “raw data,” a supposedly natural state representing data at the moment it was created. However, in practice, the line where “raw” data becomes “cooked” was often fluid.
Data is created, not handed down by god.
At PlaceIQ, our understanding of movement in the real world could be traced back to the signals collected from GPS, Bluetooth, and cellular antennas. From there, the operating systems – iOS and Android – would assemble a best guess at a coordinate pair, which an SDK or application would selectively log. We’d interpret these streams as anonymized visitation and roll this up into packaged datasets. Where in this pipeline “raw data” existed meant something different depending on the client. Its definition was always a bit of a vibe, reflecting the mental model each buyer had for how the data was created and what it represented.
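To make the "where does raw begin?" question concrete, here is a minimal sketch of the pipeline above as a list of stages. The stage names, the `Stage` class, and the `generated` flag are all hypothetical, invented for illustration; this is not how PlaceIQ or any vendor actually modeled its data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    producer: str
    # Hypothetical flag: True if a model synthesized this stage's output
    # instead of deriving it from observations.
    generated: bool = False


# Illustrative stages, loosely following the pipeline described above.
PIPELINE: List[Stage] = [
    Stage("radio_signals", "GPS, Bluetooth, and cellular antennas"),
    Stage("os_location_fix", "iOS or Android"),
    Stage("app_log", "SDK or application"),
    Stage("visitation", "data vendor"),
    Stage("packaged_dataset", "data vendor"),
]


def lineage(pipeline: List[Stage], raw_at: str) -> List[Stage]:
    """Return every stage up to and including the one a buyer calls 'raw'."""
    names = [stage.name for stage in pipeline]
    if raw_at not in names:
        raise ValueError(f"unknown stage: {raw_at}")
    return pipeline[: names.index(raw_at) + 1]


def is_authentic(pipeline: List[Stage]) -> bool:
    """Under this sketch, a dataset is authentic only if no stage is generated."""
    return not any(stage.generated for stage in pipeline)


# One buyer's "raw" is the app log, which already sits on top of two earlier steps.
steps = [stage.name for stage in lineage(PIPELINE, "app_log")]
print(steps)
print(is_authentic(PIPELINE))
```

Two buyers asking for “raw data” could point at different entries in this list, and each would inherit a different amount of upstream processing. The `generated` flag is a separate axis: a dataset can be heavily cooked and still authentic under this sketch.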
But “raw data” cannot adequately represent AI slop anxieties. Hence the rise of “authentic data”, a wholly new term which will certainly influence the data ecosystem. I expect data products will need to convincingly document their provenance, and sales materials will need to provide clear narratives supporting the “authentic” status of each dataset (because it will be nearly impossible to discern objectively). There will surely be scandals in which data producers pass off generated data as an “authentic” product, validating industry concerns.
This is a term to watch. And don’t forget:
If you want to know where the future is being made, look for where language is being invented and lawyers are congregating.