Extreme Compression with AI: Fitting a 45-Minute Podcast into 40KB
Way back in 2018, my friend Pete Warden wondered if data compression would be machine learning’s killer app. Looking back, in the age of ChatGPT, this thought seems positively quaint. Artificial intelligence boosters are talking about how AI will generate new medicines, new materials, and new content. But the idea of using AI to compress data is still a powerful one.
I touched briefly on this notion last April:
The only thing that isn’t big about LLMs is the filesize of the model they output. For example, one of Meta’s LLaMA models was trained on one trillion tokens and produced a final model whose size is only 3.5GB! In a sense, LLMs are a form of file compression. Importantly, this file compression is lossy. Information is lost as we move from training datasets to models. We cannot look at a parameter in a model and understand why it has the value it does because the informing data is not present.
(Sidenote: this file compression aspect of AI is a generally under-appreciated feature! Distilling giant datasets down to tiny files (a 3.5GB LLaMA model fits easily on your smartphone!) allows you to bring capabilities previously tied to big, expensive, remote servers to your local device! This will be game-changing.)
I found myself thinking about compression yesterday while reading about OpenAI’s DevDay Announcements. Way down at the bottom (below all the exciting stuff) is a new text-to-speech (TTS) API and an updated version of Whisper (OpenAI’s speech recognition model). Neither application is particularly novel (you can take your pick of TTS models over at Hugging Face), but the two announcements taken together suggest a comically extreme audio compression pipeline.
With an API account and less than a dollar in credits, we can transcribe an audio file into a text file, then generate speech from said text file. If we only transmit the text file over the network and run our TTS model at the edge, our bandwidth savings are monumental.
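Here’s a minimal sketch of that round trip with OpenAI’s Python client (filenames are placeholders; note two real-world limits: the Whisper endpoint caps uploads at 25MB, so a long podcast may need re-encoding or splitting first, and the TTS endpoint caps input at 4,096 characters, hence the chunking):

```python
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1 (sender side): compress the audio into text with Whisper.
# The API caps uploads at 25MB, so a 45-minute MP3 may need a
# lower-bitrate re-encode or splitting before this step.
with open("in_our_time.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
Path("transcript.txt").write_text(transcript.text)

# Step 2 (receiver side): rehydrate the text back into speech.
# The TTS endpoint accepts at most 4,096 characters per request,
# so we split the transcript. (A real pipeline would split on
# sentence boundaries rather than this naive fixed-width cut.)
text = Path("transcript.txt").read_text()
chunks = [text[i : i + 4096] for i in range(0, len(text), 4096)]

for i, chunk in enumerate(chunks):
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=chunk,
    )
    response.stream_to_file(f"rehydrated_{i:03d}.mp3")
```

Stitching the per-chunk MP3s back into one file is left as an exercise; only `transcript.txt` ever needs to cross the network.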
I ran the pipeline on an episode of my favorite podcast, In Our Time, and obtained the following file sizes for each step:
The transcription text file is just 0.08% the size of the original audio file.
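To put that ratio in concrete terms (the ~50MB original below is an assumption, back-derived from the 0.08% figure and the ~40KB transcript implied by the title):

```python
original_mp3_bytes = 50 * 1024 * 1024  # assumed ~50MB, plausible for a 45-minute MP3
transcript_bytes = 40 * 1024           # ~40KB, per the title

print(f"{transcript_bytes / original_mp3_bytes:.2%}")  # -> 0.08%
```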
And the output doesn’t sound terrible!
Here’s Melvyn Bragg in the original:
And OpenAI’s “alloy” voice in the TTS output:
Look, I’m going to take Melvyn’s cadence and voice any day of the week, but the TTS output is very, very listenable. And in extreme situations where bandwidth is severely limited and only available for short windows (think: the International Space Station, Antarctic labs, cruise ships, battlefield frontlines, etc.) this kind of compression could be a game-changer.
And there’s nothing to stop us from taking this further. We could deploy customized voice models at the edge, trained on specific speakers. We could add diarization and use different voices for different speakers. Rehydrating text into audio could take different approaches for different sources (like speeding up a slow speaker or emoting a flat one).
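As a sketch of the diarization idea: if the transmitted transcript carried speaker labels (the `SPEAKER_n:` tag format and the voice mapping here are assumptions, not anything Whisper produces today), the edge device could rehydrate each turn with a different voice:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical mapping from diarized speaker labels to OpenAI TTS voices.
VOICES = {"SPEAKER_0": "alloy", "SPEAKER_1": "onyx", "SPEAKER_2": "nova"}

def rehydrate_diarized(transcript_path: str) -> None:
    """Render each 'SPEAKER_n: text' line of a diarized transcript
    with the voice assigned to that speaker."""
    with open(transcript_path) as f:
        for i, line in enumerate(f):
            label, _, text = line.partition(":")
            if not text.strip():
                continue  # skip blank or unlabeled lines
            response = client.audio.speech.create(
                model="tts-1",
                voice=VOICES.get(label.strip(), "alloy"),  # fall back to a default voice
                input=text.strip(),
            )
            response.stream_to_file(f"turn_{i:04d}.mp3")
```

The per-source rehydration tweaks (pacing, emotion) would slot in at the same point, as parameters chosen per speaker before the TTS call.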
There’s a multitude of possibilities in just this one specific niche. How might we expand it to images or video? What other use cases are ripe for applying this type of extreme compression?