We’re finally coming to terms with the idea that foundation LLMs have hit a wall.

Thanks to decades of data creation and graphics innovation, we advanced incredibly quickly for a few years. But we’ve used up these accelerants and there’s none left to fuel another big leap. Our gains going forward will be slow, incremental, and hard-fought.

Reviewing the history of machine learning, we can understand both how the field advanced so quickly and why LLMs have hit a wall.

A sample of ImageNet images, organized by their embeddings

The Internet Created Big, Open Datasets That Led to Breakthroughs

Data is a dependency for machine learning and AI progress.

In most computer programming, you explicitly write down the instructions that define your program. But with machine learning, we point a program at a pile of data and ask it to figure it out. The software comes up with rules, in the form of a model, which we then use to process new bits of data.

Sure, we’re glossing over the details, but this general pattern illustrates how machine learning – which includes LLMs – is not only limited and enabled by software and hardware, it is also limited and enabled by data. If there’s not much data or the data is of poor quality, the rules defined by machine learning software will be garbage.
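To make the contrast concrete, here is a minimal sketch in Python. The spam example, the tiny dataset, and every function name are hypothetical and chosen only for illustration; real machine learning models are far more elaborate, but the division of labor is the same: in one case the programmer writes the rule, in the other the program derives it from labeled data.

```python
# Explicit programming: the programmer decides the rule.
def is_spam_explicit(exclamation_marks: int) -> bool:
    return exclamation_marks > 3


# "Machine learning" in miniature: the program picks the rule from data.
def learn_threshold(examples: list[tuple[int, bool]]) -> int:
    """Return the threshold that classifies the most examples correctly."""
    candidates = sorted({count for count, _ in examples})

    def correct(threshold: int) -> int:
        return sum((count > threshold) == label for count, label in examples)

    return max(candidates, key=correct)


# Hypothetical labeled data: (exclamation marks in a message, is it spam?)
training_data = [(0, False), (1, False), (2, False), (5, True), (7, True), (9, True)]

# The "model" is just the learned threshold.
threshold = learn_threshold(training_data)


def is_spam_learned(exclamation_marks: int) -> bool:
    return exclamation_marks > threshold
```

If the labels in `training_data` were noisy or sparse, `learn_threshold` would still return a rule, just a worse one, which is the "garbage in, garbage out" point above.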

Because access to data is a governor for machine learning, it’s possible to recount the history of machine learning by touching on three key datasets: MNIST, ImageNet, and Common Crawl. Each allowed for a major breakthrough to occur, proving the viability of machine learning and drawing countless new developers and investors into the field.

MNIST: A Small, Specialized Dataset Originally Delivered By Mail

Back in 1994 –– when the Internet was nascent and websites numbered in the thousands –– the National Institute of Standards and Technology published a dataset of handwritten digits, which they distributed on two CD-ROMs. At the time, the dataset was a goldmine, but it wasn’t perfect. Yann LeCun – who had been working on neural networks that could read handwritten numbers at Bell Labs – tweaked the original NIST dataset to produce a more representative mix of samples, pre-formatted for neural network usage: numbers were centered into 28x28 pixel images, anti-aliased, and divided into ‘test’ and ‘train’ subsets.
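As a rough illustration of that kind of formatting, the sketch below pastes a small glyph into the middle of a blank 28x28 canvas and splits a set of samples into train and test subsets. It is a simplification under stated assumptions: the helper names, the toy glyph, and the 20% test fraction are all hypothetical, and it centers by bounding box, whereas MNIST itself centered digits by their center of mass.

```python
import random

CANVAS = 28  # MNIST images are 28x28 pixels


def center_on_canvas(glyph: list[list[int]]) -> list[list[int]]:
    """Paste a small grayscale glyph into the middle of a blank 28x28 canvas."""
    height, width = len(glyph), len(glyph[0])
    top = (CANVAS - height) // 2
    left = (CANVAS - width) // 2
    canvas = [[0] * CANVAS for _ in range(CANVAS)]
    for r, row in enumerate(glyph):
        for c, value in enumerate(row):
            canvas[top + r][left + c] = value
    return canvas


def split_train_test(samples: list, test_fraction: float = 0.2, seed: int = 0):
    """Shuffle a copy of the samples and split it into (train, test) lists."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - test_size
    return shuffled[:cut], shuffled[cut:]


# A hypothetical 3x2 "digit" standing in for a scanned handwriting sample.
glyph = [[255, 255],
         [0, 255],
         [0, 255]]
image = center_on_canvas(glyph)
train, test = split_train_test([image] * 10)
```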

MNIST illustrates what good data looked like before the rise of the Internet. Government departments had both the rare budget and the access to assemble a dataset from handwritten digits sampled from Census employees and high schoolers. The data was distributed via post, on multiple CD-ROMs. The datasets needed to build models capable of turning machine learning into an industry didn’t exist yet.

The new dataset, MNIST (modified NIST), was too large for LeCun’s existing software. So he wrote a new version tailored for the dataset that delivered a groundbreaking error rate of 0.8%, a watershed moment for machine learning in industry. AT&T used the software to read more than 10% of all checks deposited in the US at the time.

"Hello MNIST."

Training a network off of MNIST became the "Hello World" of machine and deep learning. The dataset is included in nearly every machine learning framework and is frequently featured as the first project in many textbooks.

ImageNet: More Than a Million Images From and Categorized By Internet Users

14 years later –– smack-dab in the middle of the Web 2.0 era –– Fei-Fei Li, a computer science professor at Princeton, “became obsessed with an estimate by vision scientist Irving Biederman that the average person recognizes roughly 30,000 different kinds of objects.” Li wondered if she could build a comprehensive image dataset for neural network training. She began working on ImageNet, choosing categories from another dataset, WordNet, to build a shopping list for images.

To build ImageNet, Li hired workers on Amazon’s crowdsourcing platform, Mechanical Turk, to download images from Google Image Search and label them appropriately. Over two years, Li and her lab built and shared a dataset with 1,000 categories and 1.4 million images. To garner attention, Li began hosting a competition to correctly categorize ImageNet images using only software. After a middling couple of years, a 2012 neural network entrant named AlexNet achieved a top-5 accuracy of 84.7%, 10.8 percentage points better than that of the runner-up.

AlexNet, written by Alex Krizhevsky with Ilya Sutskever (who would later co-found OpenAI) and Geoffrey Hinton (who just won a Nobel Prize), was a watershed moment that put neural networks firmly on the map. Its arrival seemingly reversed the declining interest in machine learning and set the field on an upward path. Krizhevsky and co. were able to build AlexNet only because Li built ImageNet with Google and an army of crowdsourced workers.

The Internet allowed for the creation of datasets LeCun could only dream of, enabling broader applications and breakthroughs.

Common Crawl: The Internet Itself, Packaged as a Dataset

In 2007 –– the same year Li began work on ImageNet –– the Common Crawl Foundation was founded by Gil Elbaz. Elbaz’s first venture, Applied Semantics, created AdSense before it was bought by Google. After leaving Google, Elbaz formed Common Crawl out of a “desire to ensure a truly open web.”

The Common Crawl dataset is a massive open dataset. It contains information from more than 250 billion webpages, collected over 17 years. 3-5 billion pages are added a month. Over a decade after its founding, as people began to realize the benefits of building larger and larger LLMs, Common Crawl became a natural starting point for assembling LLM training datasets.

Like the original NIST handwriting dataset, Common Crawl was unwieldy for model builders. It was designed for researchers studying the web and programmers building new search platforms (remember, machine learning remained a niche field in 2007). So teams began filtering and preparing Common Crawl to easily compare results and save time as they iterated on model training techniques.

Google’s C4 dataset, prepared for the training of their T5 LLMs, is a great example and a commonly used Common Crawl variant. To prepare Common Crawl data for model pre-training, the team building C4 filtered out sentence fragments, boilerplate content (cookie alerts and privacy policies, for example), duplicates, source code, and offensive language, yielding a ~750GB subset. The filtered C4 dataset outperformed the unfiltered dataset by every metric.
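The sketch below shows what heuristic filtering of this kind can look like in Python. It is an illustration, not the C4 pipeline: the phrase lists, the blocklist, the word-count threshold, and the function names are all assumptions made for the example, and real pipelines work at a vastly larger scale with more careful deduplication.

```python
from typing import Optional

# Illustrative word lists and thresholds; NOT the values used to build C4.
BOILERPLATE_PHRASES = ("cookie", "privacy policy", "terms of use")
BLOCKLIST = {"badword"}  # stand-in for an offensive-word list
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_WORDS_PER_LINE = 3


def clean_page(text: str) -> Optional[str]:
    """Return the cleaned page text, or None if the whole page is dropped."""
    if "{" in text:  # crude signal that the page contains source code
        return None
    if any(word in BLOCKLIST for word in text.lower().split()):
        return None
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # sentence fragment, menu item, or heading
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue  # too short to be useful prose
        if any(phrase in line.lower() for phrase in BOILERPLATE_PHRASES):
            continue  # cookie banners, policy notices
        kept.append(line)
    return "\n".join(kept) if kept else None


def filter_corpus(pages: list[str]) -> list[str]:
    """Clean every page and drop exact duplicates of already-kept pages."""
    seen = set()
    cleaned = []
    for page in pages:
        result = clean_page(page)
        if result is None or result in seen:
            continue
        seen.add(result)
        cleaned.append(result)
    return cleaned


# Hypothetical crawl results.
pages = [
    "Welcome to my blog about birds.\nMenu | Home | About\nWe use cookies to improve your experience.",
    "Welcome to my blog about birds.\nSubscribe",
    "function f() { return 1; }",
    "Short.",
]
corpus = filter_corpus(pages)
```

Of the four sample pages, only one line of prose survives: the second page reduces to a duplicate of the first, the third looks like code, and the fourth is too short.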

Common Crawl is a foundational dataset of the LLM age. 60% of GPT-3’s training data is from Common Crawl. It makes up 18% of The Pile, an open dataset used by Microsoft, Meta, Apple, Yandex, and others to train their models. But despite its giant size and continued growth, Common Crawl has less access to web content today than it did pre-ChatGPT, whose launch spurred media and social platforms to reevaluate their licensing terms. Researchers estimate that 25% of the highest quality data is no longer available to Common Crawl.

So where do we go from here? Projects like C4 and The Pile proved bigger datasets weren’t always better. Smaller models, tuned only on the best subsets, show competitive results at a fraction of the model size.

There isn’t a game-changing dataset out there ready to spur the field like MNIST, ImageNet, or Common Crawl did. While vertical-specific datasets will emerge (for example, the Overture Maps Foundation datasets) and companies will spend fortunes accruing user feedback, we’ve already used the Internet –– the largest general dataset of them all.

A screencap from Star Fox, a SNES game that used the Super FX chip, a coprocessor designed for rendering polygons. Such functions would eventually be absorbed by GPUs.

The Graphics Industry Funded the Development of GPUs That Let Us Process Giant Datasets

As datasets for training machine learning models grew, they required faster hardware. After extensive experimentation, GPUs –– which had evolved over decades to perform rapid computations for 2D and 3D graphics –– proved ideal for machine learning.

So ideal, in fact, that the AI gold-rush has benefited Nvidia more than any other company. The GPU maker’s market capitalization has grown by ~260% since ChatGPT’s launch, and currently sits at over $3.6 trillion. Nvidia’s journey to this position only recently had anything to do with machine learning. For the greater part of 3 decades, GPU innovation was paid for by the videogame marketplace.

The Constant Need to Perform Pixel Math

Initially, personal computers didn’t have dedicated graphics processors. The original Macintosh –– the first mass market personal computer with a graphical interface –– rendered its monochrome, 512x342 resolution screen (that’s 1/17th as many pixels as an iPhone 16) entirely with its CPU. It did this while also listening to the keyboard and mouse, managing the disk drive and RAM, and running applications.

If we wanted higher resolution color screens –– and we most certainly did –– our computers would have to work much, much harder to calculate the many possible values for hundreds of thousands of pixels, at least 60 times a second. Rendering a color version of the Macintosh’s screen –– in 8-bit color with 256 possible values, the same as the original Nintendo –– required 8 times as much memory and processing as a monochrome screen.
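The arithmetic behind that 8x figure is easy to check. This short Python sketch uses the screen dimensions and the 60 Hz refresh rate quoted above; the function and variable names are just illustrative, and it counts only raw framebuffer storage, ignoring any other overhead.

```python
WIDTH, HEIGHT = 512, 342  # original Macintosh screen
REFRESH_HZ = 60           # redraws per second, as quoted in the text


def framebuffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Memory needed to hold one full frame at the given color depth."""
    return width * height * bits_per_pixel // 8


mono_bytes = framebuffer_bytes(WIDTH, HEIGHT, 1)   # 1 bit per pixel
color_bytes = framebuffer_bytes(WIDTH, HEIGHT, 8)  # 8 bits per pixel
ratio = color_bytes // mono_bytes

# Pixel values that must be produced every second at 60 Hz.
pixels_per_second = WIDTH * HEIGHT * REFRESH_HZ

print(mono_bytes, color_bytes, ratio, pixels_per_second)
# prints: 21888 175104 8 10506240
```

A monochrome frame fits in about 21 KB, while the 8-bit version needs about 171 KB, and either way the machine has to produce more than ten million pixel values per second.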

3 years after the launch of the Macintosh, in 1987, Apple shipped a Mac that could render 8-bit color. The Macintosh II achieved this milestone by including a “graphics card”: a separate device plugged into the motherboard with its own RAM and processor. The graphics card only performed pixel math –– figuring out what value each pixel should have, 60 times a second.

Throughout the late 80s and 90s, graphics hardware continued to specialize. Better cards and chips got you more pixels, more colors, and faster refresh rates. In the 90s gaming began to truly influence the market, spurring GPU card makers to add capabilities for 3D computations – transforming and clipping shapes, dealing with light and shading. These advancements, along with the necessary software adoption, allowed GPUs to render images, not simply draw them according to CPU instructions.

The goal was the same – draw better graphics, faster – and the card makers ruthlessly optimized towards that goal. Consider the Nvidia 8800 GTX, which launched in 2006. This card was a monster, the fastest GPU by a wide margin when initially released. It had 128 stream processors running at 1.35 GHz. Compare this to the Intel Core 2 Extreme, which landed in late 2006. It had only four 2.66 GHz cores. CPUs have a few big, generalized cores. GPUs have tons of small, simple cores.

The optimization of the GPU was chiefly paid for and influenced by the videogame marketplace. Gamers’ insatiable appetite for better graphics created a market for frequently updated cards, in arcade machines, home consoles, and PCs. As 3D gaming began, chips emerged just to handle polygon and lighting math (the Super FX chip that powered Star Fox being a notable example). These functions were eventually merged into the GPU itself.

The first prerequisite for machine learning on GPUs arrived in 2001 with the GeForce 3, the first chip capable of programmable shading. At a high level, this let developers define a short program, which could take assets like images as inputs, that would be run to compute a pixel’s value. Previously, they could only choose from a handful of predefined functions; now the primitives were exposed for writing your own functions. These functions couldn’t be as generic as CPU functions, but anything that could be expressed as pixel or 3D math could be run on the GPU’s copious cores.
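Conceptually, a programmable shader is a small function the hardware calls once for every pixel. The Python sketch below imitates that model on the CPU; it is not how shaders are actually written (real ones use dedicated shading languages and run in parallel on the GPU), and the names `gradient_shader` and `render` are made up for the example.

```python
WIDTH, HEIGHT = 4, 2  # a tiny "screen" so the result is easy to inspect


def gradient_shader(x: int, y: int, texture: list[list[int]]) -> int:
    """A tiny 'shader': darken a texture sample with a left-to-right gradient."""
    return texture[y][x] * x // (WIDTH - 1)


def render(shader, texture: list[list[int]]) -> list[list[int]]:
    """Call the shader once per pixel. A GPU runs these calls in parallel."""
    return [[shader(x, y, texture) for x in range(WIDTH)] for y in range(HEIGHT)]


# The input asset: a solid white image.
texture = [[255] * WIDTH for _ in range(HEIGHT)]
frame = render(gradient_shader, texture)
```

Swapping in a different shader function changes the picture without touching `render`, which is the flexibility that programmable shading exposed to developers.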

Preparing Other Problems for Pixel Processors

In 2000, Stanford grad student Ian Buck built an 8K gaming rig using 32 GeForce cards. Buck was working on using distributed graphics systems to render larger displays and had to develop entirely new systems for coordinating this GPU computation. He followed this thread beyond the pursuit of bigger and better displays into more general computing use cases, culminating with his 2004 paper, “Stream Computing on Graphics Hardware.” He spells out the challenge in the abstract:

As the programmability and performance of modern graphics hardware continues to increase, many researchers are looking to graphics hardware to solve computationally intensive problems previously performed on general purpose CPUs. The challenge, however, is how to re-target these processors from game rendering to general computation, such as numerical modeling, scientific computing, or signal processing. Traditional graphics APIs abstract the GPU as a rendering device, involving textures, triangles, and pixels. Mapping an algorithm to use these primitives is not a straightforward operation, even for the most advanced graphics developers. The results were difficult and often unmanageable programming approaches, hindering the overall adoption of GPUs as a mainstream computing device.

The problem is, GPUs have tremendous computing power but only speak in graphics. In the paper, Buck presents his solution: Brook for GPUs, a programming system for more easily writing general-purpose computation functions and translating them into GPU code. If your work could cosplay as a pixel problem, it could run really, really fast.
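Here is a rough Python sketch of what "cosplaying as a pixel problem" means. It is not Brook code and makes no claim about Brook's actual API: the helper names are hypothetical, and the loop runs sequentially on the CPU. The idea it illustrates is packing ordinary numbers into 2D grids (standing in for textures) and expressing the computation as a function evaluated once per output pixel.

```python
def to_texture(values: list, width: int) -> list[list]:
    """Pack a flat list into rows, the way data was packed into a texture.

    Assumes len(values) is a multiple of width.
    """
    return [values[i:i + width] for i in range(0, len(values), width)]


def from_texture(texture: list[list]) -> list:
    """Read the 'pixels' back out as a flat list."""
    return [value for row in texture for value in row]


def run_kernel(kernel, *textures: list[list]) -> list[list]:
    """Evaluate `kernel` once per output pixel; a GPU would do this in parallel."""
    height, width = len(textures[0]), len(textures[0][0])
    return [[kernel(*(t[y][x] for t in textures)) for x in range(width)]
            for y in range(height)]


def saxpy(a: int):
    """Build a per-element kernel computing a*x + y, a classic numeric workload."""
    def kernel(x, y):
        return a * x + y
    return kernel


xs = [1, 2, 3, 4, 5, 6]
ys = [10, 20, 30, 40, 50, 60]
result = from_texture(run_kernel(saxpy(2), to_texture(xs, 3), to_texture(ys, 3)))
```

Nothing in `saxpy` has anything to do with graphics, yet each output element is computed exactly the way a pixel would be: independently, from the same coordinates in the input grids.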

Unsurprisingly, Buck was hired by Nvidia in 2004 (where he remains today). There, he reckoned with Brook’s shortcomings and began a project that would fix them: CUDA, which launched in 2007.

Initially, no one was quite sure what to use general-purpose GPU computing for: cryptography, oil and gas exploration, stock market models, biology simulations, and physics simulations were all shotgunned out as potential applications. Nvidia trotted out prototypes and mock-ups, demonstrating physics simulations and biology toys, but they garnered little interest outside a few slices of academia.

Everyone knew CUDA was fast, but no one knew what it was for.

Which brings us back to 2012, to ImageNet and AlexNet, a defining moment in machine learning in more ways than one.

Krizhevsky, Sutskever, and Hinton built AlexNet using CUDA and two Nvidia GTX 580s. The GTX 580 was a stock consumer gaming card. It cost $500 for the top-tier model with 3 GB of RAM, a notably large size which enabled Krizhevsky and team to fit their network across two cards. When AlexNet ran away with the ImageNet contest, it not only illustrated the capabilities of neural networks but demonstrated how GPUs were essential tools for the job.

By 2015, everyone knew what CUDA was for and Nvidia focused the project entirely on the neural network use case[1]. Their early and constant commitment paid off, as CUDA was ready for its time in the spotlight. Higher-level deep learning frameworks –– like TensorFlow and PyTorch –– made CUDA even easier to use, bringing GPU-acceleration to a much, much bigger pool of programmers.

LLMs Were Built With Three Decades of Internet Content & Graphics Innovations

CUDA[2] granted incredible, affordable computational performance –– honed over three decades in the graphics industry –– to the machine learning field. CUDA was Prometheus, stealing fire from the gamers and giving it to machine learning nerds. The field accelerated as we figured out how to make models from bigger and bigger datasets. With more minds able to play with machine learning, we discovered new techniques that let us build Large Language Models, or LLMs[3], starting with Google’s BERT in 2018.

But we consumed these gifts –– an Internet’s worth of content and seven gaming console generations’ worth of graphics horsepower –– in only a few years. There isn’t another sector riding in tomorrow with the gifts needed to fuel our next breakthrough. Now we have to take our innovations in real time.

Recent trends demonstrate this pace:

These (and more) are irons in the fire that could yield big gains. But I’d wager we’ll instead see consistent, incremental results in LLM capability.

I don’t think this is a bad thing! (Unless you’re a company whose valuation hinges on delivering AGI.) I continue to believe we have incredible untapped potential in the current models that we’re only now learning how to apply. We could probably freeze LLM development and build valuable apps off the current state for years. As designers, developers, and other builders learn how to apply AI cogs selectively –– delivering ‘quiet’ AI features –– we’ll get improved existing tools and wholly new ones.

LLMs have hit a wall. Now begins the slow climb upward.

  1. So much ink is spilled on Nvidia and its incredible market cap, but I continue to think we underappreciate the lessons from CUDA. Nvidia’s leadership kept investing in an R&D project in the hope that a new use case would emerge for their hardware, despite it not paying off for nearly a decade. And they didn’t try to force a use case – the net was cast wide until it was absolutely clear that ML training had immense potential. Finally, they didn’t fall for crypto. In earnings call after earnings call they downplayed revenue from cryptofarms hoarding GPUs –– they even saw it as a problem as it was incredibly frustrating for their gamer customers. CUDA is perhaps the primary reason Nvidia, and not AMD or Intel, is worth trillions of dollars. 

  2. Why the multiplatform OpenCL framework didn’t find traction like CUDA is beyond the scope of this post.

  3. If you want to learn more about how exactly LLMs work – how they turn language into pixel math in order to build the models behind your favorite chatbots – I highly recommend 3Blue1Brown’s recent short primer on the topic.