How Scaling Laws Will Determine AI's Future

Transcript

Speaker 0:

The deadline to apply for the first YC spring batch is February 11. If you're accepted, you'll receive $500,000 in investment plus access to the best startup community in the world. So apply now and come build the future with us. Large language models are getting bigger, much bigger. They're also getting smarter. Over the past few years, AI labs have hit on what feels like a winning strategy.

Scaling. More parameters, more data, more compute.

Speaker 1:

Keep scaling the models and they keep improving. You know, just like Moore's Law, we saw the doubling in performance every eighteen months. With AI, we have now started to see the doubling every six months or so. But could that be coming to an end? Is the era of scaling finally over?

Speaker 0:

Or are we standing right at the beginning of a brand new scaling paradigm, one that promises to revolutionize AI forever?

In November of 2019, OpenAI released GPT-2, its largest-ever model, with one and a half billion parameters. The next summer, they released its successor, GPT-3, which was something we'd never seen before. Not only was GPT-3 far more useful and usable, it was also much bigger, over 100 times bigger than GPT-2. The era of scaling laws had arrived.

Before GPT-3, LLMs were already getting bigger, but it was still anyone's guess whether or not that extra size, data, and compute would be worth it. There was no guarantee that making your model 100 times bigger would also make it 100 times better. What if they started to run into diminishing returns?

It wasn't until January of 2020, when Jared Kaplan, Sam McCandlish, and their colleagues at OpenAI released the influential "Scaling Laws for Neural Language Models" paper, that the field started to take notice. Think of training AI models like a recipe. You have three main ingredients: the model itself, the data it's trained on, and the compute power used to train it.

Larger models have more parameters. These are the internal values of the neural net that are tweaked and trained in order to make predictions. These models are also typically trained on much more data, measured in tokens, which for LLMs are often words or parts of words.

Finally, training these larger models requires computing power, which means more GPUs running for longer using more and more energy. What the scaling laws paper revealed was that by cranking up all three, the parameters, the data, and the compute, the result was a smooth, consistent improvement in model performance in the form of a power law.
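To make the power-law idea concrete, here is a minimal sketch of how loss might fall as parameter count grows. The function name and the constants `n_c` and `alpha` are illustrative assumptions that do not appear in the transcript; they are chosen to be of the same order as values reported for language models, and the exact numbers should not be relied on.

```python
def power_law_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Toy power law: loss = (n_c / n_params) ** alpha.

    n_c and alpha are illustrative assumptions, not values from the transcript.
    """
    return (n_c / n_params) ** alpha


# Compare a model with 1.5 billion parameters to one 100 times larger.
small = power_law_loss(1.5e9)
large = power_law_loss(1.5e11)

# Under a power law, the ratio depends only on the scale factor: 100 ** -alpha.
ratio = large / small
print(f"loss at 1.5e9 params:  {small:.3f}")
print(f"loss at 1.5e11 params: {large:.3f}")
print(f"ratio (large / small): {ratio:.3f}")
```

Under these assumed constants, a 100 times bigger model does not give a 100 times lower loss. It gives a steady, predictable reduction, which is what makes the curve useful for planning training runs.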

Performance, it turned out, depends much more on scale than on the algorithm. Later in the year, more research from OpenAI confirmed that these scaling laws worked for other kinds of models too. On text-to-image, image-to-text, and even math, the same scaling laws were there. But back in early 2020, LLM scaling laws were pretty much unknown outside of OpenAI.

That is, except for one person. The anonymous researcher and writer Gwern was one of the first people to home in on what he called the scaling hypothesis. Scale up the size, the data, and the compute, and watch intelligence emerge.

Speaker 3:

Maybe intelligence really is just like a lot of compute applied to a lot of data, applied to a lot of parameters.

Speaker 0:

Maybe Moravec and Legg and Kurzweil were right. Gwern's post brought scaling laws into the mainstream. And over time, what started as a quiet observation turned into a foundational principle for AI development. But OpenAI's research was just a part of the picture. In 2022, Google DeepMind released their own research on scaling laws, and they added an important missing piece.

It turned out that it's not just about making models bigger, it's also about making sure you train them on enough data. Researchers were looking to find the optimal model size and amount of training data for a given compute budget. So they trained over 400 models of different sizes with different amounts of data. And what they found was surprising.

Their research suggested that previous LLMs like GPT-3 were actually undertrained. These models were huge, but they hadn't been trained on enough text to fully realize their potential. To test this, they trained Chinchilla, an LLM less than half the size of GPT-3, but with four times more data. And it won. Chinchilla was far better than models double, even triple its size.

These so-called Chinchilla scaling laws meant that training the optimal model wasn't just about making the model larger, but also about having enough data to feed it. Chinchilla was a huge milestone on the road to training the frontier AI models we have today, like GPT-4o, Claude 3.5 Sonnet, and others.
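The trade-off between model size and data for a fixed compute budget can be sketched with two widely used rules of thumb. Both are assumptions here, not figures from the transcript: training compute is approximated as 6 FLOPs per parameter per token, and the compute-optimal ratio is taken to be roughly 20 tokens per parameter. The function name is hypothetical.

```python
import math

FLOPS_PER_PARAM_PER_TOKEN = 6   # assumed approximation: C ~ 6 * N * D
TOKENS_PER_PARAM = 20           # assumed rule-of-thumb ratio: D ~ 20 * N


def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (parameters, tokens) under the assumptions above.

    Solving C = 6 * N * D with D = 20 * N gives N = sqrt(C / 120).
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# Example budget (hypothetical): 5.88e23 FLOPs.
params, tokens = compute_optimal_split(5.88e23)
print(f"parameters: {params:.3g}")
print(f"tokens:     {tokens:.3g}")
```

Because both quantities grow with the square root of the budget under these assumptions, spending more compute on parameters alone leaves a model undertrained, which is the point the Chinchilla result makes.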

Labs learned they could trust in the scaling laws and get reliably better and better models. So the future of AI is just bigger and bigger models forever, right? Well, recently, there's been plenty of debate within the AI community about whether or not we've finally reached the limits of scaling laws.

Some argued that as the latest generation of models have gotten bigger and more expensive, capabilities have started to plateau.

Speaker 2:

There's a lot of debate, in fact, just in the last multiple weeks. Have we hit the wall with scaling laws? The current generation of LLM models are roughly, you know, a few companies have converged at the top, but I think we're all working on our next versions too.

Speaker 4:

We're increasing GPUs at the same, like, rate, but we're not getting the intelligence improvements at all.

Speaker 0:

Meanwhile, rumors have leaked out of major labs about failed training runs and diminishing returns. Others have speculated that the lack of high-quality data to train new models has also become a major bottleneck.

Speaker 5:

One practical issue we could have is we could run out of data. For various reasons, I think that's not going to happen. But, you know, if you look at it very, very naively, we're not that far from running out of data. And so it's like we just don't have the data to continue the scaling curves.

Speaker 0:

So if the old scaling laws are beginning to lose their edge, what comes next? What if there were a new frontier for scaling from a brand new kind of model?

OpenAI's new class of reasoning models hints at a potential new direction. In a previous video, we explained how o1 learns to think through complex problems using its own chain of thought. And OpenAI researchers found that the longer o1 was able to think, the better it performed.

It wasn't immediately clear how well this strategy would continue to scale up, but now with the recent release of its successor, o3, the sky seems to be the limit for this new paradigm of scaling LLMs. o3 made headlines when it was announced, as it smashed benchmarks that were previously considered far out of reach for AI.

From software engineering to math to PhD-level science questions, o3 easily surpasses the old state-of-the-art results. o3 isn't just a small improvement on its predecessors. It's a huge leap. And OpenAI researchers say they have every reason to believe this trajectory will continue. It may even be on a path to artificial general intelligence.

Instead of continuing to scale up the model size during training, it seems likely that researchers will shift focus to scaling the amount of compute available to the model for its chain of thought, also called test-time compute.

By letting models think for longer, LLMs like o1 and o3 can leverage more compute on the fly, scaling up their intelligence when it's needed for harder and harder problems. Scaling pretraining may have plateaued, but by scaling test-time compute, OpenAI may have just opened up an entirely new paradigm for scaling laws, potentially unlocking capabilities we never thought possible.
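The general idea of spending more compute at answer time can be illustrated with a toy simulation. This is not how o1 or o3 work internally; it swaps in a much simpler technique, majority voting over independent samples, and every name and number in it (the solver, its 40% success rate, the distractor answers) is a hypothetical assumption.

```python
import random
from collections import Counter


def noisy_solver(correct_answer: int, p_correct: float, rng: random.Random) -> int:
    """Hypothetical stand-in for one sampled attempt at a problem."""
    if rng.random() < p_correct:
        return correct_answer
    # Wrong attempts are spread across several different incorrect answers.
    return rng.choice([correct_answer + d for d in (-2, -1, 1, 2)])


def majority_vote(correct_answer: int, n_samples: int, p_correct: float, rng: random.Random) -> int:
    """Sample n_samples attempts and return the most common answer."""
    votes = Counter(noisy_solver(correct_answer, p_correct, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]


def accuracy(n_samples: int, p_correct: float = 0.4, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which the voted answer is correct."""
    rng = random.Random(seed)
    hits = sum(majority_vote(42, n_samples, p_correct, rng) == 42 for _ in range(trials))
    return hits / trials


# More samples per question means more compute spent at answer time.
for n in (1, 5, 25):
    print(f"samples per question: {n:2d}  accuracy: {accuracy(n):.2f}")
```

In this toy setup, accuracy rises as more attempts are sampled per question, even though the underlying solver never changes. That is the sense in which compute at answer time can substitute for a larger model.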

Large language models are a key piece of the hunt for artificial general intelligence. These same principles of scaling appear to hold for other models too: image diffusion models, protein folding and chemical models, even world models for robotics, like for self-driving. One thing is clear.

It might be mid-game for large language models, but we are clearly still in the early game for scaling other modalities. Buckle up.
