The Engineering Unlocks Behind DeepSeek | YC Decoded
In this episode of YC Decoded, General Partner Diana Hu breaks down the key engineering optimizations behind DeepSeek's remarkable new models — and contextualizes them within the broader history of recent AI breakthroughs.
Transcript
There's a new AI model in town. Chinese AI company DeepSeek recently made waves when it announced R1, an open-source reasoning model that it claimed achieved comparable performance to OpenAI's o1 at a fraction of the cost. The announcement unleashed a wave of social media panic and stock market chaos, with Nvidia losing nearly $600 billion in market cap in a single day.
But for those following AI developments closely, DeepSeek and R1 didn't come out of nowhere. The company has been publishing its research and releasing its model weights for months, following a path similar to Meta's Llama models. This is in contrast to other major AI labs like OpenAI, Google DeepMind, and Anthropic, which keep their weights closed and publish more limited technical reports.
What's changed is just that now the broader public is actually paying attention. So let's decode what the real developments here are, where they come from, and why they matter. First of all, it is important to distinguish between two relevant models here: DeepSeek-R1 and DeepSeek-V3.
DeepSeek-V3, which was actually released this past December, is a general-purpose base model that achieves comparable performance to other base models like OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5. DeepSeek-R1, which was released in January, is a reasoning model built on top of V3.
In other words, DeepSeek took V3 and applied various algorithmic improvements to optimize its reasoning ability, resulting in R1, a model that achieves comparable performance to OpenAI's o1 and Google's Gemini 2.0 Flash on certain complex reasoning benchmarks.
But many of the algorithmic innovations responsible for R1's remarkable performance were actually discussed in the V3 paper from this past December, or even earlier: in DeepSeek's V2 paper, published in May 2024, and in the DeepSeekMath paper, which came out in February 2024.
V3 stitches together many of these innovations, which were designed primarily with compute and training efficiency in mind. One way DeepSeek optimized for efficiency and squeezed more floating-point operations per second (FLOPS) out of its GPUs was by training V3 natively in an 8-bit floating-point format (FP8) rather than the usual 16-bit or 32-bit formats. This is not a new idea.
Many other labs are doing it too, but it was key to getting such massive memory savings without sacrificing performance. A crucial enhancement is their FP8 accumulation fix, which periodically merges partial sums back into a higher-precision FP32 accumulator to prevent small numerical errors from compounding.
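The accumulation trick can be illustrated with a toy simulation. This is a minimal sketch, not DeepSeek's kernel code: a `quantize` helper stands in for an FP8 format with only a few significand bits, and a plain Python float stands in for the FP32 accumulator.

```python
import math

def quantize(x, sig_bits):
    """Round x to a limited number of significand bits, mimicking a
    low-precision float format (FP8 keeps only a few mantissa bits)."""
    if x == 0.0:
        return 0.0
    exp = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (exp - sig_bits)
    return round(x / scale) * scale

def naive_low_precision_sum(values, sig_bits=3):
    """Accumulate entirely in 'FP8': small addends vanish once the
    running sum grows large enough (swamping)."""
    acc = 0.0
    for v in values:
        acc = quantize(acc + v, sig_bits)
    return acc

def accumulate_with_promotion(values, sig_bits=3, interval=4):
    """Sketch of the accumulation fix: keep a short low-precision partial
    sum, and every `interval` additions flush it into a full-precision
    accumulator (a Python float standing in for FP32)."""
    acc_hi = 0.0   # high-precision accumulator
    partial = 0.0  # low-precision partial sum
    for i, v in enumerate(values, 1):
        partial = quantize(partial + v, sig_bits)
        if i % interval == 0:
            acc_hi += partial
            partial = 0.0
    return acc_hi + partial
```

With 256 additions of 0.01 (true sum 2.56), the purely low-precision loop gets stuck around 0.25 because each new increment rounds away, while the version that periodically flushes into a high-precision accumulator stays close to the true sum.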
The result: far more efficient training across thousands of GPUs, cutting costs while maintaining model quality. But why does this efficiency matter? Given its hardware constraints and US export controls on the sale of GPUs to China, DeepSeek needed to find a way to get more training throughput and more bandwidth from its existing cluster of GPUs.
You see, at AI labs, the GPUs that do the number crunching and matrix multiplication to train these models are actually sitting idle most of the time. Even at FP8, it is typical to only see around 35% model FLOPS utilization (MFU), meaning GPUs are only being used at their peak potential about a third of the time.
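As a quick worked example (with illustrative numbers, not DeepSeek's actual cluster figures), MFU is just the ratio of the FLOPS a training job actually sustains to the hardware's theoretical peak:

```python
def model_flops_utilization(achieved_flops_per_s, peak_flops_per_s):
    """MFU: fraction of a GPU's peak FLOP/s actually spent on model math."""
    return achieved_flops_per_s / peak_flops_per_s

# Hypothetical numbers: a GPU with a 1.5e15 FLOP/s FP8 peak that sustains
# 5.25e14 useful FLOP/s during training is running at 35% MFU.
print(model_flops_utilization(5.25e14, 1.5e15))  # → 0.35
```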
The rest of the time, these GPUs are waiting for data to be moved, either between caches or between GPUs. This is NVIDIA's key advantage: it is not just about GPUs, but about an integrated solution they've been building for over a decade that includes networking with InfiniBand, software with CUDA, and the developer experience.
Essentially, NVIDIA provides a deeply integrated system that lets AI researchers program a GPU cluster less as a distributed system and closer to what Jensen Huang describes as one giant GPU. Another clever way DeepSeek makes the most of its hardware is its particular implementation of a mixture-of-experts architecture.
DeepSeek-V3 has 671 billion total parameters, but only 37 billion are activated for a given token prediction. By contrast, the largest and most capable Llama 3 model doesn't use a mixture-of-experts architecture, so it activates its full 405 billion parameters for each token prediction.
In other words, V3 activates roughly 11x fewer parameters for each forward pass, saving tons of computation. Mixture of experts isn't a new concept, but it's been challenging to train models with this architecture efficiently. DeepSeek introduced novel techniques that stabilized performance and increased GPU utilization.
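The routing idea can be sketched in a few lines of plain Python. This is a toy illustration of top-k expert gating in general, not DeepSeek's actual implementation (which adds load-balancing and other stabilization techniques); `experts` and `router_weights` are hypothetical stand-ins for learned networks:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, router_weights, top_k=2):
    """Minimal mixture-of-experts sketch: the router scores every expert,
    but only the top_k experts actually run for this token, so compute
    scales with top_k rather than with the total expert count."""
    scores = [sum(t * w for t, w in zip(token, row)) for row in router_weights]
    gates = softmax(scores)
    # Pick the top_k experts by gate value; all others stay idle.
    chosen = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in chosen)
    out = [0.0] * len(token)
    for i in chosen:
        expert_out = experts[i](token)  # only the chosen expert FFNs execute
        weight = gates[i] / norm        # renormalize over the chosen experts
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out
```

Compute here scales with `top_k` experts per token rather than all of them, which is how V3 can hold 671 billion parameters while activating only 37 billion per token.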
Additionally, to overcome key performance bottlenecks, V3 makes use of multi-head latent attention (MLA), which DeepSeek first revealed in its V2 paper, published in May 2024. MLA is a solution designed to tackle KV cache storage limitations, one of the biggest sources of DRAM overhead in large models.
Instead of storing full key and value matrices, MLA compresses them down into a latent representation, reconstructing them only when needed. This helped the V2 model reduce its KV cache size by 93.3% and boosted its maximum generation throughput by 5.76x. Finally, unlike traditional models that predict only the next token, V3 makes use of multi-token prediction (MTP).
MTP enables V3 to anticipate multiple future tokens at each step. This densifies the training signal, providing more feedback per step for better data efficiency and faster learning. It also improves representation planning, allowing the model to pre-plan its representations for smoother, more coherent outputs.
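Here is a toy sketch of what "more feedback per step" means. This illustrates the multi-token training objective in general, not DeepSeek's implementation (which uses additional sequential Transformer modules to predict each extra depth):

```python
def mtp_targets(tokens, depth=2):
    """Build training targets for multi-token prediction: at each position
    the model must predict the next `depth` tokens rather than just one,
    so each sequence yields a denser supervision signal."""
    examples = []
    for i in range(len(tokens) - depth):
        context = tokens[: i + 1]                 # everything seen so far
        futures = tokens[i + 1 : i + 1 + depth]   # the next `depth` tokens
        examples.append((context, futures))
    return examples
```

For the sequence `["the", "cat", "sat", "on"]` with `depth=2`, each training position now carries two target tokens instead of one.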
During inference, the MTP modules can be repurposed for speculative decoding, reducing sequential processing steps and significantly speeding up generation. Taken altogether, this makes V3 one of the most impressive base models on the market, and it's been out for some time now. However, the recent release of DeepSeek's R1 reasoning model is what really made waves.
Most LLMs can be improved by being prompted to think step by step, but what sets reasoning models apart is that they are specifically trained to break down hard problems and think about them for paragraphs at a time. In September, OpenAI showed the power of this new approach with o1, which achieved state-of-the-art results on math, coding, and science benchmarks.
With R1, DeepSeek took a similar approach and published the secret sauce. Both OpenAI and DeepSeek achieved their impressive results through reinforcement learning (RL), a technique that shapes an LLM's behavior based on feedback and reward signals.
Modern LLMs use some variation of reinforcement learning from human feedback (RLHF) or reinforcement learning from AI feedback (RLAIF) to improve their usefulness and alignment. But reasoning models apply RL specifically to the task of thinking step by step through complex problems. So how did DeepSeek apply RL to get a reasoning model?
At a high level, they assembled a bunch of problems with verifiable outputs, especially math and coding problems, and then designed a training pipeline to get the model to think for a bit and output the correct answers. But they didn't give the model any external examples of how to think, whether from humans or AI. And their grading process was extremely simple.
Rather than using a complex AI model to give fine-grained feedback, DeepSeek uses simple rules to evaluate the model's final output on accuracy and formatting. These output scores then update the model through a novel technique DeepSeek published in February 2024, called Group Relative Policy Optimization (GRPO).
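The grading and update scheme can be sketched as follows. This is a simplified illustration of rule-based rewards and the group-relative advantage at the heart of GRPO, not the full objective (which also includes a clipped policy-ratio term and a KL penalty); the reward values below are hypothetical:

```python
import statistics

def rule_reward(final_answer, correct_answer, well_formatted):
    """Rule-based grading in the spirit of R1-Zero's training: credit for
    the right final answer plus a smaller bonus for following the expected
    output format. No learned reward model is involved."""
    reward = 1.0 if final_answer == correct_answer else 0.0
    if well_formatted:
        reward += 0.1
    return reward

def group_relative_advantages(rewards):
    """Core of GRPO: sample a group of completions for the same prompt and
    score each one relative to the group's mean and spread, avoiding the
    need for a separate value (critic) model."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero spread
    return [(r - mean) / std for r in rewards]
```

For four sampled answers to one math problem, rewards like `[1.1, 0.1, 1.0, 0.0]` yield positive advantages for the correct completions and negative ones for the rest; the policy is then nudged toward the positive-advantage samples.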
Remarkably, with this process alone, DeepSeek saw reasoning emerge over thousands of RL steps. The model learned skills like extended chain of thought and even experienced a moment where it recognized its own mistakes and backtracked to correct its reasoning. This model was R1-Zero, one of the first large models to achieve top-tier results purely through reinforcement learning.
Pure RL has long been a subject of investigation in Western research labs. DeepMind's AlphaGo simulated countless games of self-play to beat Lee Sedol, the world's top Go player, in 2016. In 2019, OpenAI achieved notable success using reinforcement learning to train a robotic hand to solve a Rubik's Cube and to beat a top human team at competitive Dota 2.
But, unconstrained by human examples, R1-Zero's thinking steps suffered from poor readability, switching between English and Chinese at random. So DeepSeek introduced a cold-start phase, fine-tuning on structured reasoning examples before RL, to get R1. This eliminated the language-mixing issues and made outputs far more comprehensible. The results are impressive.
R1 achieves comparable performance to o1 on certain math and coding benchmarks. But the pace of innovation is speeding up. Just two weeks after R1 was released, OpenAI released o3-mini, which outperforms R1 and o1 on key benchmarks. So if R1 didn't actually come out of nowhere, what explains the hype cycle? One explanation is the sheer accessibility of DeepSeek's models.
R1 is freely accessible through DeepSeek's website and app, and it is free to download, run locally, and customize. Also, because of all the efficiency improvements, it offers near state-of-the-art performance at a fraction of the price of other reasoning models.
Another explanation is that a lot of the hype cycle didn't have to do with the specific algorithmic improvements we described, but with misconceptions around V3's alleged $5.5 million in training costs. There's some important fine print here. The $5.5 million figure refers only to the cost of the final training run for V3.
It doesn't include any of the training costs of R1, or the associated R&D and hardware operating expenses, which are presumably in the hundreds of millions. Given the extreme algorithmic optimizations here, that $5.5 million training run figure actually seems perfectly plausible. And it is worth noting that this work is reproducible.
A UC Berkeley lab recently applied R1-Zero's key techniques to produce complex reasoning in a smaller model for just $30. What DeepSeek really proves is that there is still room for new players on the frontier. In particular, there's room for rebuilding the stack around optimizing GPU workloads, improving software and tooling at the inference layer, and developing AI-generated kernels.
Ultimately, this is fantastic news for AI applications, whether consumer or B2B, since it means the cost of intelligence keeps going down. So the big takeaway here: this is the best possible time to be building a startup.
The deadline to apply for the first YC spring batch is February 11. If you're accepted, you'll receive $500,000 in investment plus access to the best startup community in the world. So apply now and come build the future with us.