DeepSeek R1 — Breaking down the myth of minimal compute footprint
A Comprehensive Overview of Its Training Approach and Capabilities
Please find the DeepSeek R1 paper here
DeepSeek R1 has been making waves with its exceptional reasoning abilities — rivaling models like OpenAI’s “o1” in several complex tasks. Some folks have even spread the rumor that DeepSeek was trained on just a few GPUs in someone’s basement, which couldn’t be further from the truth. In reality, as corroborated by details in the DeepSeek paper, the team utilized a sophisticated, large-scale training pipeline. Below, I’ll explore why DeepSeek needed those substantial resources and how the team’s innovative multi-stage process (anchored by the Group Relative Policy Optimization, or GRPO) propelled it to state-of-the-art levels in math, coding, and beyond.
The DeepSeek Architecture and Scale
According to the paper’s Section 2 (“Model Architecture and Pretraining”), DeepSeek R1 builds upon a highly capable base model (DeepSeek V3). DeepSeek V3 was already trained on a wide-ranging corpus of text, covering everything from general web documents to specialized math and coding resources. This broad coverage laid a firm foundation for the advanced fine-tuning steps that would eventually yield R1.
The paper reports that the base model, DeepSeek V3, is a Mixture-of-Experts design with 671 billion total parameters, of which roughly 37 billion are activated per token, placing it among the largest open LLMs. Training (and fine-tuning) a model of that size is no trivial matter: teams typically rely on dozens, if not hundreds, of high-end GPUs, each with large memory capacities, to handle the enormous amount of data processed during training. That alone is an immediate reality check against the idea that only a handful of GPUs were involved.
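To make that scale concrete, here is a rough back-of-envelope sketch (my own illustration, not a figure from the paper) of the GPU memory needed just to hold weights and optimizer state during full fine-tuning. It assumes the common rule of thumb of about 16 bytes per trainable parameter for mixed-precision Adam and ignores activations, KV caches, and parallelism overhead:

```python
import math

# Back-of-envelope memory estimate for full fine-tuning (illustrative only).
# Assumes mixed-precision Adam: ~2 bytes (fp16 weights) + 2 bytes (fp16 grads)
# + 12 bytes (fp32 master weights plus Adam moment estimates) per parameter.
BYTES_PER_PARAM = 16
GPU_MEMORY_GB = 80            # e.g., one 80 GB accelerator
ACTIVE_PARAMS = 37e9          # parameters activated per token (MoE routing)
TOTAL_PARAMS = 671e9          # total parameters that must live somewhere in the cluster

def gpus_needed(num_params: float) -> int:
    """Minimum GPUs just to hold weights and optimizer state, before activations."""
    total_gb = num_params * BYTES_PER_PARAM / 1e9
    return math.ceil(total_gb / GPU_MEMORY_GB)

print(f"Active-parameter working set: >= {gpus_needed(ACTIVE_PARAMS)} GPUs")
print(f"Full parameter set:           >= {gpus_needed(TOTAL_PARAMS)} GPUs")
```

Even under these optimistic assumptions nothing fits on a single device, and that is before counting activations, long-context KV caches, and the repeated sampling that RL training requires.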
Why Group Relative Policy Optimization (GRPO)?
Section 3.1 (“Reinforcement Learning Foundations”) of the paper details the motivation behind GRPO: a variant of Proximal Policy Optimization (PPO) that eliminates the need for a value function model. Instead, GRPO uses group-based advantage estimation:
Multiple Outputs per Prompt — For each prompt, DeepSeek generates several candidate responses.
Collective Reward Scoring — Instead of scoring each response in isolation, GRPO compares every response to the group's average reward (normalized by the group's standard deviation). Each solution's "advantage" is how much it surpasses (or falls short of) that shared baseline.
KL-Divergence in the Loss — Rather than merging KL into the reward signal, GRPO includes it directly in its objective function to keep updates stable and prevent catastrophic policy drift.
By removing the value network, the paper notes savings in both computation and GPU memory — this is particularly relevant for large models, where memory usage can balloon if you’re simultaneously training the base model, a value network, and separate reward models.
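To ground the description above, here is a minimal PyTorch sketch of a GRPO-style objective for a single prompt. It is my own simplification (per-response log-probabilities rather than per-token terms), not DeepSeek's code, but it shows the three ingredients just listed: group sampling, group-relative advantages, and a KL term kept inside the loss.

```python
import torch

def grpo_loss(logprobs, old_logprobs, ref_logprobs, rewards, clip_eps=0.2, kl_coef=0.04):
    """GRPO-style objective for one prompt (simplified sketch, not DeepSeek's code).

    logprobs / old_logprobs / ref_logprobs: summed log-probabilities of each of the
    G sampled responses under the current, old, and frozen reference policies, shape (G,).
    rewards: scalar reward per response, shape (G,).
    """
    # Group-relative advantage: each response is judged against its own group's
    # baseline, so no learned value network is needed.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate using importance ratios against the old policy.
    ratios = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratios, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratios * advantages, clipped * advantages).mean()

    # KL penalty kept inside the objective rather than folded into the reward.
    # (Sequence-level simplification of the per-token estimator.)
    diff = ref_logprobs - logprobs
    kl = diff.exp() - diff - 1
    return policy_loss + kl_coef * kl.mean()

# Example: one prompt, a group of four sampled answers, two of them correct.
loss = grpo_loss(
    logprobs=torch.tensor([-12.0, -15.0, -11.5, -14.0], requires_grad=True),
    old_logprobs=torch.tensor([-12.2, -14.8, -11.7, -13.9]),
    ref_logprobs=torch.tensor([-12.5, -15.1, -11.9, -14.2]),
    rewards=torch.tensor([1.0, 0.0, 1.0, 0.0]),
)
```

Because the baseline comes from the group itself, no separate value network has to be trained or held in GPU memory, which is exactly the saving the paper highlights.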
Multi-Stage Training: From R1-Zero to the Final R1
A core finding in Section 4 (“Progressive Training Pipeline”) is that a hybrid approach — alternating between supervised fine-tuning (SFT) and reinforcement learning — often yields better results than pure RL. Here’s a breakdown of their four-stage pipeline, weaving in extra detail from the DeepSeek paper:
a) Stage 1: Base to SFT
Overview: They collected chain-of-thought (CoT) data, sometimes up to 10k tokens long, from a combination of human annotators and an initial RL model (dubbed R1-Zero).
Purpose: Fine-tuning the base model with these refined samples helps the model adopt better language consistency and clarity before the RL steps even begin in earnest.
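As a concrete illustration of what this SFT step looks like mechanically, here is a minimal sketch (mine, not DeepSeek's) of a loss that trains only on the chain-of-thought tokens, assuming a Hugging Face-style causal language model that returns logits:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, cot_ids):
    """Supervised fine-tuning loss on a (prompt, chain-of-thought) pair.

    Illustrative sketch only: loss is computed on the CoT tokens, not the prompt,
    so the model learns to reproduce the curated reasoning rather than the question.
    """
    input_ids = torch.cat([prompt_ids, cot_ids], dim=-1).unsqueeze(0)  # (1, T)
    logits = model(input_ids).logits                                   # (1, T, vocab)

    # Mask prompt positions so only CoT tokens contribute to the loss.
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(-1)] = -100        # ignored by cross_entropy

    # Standard next-token shift.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```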
b) Stage 2: RL for Reasoning
Reward Functions: The authors highlight “rule-based” signals for correctness, language uniformity, and format. For example, math solutions might be rewarded based on matching the final answer or verifying that code can compile and run.
Key Insight: This RL pass significantly elevated the model’s ability to chain multiple reasoning steps together, as showcased in Section 5.1 with improvements on math benchmarks.
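The rule-based signals described above are easy to picture in code. Below is a hedged sketch of what a correctness-plus-format reward for math problems could look like, assuming the <think>/<answer> response template the paper describes; the exact checks and weights are my own guesses, not the team's implementation.

```python
import re

def format_reward(response: str) -> float:
    """Reward the expected layout: reasoning inside <think> tags, final answer inside <answer> tags."""
    has_think = "<think>" in response and "</think>" in response
    has_answer = "<answer>" in response and "</answer>" in response
    return 1.0 if (has_think and has_answer) else 0.0

def math_reward(response: str, gold_answer: str) -> float:
    """Rule-based correctness: extract the final answer and compare it to the reference."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

def total_reward(response: str, gold_answer: str) -> float:
    # Additive combination; the relative weighting here is a guess, not from the paper.
    return math_reward(response, gold_answer) + 0.1 * format_reward(response)

# Example
sample = "<think>7 * 6 = 42</think><answer>42</answer>"
print(total_reward(sample, "42"))   # 1.1
```

For coding prompts, the same idea applies with a compiler or test harness standing in for the string comparison.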
c) Stage 3: Rejection Sampling + SFT
Data Generation: The RL-enhanced model generates a massive synthetic dataset of potential responses — both for reasoning tasks (e.g., math, coding) and for broader tasks (e.g., role-play, general writing).
Filtering (or “Rejection”): A separate DeepSeek V3 “judge” model discards subpar outputs. Only the top-performing responses feed into the next SFT round, ensuring the model “learns from the best of itself.”
Paper’s Note: The authors emphasize how “automated filtering” can scale up high-quality dataset creation far beyond what human labelers alone can manage.
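Here is a minimal sketch of that generate-then-filter loop. The `generate` and `judge_score` callables are hypothetical stand-ins for the RL-tuned sampler and the DeepSeek V3 judge, and the threshold is arbitrary; the point is simply that only top-scoring candidates become SFT pairs.

```python
def rejection_sample(prompts, generate, judge_score, samples_per_prompt=16, min_score=0.8):
    """Generate-then-filter loop (illustrative sketch, not DeepSeek's code).

    `generate(prompt, n)` samples n candidates from the RL-tuned model and
    `judge_score(prompt, response)` returns a quality score from a judge model
    (the paper describes DeepSeek V3 playing that role). Only top candidates
    survive to become supervised fine-tuning pairs.
    """
    sft_pairs = []
    for prompt in prompts:
        candidates = generate(prompt, samples_per_prompt)
        scored = [(judge_score(prompt, c), c) for c in candidates]
        best_score, best_response = max(scored, key=lambda pair: pair[0])
        if best_score >= min_score:              # reject prompts with no good answer
            sft_pairs.append({"prompt": prompt, "response": best_response})
    return sft_pairs
```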
d) Stage 4: RL for Helpfulness
Combination of Reward Models: The final pass reintroduces RL with GRPO, but this time the focus is on overall helpfulness and safety (i.e., being “harmless”).
Outcome: The final R1 model can both reason deeply and remain coherent, user-friendly, and contextually appropriate across tasks.
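A plausible way to picture the Stage 4 reward is a blend of the earlier rule-based checks (for verifiable prompts) with learned helpfulness and harmlessness scores (for open-ended ones). The sketch below is my own guess at such a blend: `helpfulness_rm`, `safety_rm`, and the 0.5/0.5 weights are illustrative, and `math_reward` reuses the earlier sketch.

```python
def final_stage_reward(prompt, response, gold_answer=None,
                       helpfulness_rm=None, safety_rm=None):
    """Blend of signals for the last RL pass (my guess at the shape, not the paper's formula).

    Verifiable prompts keep the rule-based check from the earlier sketch; open-ended
    prompts fall back on learned reward models for helpfulness and harmlessness.
    """
    if gold_answer is not None:                   # math/coding: stay rule-based
        return math_reward(response, gold_answer)
    helpful = helpfulness_rm(prompt, response)    # hypothetical reward-model call
    harmless = safety_rm(prompt, response)        # hypothetical reward-model call
    return 0.5 * helpful + 0.5 * harmless         # illustrative weights
```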
On GPU Usage and Computational Footprint
One of the most frequent misconceptions is that DeepSeek required minimal hardware. Actually, Section 3.2 (“Implementation Details”) indicates that the team used large-scale GPU clusters for the multi-stage training. The paper goes into detail about how memory was allocated during RL steps, describing how the group-based advantage calculation allowed them to cut back on some overhead compared to classical PPO, but not to the point of only using “a few” GPUs.
In fact, the model’s parameter count, the length of its training sequences (up to 10k tokens in certain CoT examples), and the repeated sampling needed for RL steps imply a major computational investment. There’s mention of “distributed data parallel” setups and “accelerator optimization strategies,” all consistent with large HPC cluster usage.
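The paper does not publish training code, but the "distributed data parallel" phrasing maps onto standard tooling. As a minimal illustration, here is what wrapping a model for multi-GPU gradient synchronization typically looks like in PyTorch when launched with `torchrun`; real large-model runs layer tensor and pipeline parallelism and sharded optimizers on top of this.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_ddp(model):
    """Minimal data-parallel setup, launched with `torchrun --nproc_per_node=<gpus> train.py`.

    Each process owns one GPU, and gradients are all-reduced after every backward pass.
    """
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return DDP(model.to(local_rank), device_ids=[local_rank])
```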
Evaluation Metrics and Surprising Findings
The DeepSeek paper goes beyond raw training details to show off evaluation benchmarks in sections like Section 5.2 (“Mathematical Reasoning Evaluation”) and Section 6.1 (“General Benchmarking and Ablation Studies”). A few highlights:
Mathematical Gains: On challenging math tasks, DeepSeek R1’s pass@1 rate soared well above the base model’s, implying the RL steps truly helped the model think through multi-step solutions rather than guess.
Coding and Debugging: The model’s ability to generate functional code with minimal errors improved notably with repeated RL exposure. They also measured “test pass rates” for short coding problems, again showing a clear jump from earlier versions.
Effectiveness of Rule-Based Rewards: An interesting twist is that straightforward, rule-based rewards (e.g., verifying a final numeric answer or checking code compile logs) worked better than more complex, black-box reward models in many cases. This suggests that direct, verifiable criteria are especially potent for guiding LLM training.
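For readers unfamiliar with the pass@1 and test-pass-rate numbers cited above, the standard unbiased pass@k estimator used in code-generation evaluation is short enough to quote in full (this is the general formula from the literature, not something specific to the DeepSeek paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of them correct, budget k."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 4 correct -> estimated pass@1 of 0.25
print(pass_at_k(n=16, c=4, k=1))
```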
The “Surprises and Insights” section (also part of Section 6) underscores two big takeaways:
No Need for MCTS or Process Reward Models
While there’s growing interest in advanced search techniques like Monte Carlo Tree Search (MCTS) or extremely granular step-by-step reward modeling, the DeepSeek team found that simpler methods, plus multi-stage refinement, achieved outstanding results.
Early SFT Stabilizes RL
Starting with SFT (rather than jumping straight to RL from the base) prevented the typical “chaotic convergence” that can happen when an unrefined model tries to optimize purely via RL signals. This dual approach saved considerable training time and reduced the risk of unproductive “mode collapse.”
So what does this mean?
DeepSeek’s journey to R1 is a case study in balancing raw computational power, clever algorithmic tweaks, and iterative data curation. Training a large model with complex RL signals can be tricky: too much RL from scratch, and you get bizarre outputs; too much supervised training, and you miss out on the dynamic advantage RL can offer. DeepSeek’s success came from interleaving these methods:
1. Start with SFT to align with basic language norms and maintain clarity.
2. Do focused RL to drastically improve reasoning in math, coding, etc.
3. Use rejection sampling to filter out and refine large volumes of synthetic data.
4. Finish with RL again, this time ensuring user-friendliness and adherence to content guidelines.
All of this, of course, was backed by a strong HPC setup — no small feat in terms of GPU allocations.
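Put together, the whole recipe can be summarized in a few lines of pseudocode-level Python. Every helper below (`sft`, `grpo_train`, `model_sampler`, `judge_model_score`, and the reward functions from the earlier sketches) is a hypothetical stand-in, not an API from the paper:

```python
def train_r1_style(base_model, cot_data, reasoning_prompts, mixed_prompts):
    """High-level orchestration of the four-stage recipe (pseudocode-level sketch)."""
    # Stage 1: supervised fine-tuning on curated chain-of-thought data.
    model = sft(base_model, cot_data)

    # Stage 2: GRPO reinforcement learning with rule-based reasoning rewards.
    model = grpo_train(model, reasoning_prompts, reward_fn=total_reward)

    # Stage 3: rejection-sample a large synthetic dataset, then SFT on the survivors.
    synthetic = rejection_sample(mixed_prompts,
                                 generate=model_sampler(model),
                                 judge_score=judge_model_score)
    model = sft(model, synthetic)

    # Stage 4: final GRPO pass aimed at helpfulness and harmlessness.
    model = grpo_train(model, mixed_prompts, reward_fn=final_stage_reward)
    return model
```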
Lessons for Future LLM Training
By reading the DeepSeek paper closely, a few big themes emerge that future model developers can learn from:
Large-Scale Compute is Non-Negotiable
Even with efficient algorithms like GRPO, you’re not escaping the need for clusters of GPUs if you aim for top-tier performance in LLMs.
Please see my paper on this topic; here is the link to the LinkedIn newsletter article.
Incremental, Multi-Stage Approaches Are Powerful
Hybridizing supervised fine-tuning and reinforcement learning can yield stable training and remarkable performance leaps.
Simplicity in Rewards Can Win
Don’t underestimate the value of straightforward, rule-based checks. They can often surpass more opaque reward models, especially for tasks where correctness is objective and easily defined (like math or coding).
Tailor RL to the Final Goal
If your end goal is a “helpful and harmless” assistant, reserve a final RL pass that specifically focuses on that. DeepSeek’s helpfulness emphasis in Stage 4 is a prime example.
In essence, DeepSeek R1’s success is about balancing advanced algorithmic insights with pragmatic engineering. The final model demonstrates that robust reasoning and user-friendly communication aren’t mutually exclusive — provided you structure your training pipeline thoughtfully, use the right reward signals, and, yes, have the necessary horsepower (a.k.a. more than “a few” GPUs!).
References to the DeepSeek Paper (Paraphrased):
Section 2 for Model Architecture & Pretraining
Section 3.1 for Reinforcement Learning (PPO vs. GRPO)
Section 3.2 for Implementation Details & GPU Usage
Section 4 for Progressive Training Pipeline (Stages 1–4)
Section 5.1 & 5.2 for Math Benchmarks & Evaluation Metrics
Section 6 & 6.1 for General Benchmarking, Ablation, and Surprise Findings
By combining these sections’ insights, you get the full story of how DeepSeek R1 rose to prominence — not through shortcuts or minimal hardware, but via a comprehensive, carefully orchestrated training strategy that harnessed the best of supervised learning and reinforcement learning.
Please let me know your thoughts in the comment section.