Joel P. Barmettler

AI Architect & Researcher

2025·Technical Deep Dive

R1's reinforcement learning breakthrough and distillation strategy

DeepSeek R1 demonstrated that reasoning capabilities can be trained through pure reinforcement learning without human-written examples of reasoning steps, then compressed into models small enough to run on consumer hardware. The 671-billion-parameter base model was distilled to a 32-billion-parameter Qwen student that retained 90% of the teacher's reasoning performance. This combination of training methodology and compression strategy changes the economics of deploying reasoning models.

Reinforcement learning without supervised reasoning examples

OpenAI's o1 model, released in September 2024, introduced extended chain-of-thought reasoning to production language models. The training methodology was never disclosed, but industry analysis suggested it relied on process supervision: human annotators rating each intermediate reasoning step in addition to the final answer. This provides fine-grained feedback but requires expensive human labor for every problem domain.

DeepSeek R1 used outcome-based reinforcement learning instead. The model receives a problem, generates a reasoning trace leading to an answer, and gets a binary reward: correct or incorrect. No human evaluates the intermediate steps. The model must discover its own reasoning strategies through trial and error. This approach is both conceptually simpler and cheaper to scale because correctness can be verified automatically for many domains (mathematics, coding, logic puzzles).
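A minimal sketch of what outcome-only supervision looks like in code. The answer format, the extraction regex, and the function name are illustrative assumptions, not DeepSeek's published implementation; the point is simply that only the final answer is ever scored.

```python
import re

def outcome_reward(reasoning_trace: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches, else 0.0.

    Hypothetical sketch: assumes the trace ends with a line like
    'Answer: 42'. No intermediate reasoning step is ever scored.
    """
    match = re.search(r"Answer:\s*(.+)\s*$", reasoning_trace)
    if match is None:
        return 0.0  # malformed trace earns no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# Only the final line matters: a clumsy middle step goes unpunished
# as long as the answer is right, and vice versa.
trace = "Let x = 3 * 4, so x = 12.\nThen x + 2 = 14.\nAnswer: 14"
print(outcome_reward(trace, "14"))  # 1.0
```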

The tradeoff is training stability. Process supervision guides the model toward human-legible reasoning patterns. Outcome supervision allows the model to develop any strategy that produces correct answers, even if the reasoning path is unconventional or difficult to interpret. DeepSeek's technical report indicates that early training exhibited unstable behavior: the model would generate extremely long reasoning traces or collapse into repetitive loops. These issues were resolved through reward shaping and training hyperparameter tuning.

Reward modeling and self-discovered reasoning strategies

The reward model for R1 is described as evaluating answer correctness, not reasoning quality. For mathematical problems, this means checking whether the final numerical answer matches the ground truth. For coding problems, it means running the generated code against test cases. For open-ended questions, it likely involves a separate verifier model trained to assess response quality, though the technical report does not specify details for this category.
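A hedged sketch of what such an automatic verifier might look like for the coding case. The `solve` entry-point convention and the binary pass/fail reward are assumptions for illustration; a production verifier would sandbox execution and enforce time limits.

```python
def code_reward(candidate_source: str,
                test_cases: list[tuple[tuple, object]],
                entry_point: str = "solve") -> float:
    """Execute generated code against test cases and return a binary reward.

    Illustrative only: assumes the model's program defines a function
    named `entry_point`. Any exception or failed test yields zero reward,
    mirroring the outcome-only signal described above.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # NOTE: unsafe outside a sandbox
        fn = namespace[entry_point]
        ok = all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return 0.0
    return 1.0 if ok else 0.0

# Hypothetical model output that happens to be correct.
generated = "def solve(a, b):\n    return a + b\n"
print(code_reward(generated, [((1, 2), 3), ((5, 5), 10)]))  # 1.0
```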

What the model learns through this training is both how to reach correct answers and how to allocate reasoning compute. Simple problems receive short reasoning traces. Complex problems receive longer traces with backtracking, alternative approaches, and self-correction. The model develops an internal heuristic for when additional reasoning is likely to improve accuracy, which is the core skill that distinguishes reasoning models from standard language models.

This behavior emerges without explicit instruction. The model is not told "spend more time on hard problems." It learns through gradient descent that generating longer, more structured reasoning traces correlates with higher rewards on difficult problems. This is analogous to how humans learn to slow down and think carefully when a problem is not immediately obvious.

Why distillation works for reasoning capabilities

Model distillation trains a smaller student model to imitate a larger teacher's behavior. Typically, distillation compresses knowledge but sacrifices capability: a 32B student cannot replicate a 671B teacher's performance on complex tasks. Reasoning models are different because the reasoning process is externalized.

When R1 generates a chain-of-thought trace before answering, it makes its problem-solving strategy explicit. The student model can observe both the answer and the logical steps leading to that answer. This is more learnable than implicit knowledge embedded in model weights. A small model cannot memorize the entirety of Wikipedia, but it can learn the procedural steps for solving a calculus problem if those steps are demonstrated.

DeepSeek distilled R1 to Qwen 32B, a pre-existing open source language model. The distillation training set consisted of R1's reasoning traces for hundreds of thousands of curated samples spanning mathematics, coding, and general reasoning. The student learned to generate similar reasoning patterns, achieving scores within 10% of the teacher on most benchmarks.
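A minimal sketch of how one such distillation example might be assembled for ordinary supervised fine-tuning. The <think> delimiter and the field names are illustrative conventions, not DeepSeek's documented data format.

```python
def build_distillation_example(problem: str,
                               teacher_trace: str,
                               teacher_answer: str) -> dict:
    """Package one teacher sample for next-token-prediction fine-tuning.

    The student never sees a reward signal; it simply imitates the
    teacher's full reasoning trace followed by the final answer, so the
    strategy itself becomes the training target.
    """
    return {
        "prompt": problem,
        "target": f"<think>\n{teacher_trace}\n</think>\n{teacher_answer}",
    }

example = build_distillation_example(
    problem="What is 17 * 24?",
    teacher_trace="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    teacher_answer="408",
)
print(example["target"])
```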

The practical impact: a 32B model needs roughly 64GB of GPU memory at 16-bit precision, and around 16-20GB with 4-bit quantization, making it deployable on a single high-end consumer GPU or a modest cloud instance. The full 671B R1 model requires distributed inference across multiple data-center GPUs. Distillation makes reasoning capabilities accessible to organizations that cannot afford frontier-scale deployment.
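The 64GB figure follows directly from parameter count times bytes per weight. A quick back-of-the-envelope calculation, counting weights only and ignoring KV cache and activation memory:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, using 1 GB = 1e9 bytes.
    Ignores KV cache, activations, and framework overhead."""
    return params_billion * bytes_per_param  # billions of params * bytes each

for label, bytes_pp in [("fp16/bf16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"32B weights @ {label}: ~{weight_memory_gb(32, bytes_pp):.0f} GB")
# 32B @ fp16/bf16 -> ~64 GB, int8 -> ~32 GB, int4 -> ~16 GB
```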

Reasoning tokens versus output tokens and the cost structure

R1 operates in two phases: reasoning and response. The reasoning phase generates internal tokens, often thousands for a complex problem, that are hidden from the end user in OpenAI's o1 but exposed in DeepSeek's release. The response phase generates the user-facing answer, typically a few hundred tokens. Because o1 keeps the reasoning trace hidden, users see only a short answer while the model spends a far larger token budget behind the scenes, making the true inference cost of a query easy to underestimate.

That cost grows quickly if reasoning tokens scale with problem difficulty: a sufficiently hard problem could trigger reasoning traces of 100,000 tokens, consuming inference compute equivalent to dozens of standard queries. DeepSeek's approach is to expose the reasoning tokens, allowing users to see the compute being spent and potentially cache reasoning traces for repeated similar problems.
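One simple way to exploit the exposed traces is a cache keyed on the problem text. A minimal sketch using exact matching only; a production system would match semantically similar problems rather than identical strings.

```python
import hashlib

_trace_cache: dict[str, str] = {}

def cached_reasoning(problem: str, generate) -> str:
    """Return a stored reasoning trace if this exact problem was seen
    before; otherwise call the (expensive) generator once and cache it."""
    key = hashlib.sha256(problem.strip().lower().encode()).hexdigest()
    if key not in _trace_cache:
        _trace_cache[key] = generate(problem)
    return _trace_cache[key]

# Stand-in generator; a real deployment would call the reasoning model here.
trace = cached_reasoning("What is 2 + 2?", lambda p: "2 + 2 = 4.\nAnswer: 4")
print(trace)
```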

The distilled models change the economics again. A 32B model generating 10,000 reasoning tokens is cheaper to run than a 671B model generating 1,000 reasoning tokens, even though the smaller model produces longer traces. This inverts the usual relationship where smaller models must compensate for reduced capability by generating more output.
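A rough way to see why the longer trace from the smaller model can still be cheaper: per-token inference compute scales with parameter count, at roughly 2 FLOPs per parameter per generated token. The estimate below ignores attention overhead and any mixture-of-experts sparsity in the 671B model, so treat it as an order-of-magnitude comparison only.

```python
def forward_flops(params: float, tokens: int) -> float:
    """Order-of-magnitude inference compute: ~2 FLOPs per parameter
    per generated token. Ignores attention cost, KV caching, and
    sparsity, so the result is only a rough comparison."""
    return 2.0 * params * tokens

small = forward_flops(32e9, 10_000)   # distilled 32B, long trace
large = forward_flops(671e9, 1_000)   # full 671B teacher, short trace
print(f"32B  x 10k tokens: {small:.2e} FLOPs")
print(f"671B x  1k tokens: {large:.2e} FLOPs")
print(f"teacher / student: {large / small:.1f}x")  # ~2.1x more compute
```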

Open source reasoning models and competitive dynamics

OpenAI positioned o1 as a major capability leap and priced it as a premium API tier. DeepSeek released R1 with full model weights and a family of distilled variants under a permissive license. Any organization can now deploy reasoning capabilities locally without paying per-query costs or sending data to external APIs.

The immediate competitive response was visible. Google DeepMind promoted Gemini 2.0 models with extended reasoning. Anthropic added an extended thinking mode to Claude. OpenAI shipped o3-mini at a much lower price point and widened access to its reasoning models. The release of open source reasoning models forced incumbents to compete on price and accessibility rather than maintaining exclusive capability.

For Swiss companies with data residency requirements, this is strategically significant. A hospital analyzing patient data or a bank processing transactions cannot send that information to U.S. cloud providers. A locally hosted Qwen-R1 instance provides reasoning capabilities while keeping data on-premises. The open release enables compliance with Swiss data protection regulations without sacrificing model performance.

Geopolitical implications and the CUDA bypass context

DeepSeek's development occurred under U.S. export restrictions that block Chinese companies from buying NVIDIA's most advanced GPUs. The team reportedly worked below the standard CUDA programming layer, hand-writing performance-critical kernels in PTX, NVIDIA's low-level instruction set, to extract more throughput from export-compliant hardware. This technical workaround demonstrated that hardware restrictions can be partially offset by sufficient software-level engineering effort.

The R1 release amplifies this dynamic. The U.S. strategy of limiting Chinese AI capability through hardware restrictions assumes that frontier model development requires exclusive access to cutting-edge chips. If Chinese labs can train competitive models on older hardware through algorithmic efficiency, and then distill those models to sizes that run on consumer hardware, the hardware export strategy becomes less effective.

The broader implication is that AI capability is becoming harder to contain through supply chain control. Model weights are digital files that copy instantly. Training techniques are published in academic papers. Open source releases like R1 ensure that reasoning capabilities are globally accessible regardless of hardware availability. This does not eliminate the advantage of better hardware; it means the advantage is smaller than previously assumed.

Limitations of pure outcome-based RL

R1's outcome supervision has structural limitations. It works well for domains with verifiable correct answers: mathematics, coding, logic puzzles, factual questions with objective answers. It does not work for tasks where correctness is subjective or context-dependent: creative writing, strategic advice, ethical reasoning, or any domain where multiple valid answers exist.

For these tasks, process supervision or human preference modeling is necessary. This is where OpenAI's approach likely remains stronger. If o1 uses process supervision, it can learn reasoning patterns for open-ended problems where outcome supervision provides no signal. The technical report for R1 does not detail performance on subjective tasks, suggesting this is a known weakness.

The practical takeaway for deployment: R1 and its distilled variants are strongest in technical domains with objective correctness criteria. They are suitable for code generation, mathematical problem-solving, data analysis, and factual question answering. They are less suitable for tasks requiring nuanced judgment, cultural sensitivity, or creative synthesis where human feedback on reasoning quality is essential.

What distillation means for model accessibility

The broader trend: reasoning is compressible in ways that raw knowledge is not. A frontier model trained on trillions of tokens develops both broad knowledge and problem-solving strategies. Distillation preserves more of the strategies than the knowledge, because strategies can be demonstrated explicitly through reasoning traces while knowledge must be memorized.

This creates a two-tier model ecosystem. Large frontier models serve as teachers, maintaining comprehensive knowledge and developing new capabilities. Small distilled models serve as deployment targets, providing specialized reasoning for specific domains at lower cost. The teacher models remain expensive to train and operate. The student models become commodity infrastructure.

For Switzerland's AI strategy, this suggests investing in deployment infrastructure for distilled models rather than attempting to train frontier models domestically. A national compute initiative could provide subsidized access to GPUs for running Qwen-R1 or similar open source reasoning models, enabling universities, hospitals, and government agencies to use reasoning capabilities without dependence on foreign API providers. Training a competitive frontier model would cost millions and require ongoing investment to remain current. Deploying distilled models costs thousands and benefits from global open source development.

What is pure reinforcement learning for reasoning?

Pure RL for reasoning means training the model to solve problems through trial and error with only outcome-based rewards (correct/incorrect answers), without providing human-written examples of reasoning steps. The model must discover its own problem-solving strategies. OpenAI's o1 likely used process supervision (humans rating intermediate steps); DeepSeek R1 used only outcome supervision, which is cheaper to implement.

How does distillation work for reasoning models?

Distillation trains a smaller student model to imitate a larger teacher model's behavior. For R1, the 671B parameter teacher generated reasoning traces (chains of thought), and a 32B Qwen student was trained to reproduce both the reasoning process and final answers. The student achieves 90% of teacher performance at 1/20th the parameter count, making reasoning capabilities deployable at much lower cost.

Why does distillation preserve reasoning better than other capabilities?

Reasoning is more compressible than knowledge because it's procedural rather than factual. A small model cannot memorize all of Wikipedia, but it can learn the logical steps for solving a problem if those steps are made explicit through chain-of-thought traces. The teacher model externalizes its reasoning process, making it easier for the student to learn the strategy rather than just the answer.

What are reasoning tokens versus output tokens?

Reasoning tokens are the model's internal scratchpad: intermediate steps produced while solving the problem that are not part of the final answer. Output tokens are the user-facing response. R1 produces long reasoning traces (thousands of tokens for complex problems) before generating a concise answer. Users read only the short final response, but the model consumes far more reasoning tokens behind the scenes to improve accuracy.

How does R1's approach differ from OpenAI o1?

OpenAI o1 likely uses process supervision (human raters evaluate intermediate reasoning steps) and keeps the reasoning process hidden. DeepSeek R1 uses outcome supervision only (correct/incorrect final answers) and exposes reasoning traces publicly. Both achieve similar performance, but R1's approach requires less human annotation labor and provides transparency that enables distillation.

What are the business implications for Swiss companies?

Distilled reasoning models like Qwen-R1 (32B) can run on local hardware at Swiss companies without sending data to external APIs. This addresses data residency requirements for healthcare, finance, and government while providing reasoning capabilities previously exclusive to cloud-based frontier models. The open release enables competitive alternatives to OpenAI's proprietary o1 API.


