Joel P. Barmettler

AI Architect & Researcher

2025·Technical Deep Dive

DeepSeek V3's architectural innovations

DeepSeek V3 demonstrated that a model competitive with GPT-4 can be trained for $5.5 million in compute costs, roughly one-twentieth to one-fortieth of the estimated cost of training GPT-4. This was not achieved through a single breakthrough but through the systematic application of efficiency techniques developed by the research community over the past two years. The result is a 671-billion-parameter model that behaves, for inference purposes, like a 37-billion-parameter model.

The $5.5 million figure and what it actually means

The technical report states that DeepSeek V3 required 2.788 million H800 GPU-hours, trained on a cluster of 2,048 H800 GPUs over roughly two months. At market rental rates for cloud GPU instances ($2 per H800-hour), this equals $5.576 million. This number represents the cost of the final training run only.

It excludes researcher salaries, failed experiments, data collection and curation, infrastructure beyond the GPUs themselves, and the depreciation cost if you own the hardware rather than renting it. A realistic all-in cost is likely 2 to 5 times higher. The figure remains useful because it provides an apples-to-apples comparison across models using the same methodology.
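
The arithmetic behind the headline number is worth doing once explicitly. The $2 rental rate is the assumption quoted in the technical report, not a disclosed contract price:

```python
# Back-of-the-envelope check of the headline training-cost figure.
gpu_hours = 2.788e6   # H800 GPU-hours reported for the final training run
rate_usd = 2.00       # assumed market rental rate per H800-hour

print(f"Total: ${gpu_hours * rate_usd / 1e6:.3f}M")                     # -> Total: $5.576M
print(f"Wall clock: {gpu_hours / (2048 * 24):.0f} days on 2,048 GPUs")  # -> ~57 days
```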

For context, Meta's Llama 3 405B reportedly cost around $100 million to train. OpenAI has never disclosed GPT-4's training cost, but estimates place it in the $100-200 million range. DeepSeek's achievement is reducing this by well over an order of magnitude while maintaining competitive performance. The question is how.

Mixture-of-experts: 671B parameters, 37B active

The core architectural decision is mixture-of-experts (MoE). Instead of a single large neural network that processes every token through every parameter, an MoE model consists of many small specialist networks, called experts, plus a routing mechanism that selects which experts to activate for each token.

DeepSeek V3 has 256 routed experts per MoE layer, plus one shared expert that every token passes through. For each token, the router activates 8 of the routed experts. This means that out of 671 billion total parameters, only 37 billion are active for any given computation. From an inference cost perspective, the model behaves like a 37B dense model. From a capacity perspective, it has access to 671B parameters worth of learned knowledge, because different tokens activate different experts.

The analogy: instead of one generalist doctor treating every patient, you have a hospital with 256 specialists. Each patient is routed to the 8 most relevant specialists based on their symptoms. The hospital has far more total expertise than any individual doctor, but the cost per patient is only the cost of consulting 8 specialists (not all 256).
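
A minimal PyTorch sketch of this routing pattern may make it concrete. The layer sizes, the class name, and the softmax over the selected scores are illustrative assumptions, not DeepSeek's implementation, which also adds the shared expert and the bias-based balancing discussed below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative top-k mixture-of-experts layer (not DeepSeek's actual code)."""

    def __init__(self, d_model=1024, d_hidden=2048, n_experts=256, top_k=8):
        super().__init__()
        self.top_k = top_k
        # One small feed-forward "expert" network per slot.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x):                                    # x: (n_tokens, d_model)
        scores = self.router(x)                               # (n_tokens, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)                # normalize over the chosen experts only

        out = torch.zeros_like(x)
        for slot in range(self.top_k):                         # only top_k experts run per token
            for e in top_idx[:, slot].unique().tolist():
                mask = top_idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```

In production systems the per-expert Python loop is replaced by grouped matrix multiplications that process all tokens assigned to an expert in one batch; the routing logic itself is the same.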

Why MoE works: specialization through routing

The routing mechanism learns which experts are relevant for which inputs. In theory, you might expect one expert to specialize in mathematics, another in code generation, another in creative writing. In practice, the specialization is more subtle and not easily interpretable by humans.

What is clear from performance data is that the model learns to route effectively. Mistral's Mixtral 8x7B, an earlier MoE architecture, demonstrated this: 8 experts of roughly 7B parameters each, with 2 activated per token, achieved performance competitive with dense 70B models at roughly one-fifth of the inference cost. DeepSeek V3 extends this approach with more experts (256 vs 8) and more sophisticated routing.

The training challenge for MoE is load balancing. If the router learns to send most tokens to the same few experts, the other experts remain undertrained and the model degenerates to a dense network with wasted capacity. Early MoE implementations used auxiliary loss functions to penalize imbalanced routing, but this created optimization conflicts between "predict the next token accurately" and "use all experts equally."

Auxiliary-loss-free load balancing

DeepSeek's contribution here is an auxiliary-loss-free balancing strategy. Instead of adding an explicit balancing objective to the loss function, the router maintains a per-expert bias term that is added to the affinity scores only when selecting the top-k experts; it does not affect the weights used to mix their outputs. After each training step, the bias of overloaded experts is nudged down and the bias of underloaded experts is nudged up, so balance emerges from the routing mechanics rather than from a competing loss term.

The distinction matters because it simplifies training. A multi-objective loss function requires hyperparameter tuning to balance competing objectives: how much do we care about accuracy versus balance? A balancing mechanism that works without an explicit loss term removes this tuning problem and makes training more stable.
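
Under my reading of the technical report, a sketch of the mechanism looks roughly like this. The function names, the sign-based update, and the step size are illustrative assumptions; the two essential points are that the bias influences only which experts get selected, not how their outputs are weighted, and that it is updated from observed load outside the gradient path:

```python
import torch

def route_with_bias(scores, bias, top_k=8):
    """Pick experts using bias-adjusted scores; weight their outputs with the raw scores.

    scores: (n_tokens, n_experts) raw router affinities
    bias:   (n_experts,) balancing bias, no gradient attached
    """
    _, top_idx = (scores + bias).topk(top_k, dim=-1)   # bias shifts the *selection*...
    gate = torch.gather(scores, -1, top_idx)           # ...but not the mixing weights
    gate = gate / gate.sum(dim=-1, keepdim=True)
    return top_idx, gate

def update_bias(bias, top_idx, n_experts, gamma=1e-3):
    """After each step, push overloaded experts' bias down and underloaded experts' bias up."""
    counts = torch.bincount(top_idx.flatten(), minlength=n_experts).float()
    bias += gamma * torch.sign(counts.mean() - counts)  # no auxiliary loss term anywhere
    return bias
```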

Multi-head latent attention for dimensionality reduction

Attention mechanisms in transformers project input tokens into query, key, and value representations, then compute attention scores. Standard attention stores full-dimensional keys and values for every previous token (the KV cache). Multi-head latent attention (MLA) instead compresses keys and values into a much smaller shared latent vector, caches only that latent, and reconstructs the per-head keys and values from it when attention is computed.

This drastically shrinks the KV cache and reduces memory bandwidth requirements. At long context lengths, it is the KV cache rather than the weights that dominates GPU memory and bandwidth during inference; by caching a compressed latent instead of full keys and values, DeepSeek V3 achieves similar representational capacity with far less memory per token of context.
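
A simplified sketch of the idea, with made-up dimensions: only the small latent is cached, so cache size scales with the latent dimension rather than with heads times head dimension. DeepSeek's actual MLA also routes rotary position information through a separate decoupled path and compresses queries during training; both are omitted here, as is the causal mask:

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Simplified multi-head latent attention sketch (illustrative, not DeepSeek's module)."""

    def __init__(self, d_model=7168, n_heads=16, d_head=128, d_latent=512):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.q_proj = nn.Linear(d_model, n_heads * d_head)
        self.kv_down = nn.Linear(d_model, d_latent)        # compress: d_model -> d_latent
        self.k_up = nn.Linear(d_latent, n_heads * d_head)  # reconstruct keys on the fly
        self.v_up = nn.Linear(d_latent, n_heads * d_head)  # reconstruct values on the fly
        self.out_proj = nn.Linear(n_heads * d_head, d_model)

    def forward(self, x, latent_cache=None):               # x: (batch, new_tokens, d_model)
        b, t, _ = x.shape
        latent = self.kv_down(x)                            # this small tensor is all we cache
        if latent_cache is not None:
            latent = torch.cat([latent_cache, latent], dim=1)

        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)

        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out), latent                   # latent becomes the new cache
```

Under these illustrative dimensions, caching 512 values per token instead of 16 x 128 keys plus 16 x 128 values is roughly an 8x reduction in cache size.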

The tradeoff is increased architectural complexity (more projection matrices, more intermediate activations), but the net effect is faster training and lower inference cost. The idea of attending through a compressed latent space has precedents (Google's Perceiver is an earlier example), but MLA itself was introduced by DeepSeek in DeepSeek-V2 and is applied here systematically at frontier scale.

FP8 mixed precision training: the memory bandwidth bottleneck

Standard neural network training uses 32-bit floating point numbers (FP32) for weights, activations, and gradients. This provides high numerical precision but consumes significant memory and memory bandwidth. Modern GPUs are often bandwidth-limited rather than compute-limited: they spend more time moving data between memory and processors than performing calculations.

Mixed precision training uses lower-precision formats for some operations. FP16 (16-bit) has been standard for years. DeepSeek V3 pushes this further with FP8 (8-bit floating point), which reduces memory bandwidth by 4x compared to FP32.

The challenge is that not all operations tolerate low precision. Gradients during backpropagation, for example, can become unstable if quantized too aggressively. DeepSeek's approach is selective: attention projections, MLP intermediate states, and expert activations use FP8, while optimizer states, gradients, and critical accumulation operations remain in higher precision.

This is not a post-training quantization technique where you compress a trained model. The model is trained in mixed precision from the start, which allows it to learn representations that are robust to quantization noise.
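
A toy sketch of the quantize-compute-rescale pattern, using a single per-tensor scale for simplicity. DeepSeek's actual recipe uses finer-grained block-wise scaling and performs the high-precision accumulation inside the FP8 tensor-core kernel; here that is emulated by upcasting before the matrix multiply:

```python
import torch

FP8_MAX = 448.0  # largest finite value representable in the e4m3 format

def quantize_fp8(x):
    """Scale a tensor into FP8 (e4m3) range and cast it down to one byte per value."""
    scale = FP8_MAX / x.abs().max().clamp(min=1e-12)
    return (x * scale).to(torch.float8_e4m3fn), scale

def fp8_matmul(a, b):
    """Store and move the operands in FP8, but accumulate the product in full precision."""
    a_fp8, a_scale = quantize_fp8(a)
    b_fp8, b_scale = quantize_fp8(b)
    return (a_fp8.float() @ b_fp8.float()) / (a_scale * b_scale)

x = torch.randn(128, 4096)
w = torch.randn(4096, 4096)
err = (fp8_matmul(x, w) - x @ w).abs().mean()
print(f"mean abs error vs FP32: {err.item():.4f}")  # small, but not zero: quantization noise
```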

Multi-token prediction: training for lookahead

Standard language model training predicts one token at a time: given the sequence so far, predict the next token. DeepSeek V3 adds a multi-token prediction objective during training: each position also predicts one token beyond the immediate next one, through a small extra module.

This does not change inference behavior. The model still generates one token at a time when deployed. But training with multi-token prediction forces the model to plan ahead. To predict token N+2, the model cannot rely on first seeing token N+1; it must anticipate the likely trajectory of the sequence.

Empirically, multi-token prediction improves training stability and produces models that are better at long-range reasoning tasks. The computational cost is modest because the additional prediction reuses the main model's hidden states, embedding layer, and output head; only a small extra module sits on top.
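
A rough sketch of what a depth-one multi-token-prediction loss can look like in training code. The `trunk` argument stands in for the main transformer (assumed to return per-position hidden states), and the extra block, class name, and loss weighting are illustrative assumptions rather than DeepSeek's exact module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictionLoss(nn.Module):
    """Illustrative depth-one multi-token-prediction objective."""

    def __init__(self, trunk, d_model, vocab_size, mtp_weight=0.3):
        super().__init__()
        self.trunk = trunk                             # main transformer, shared
        self.extra_block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)  # shared between both objectives
        self.mtp_weight = mtp_weight

    def forward(self, tokens):                         # tokens: (batch, seq)
        h = self.trunk(tokens)                         # (batch, seq, d_model)

        # Main objective: position i predicts token i+1.
        logits_next = self.lm_head(h)
        loss_next = F.cross_entropy(
            logits_next[:, :-1].reshape(-1, logits_next.size(-1)),
            tokens[:, 1:].reshape(-1),
        )

        # Extra objective: a small block on the same hidden states predicts token i+2,
        # forcing the trunk to encode lookahead information.
        logits_ahead = self.lm_head(self.extra_block(h))
        loss_ahead = F.cross_entropy(
            logits_ahead[:, :-2].reshape(-1, logits_ahead.size(-1)),
            tokens[:, 2:].reshape(-1),
        )
        return loss_next + self.mtp_weight * loss_ahead
```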

The technique originated in earlier work on speculative decoding, where a small model generates multiple token candidates and a large model verifies them. DeepSeek adapted the core idea for training rather than inference.

The CUDA bypass and NVIDIA's weakening moat

A notable subplot in DeepSeek's development is that the team reportedly dropped below the standard CUDA programming layer and hand-wrote performance-critical kernels in PTX, NVIDIA's low-level GPU instruction set. U.S. export restrictions block China from purchasing NVIDIA's top-end GPUs, so DeepSeek trained on export-compliant H800s, whose chip-to-chip interconnect bandwidth is sharply reduced compared to the H100. Standard practice would be to accept that limitation and use the stock software stack.

Instead, DeepSeek optimized communication and compute kernels close to the hardware, reclaiming much of the performance the restrictions were meant to take away. This required significant engineering effort but demonstrates that reliance on NVIDIA's high-level software stack, while strong, is not absolute.

The broader implication: as compute becomes the primary constraint on AI development, the ability to optimize at the hardware level becomes strategic. NVIDIA's moat consists of both superior chips and the software ecosystem that makes those chips easy to use. If Chinese labs can match or exceed CUDA performance through manual optimization, then NVIDIA's competitive advantage erodes and alternative GPU vendors (AMD, Huawei) become more viable.

Why training data remains undisclosed

DeepSeek's technical report describes architecture, training procedures, and performance benchmarks in detail. It does not disclose the training data. This is consistent with every frontier model: OpenAI, Anthropic, Google, and Meta all withhold training data composition.

The reasons are practical and legal. Data curation is valuable intellectual property. Knowing which datasets to include, how to weight them, and how to filter low-quality content is as important as model architecture. There is also legal uncertainty around copyright: training on copyrighted text scraped from the web may or may not constitute fair use, and no one wants to provide a detailed list of potentially infringing sources.

From a reproducibility standpoint, this is frustrating. You can replicate DeepSeek's architecture, but without the training data, you cannot verify that the same architecture trained on different data would achieve similar results. But data opacity is the industry norm, not a DeepSeek-specific issue.

What $5.5M training costs mean for accessibility

If a competitive frontier model costs $5.5 million to train, national research programs, large universities, and well-funded startups can afford it. That is a different accessibility profile than $100 million, which restricts development to the largest tech companies and governments.

Switzerland's annual research budget for the Swiss National Supercomputing Centre (CSCS) is approximately $50 million. Training a DeepSeek-scale model represents 10% of that budget. This is feasible. Training ten models to experiment with architectural variants is expensive but not prohibitive. Training a GPT-4-scale model at $100-200 million would consume multiple years of budget and is unlikely to be approved.

The cost reduction is not purely academic. It changes who can participate in frontier research. European research consortia, Asian universities, and even wealthy individuals can now train competitive models. The concentration of AI capability in five U.S. companies becomes less entrenched.

Open source as research accelerator and access enabler

DeepSeek released model weights and a detailed technical report but not training data. This is "open weights" rather than fully open source, but it is far more transparent than closed models. The immediate effect is that researchers worldwide can study the model, fine-tune it for specific domains, and build on its architecture.

Open releases create competitive pressure. When DeepSeek demonstrates that a $5.5M model can match GPT-4, other labs cannot justify training $100M models using outdated techniques. The efficiency frontier moves, and everyone benefits. The timing of OpenAI's pricing and product decisions around DeepSeek's releases is no coincidence. Open releases force incumbents to compete on price and performance.

The counterargument is that open models accelerate capability development without corresponding safety oversight. A valid concern, but the alternative (concentrating all frontier research in a handful of companies that answer only to shareholders) creates its own risks. The open model ecosystem is messy, legally ambiguous, and sometimes misused. It is also the only mechanism preventing total privatization of the most important technology of the decade.

How does mixture-of-experts reduce training and inference costs?

Mixture-of-experts (MoE) architectures replace large monolithic layers with many small specialist networks (experts). For each input token, a router selects which experts to activate (in DeepSeek V3, 8 out of 256, plus one shared expert). This means DeepSeek V3 has 671 billion parameters but only 37 billion are active per token, making it as cheap to run as a 37B dense model while retaining the capacity of a much larger system.

What is the $5.5 million training cost and is it accurate?

The $5.5 million figure represents GPU rental costs for the final training run: 2.788 million H800 GPU-hours at market rates. It excludes researcher salaries, failed experiments, data collection, and hardware depreciation if you own the GPUs. Real total cost is likely 2-5x higher, but the figure is still valid for comparing training efficiency across models using the same methodology.

Why is multi-token prediction used during training but not inference?

Multi-token prediction trains the model to predict tokens beyond the immediate next one (in DeepSeek V3, one additional token per position), rather than only the next token. This forces the model to plan ahead and improves stability. At inference time, the model still generates one token at a time because that's what users need, but the training-time lookahead makes each single-token prediction better informed.

What is FP8 mixed precision training?

FP8 (8-bit floating point) uses less memory per number than standard FP32 (32-bit). DeepSeek V3 selectively applies FP8 to certain layers (attention projections, MLP intermediate states) while keeping critical components like gradients and optimizer states in higher precision. This reduces memory bandwidth requirements and speeds up training without sacrificing model quality.

How did DeepSeek work around CUDA and the hardware export restrictions?

DeepSeek reportedly wrote performance-critical kernels in PTX, NVIDIA's low-level GPU instruction set, rather than relying only on the standard CUDA programming layer. This was a response to U.S. export restrictions, which limit China to reduced-bandwidth GPUs such as the H800. The workaround demonstrates that NVIDIA's software lock-in is not absolute and that alternative GPU vendors (AMD, Huawei) become more viable when engineers are willing to target hardware at a lower level.

Why doesn't DeepSeek release training data?

No frontier model provider releases full training data. Reasons include competitive advantage (data curation is valuable intellectual property), legal uncertainty (copyright status of scraped web content), and practical challenges (multi-terabyte datasets are expensive to host). DeepSeek's data opacity is standard practice, not an exception.


