Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y.X. Wei
DeepSeek-AI, Beijing, China (2025)
Abstract. The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost...
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
Symposium on Computer Architecture (ISCA ’25), June 21–25, 2025, Tokyo, Japan
DOI: 10.1145/3695053.3731412
ISBN: 979-8-4007-1261-6/2025/06
CCS Concepts: Computer systems organization → Architectures
Authors are listed in alphabetical order of their first names. Yuqing Wang and Liyue Zhang are the corresponding authors of this paper (e-mail: research@deepseek.com).
1. Introduction
1.1. Background
Large Language Models (LLMs) have undergone rapid evolution in recent years, driven by iterative advancements in model design, computational power, and data availability. In 2024, gr...
[原文]meet these challenges, industry leaders such as Alibaba, ByteDance, Google, xAI and Meta have deployed colossal training clusters (Jouppi et al., 2023 ; Mudigere et al., 2023 ; Gangidi et al., 2024 ; Jiang et al., 2024 ; Qian et al., 2024 ; xAI, 2024b ) , featuring tens or even hundreds of thousands of GPUs or TPUs. While such massive infrastructures have enabled the development of state-of-the-art models, their exorbitant costs present significant barriers for smaller research teams and organizations. Despite these barriers, open-source startups such as DeepSeek (DeepSeek-AI, 2024b , c , d , ...
[原文]aling LLMs efficiently without sacrificing performance or accessibility. Specifically, the paper focuses on: • Hardware-Driven Model Design: Analyze how hardware features, such as FP8 low-precision computation and scale-up/scale-out network properties, informed the architectural choices in DeepSeek-V3. • Mutual Dependencies Between Hardware and Models: Investigate how hardware capabilities shape model innovation and how the evolving demands of LLMs drive the need for next-generation hardware. • Future Directions for Hardware Development: Derive actionable insights from DeepSeek-V3 to guide the...
[原文]ts in BF16. 2. Design Principles for DeepSeek Models The development of DeepSeek-V3 exemplifies a hardware-aware approach to scaling LLMs, where each design decision was carefully aligned with hardware constraints to optimize performance and cost efficiency. As shown in Figure 1 , DeepSeek-V3 employs the DeepSeekMoE (DeepSeek-AI, 2024e ) and Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c ) architectures that have been proven effective in DeepSeek-V2 (DeepSeek-AI, 2024c ) . DeepSeekMoE unlocks the potential of MoE architecture, while MLA drastically reduces memory consumption by compress...
eviating the AI memory wall challenge. A detailed discussion of low-precision techniques is provided in Section 3, Low-Precision Driven Design.
2.1.2. Reducing KV Cache with MLA
For LLM inference, user requests often involve multi-turn conversations. To handle these efficiently, the context from previous requests is cached in what is commonly referred to as the KV cache. The KV cache stores the Key and Value vectors of previously processed tokens, eliminating the need to recompute them for subsequent tokens. During each inference step, the model only computes the Key a...
[原文]share a single set of KV pairs, significantly compressing KV storage. Representative methods include GQA (Ainslie et al., 2023 ) and MQA (Shazeer, 2019 ) . • Windowed KV: For long sequences, only a sliding window of KV pairs is retained in the cache, discarding results outside the window. While this reduces storage, it compromises long-context reasoning. Representative methods include Longformer (Beltagy et al., 2020 ) and related architectures. • Quantized Compression: KV pairs are stored using low-bit representations (Hooper et al., 2024 ; Liu et al., 2024 ; Kang et al., 2024 ) , further red...
attention (Yuan et al., 2025), which seek to compress and sparsely activate attention keys and values, represent another attempt at overcoming the computational challenges associated with attention. We look forward to collaborative progress with the broader community toward breakthroughs in this area.
2.2. Cost-Effectiveness of MoE Models
For sparse computing, we have developed DeepSeekMoE, an advanced Mixture of Experts (MoE) architecture, which is illustrated in the lower right part of Figure 1. The advantages of MoE models are twofold.
2.2.1. Reducing Computational Requirements for ...
[原文]odels, their exorbitant costs present significant barriers for smaller research teams and organizations. Despite these barriers, open-source startups such as DeepSeek (DeepSeek-AI, 2024b , c , d , a , 2025a ) and Mistral (Jiang et al., 2023 ; Mistral, 2024 ) are also striving to develop state-of-the-art models. Among them, DeepSeek has especially demonstrated that effective software-hardware co-design can enable cost-efficient training of large models, leveling the playing field for smaller teams. Building on this tradition, DeepSeek-V3 (DeepSeek-AI, 2024d ) represents a new milestone in cost-...
[原文]l innovation and how the evolving demands of LLMs drive the need for next-generation hardware. • Future Directions for Hardware Development: Derive actionable insights from DeepSeek-V3 to guide the co-design of future hardware and model architectures, paving the way for scalable, cost-efficient AI systems. 1.3. Structure of this Paper The remainder of this paper is organized as follows. Section 2 explores the design principles underpinning DeepSeek-V3 model architecture, highlighting key innovations such as Multi-head Latent Attention, Mixture-of-Experts optimizations and Multi-Token Predictio...
1.1. Background
Large Language Models (LLMs) have undergone rapid evolution in recent years, driven by iterative advancements in model design, computational power, and data availability. In 2024, groundbreaking models such as GPT-4o (OpenAI, 2024a), LLaMa-3 (AI@Meta, 2024a), Claude 3.5 Sonnet (Anthropic, 2024), Grok-2 (xAI, 2024a), Qwen2.5 (Yang et al., 2024), Gemini-2 (Google, 2024), and our DeepSeek-V3 (DeepSeek-AI, 2024d) have showcased remarkable progress, further narrowing the gap towards Artificial General Intelligence (AGI). As the Scaling Laws (Kaplan et al., 2020) show, increasing model s...
[原文]rbitant costs present significant barriers for smaller research teams and organizations. Despite these barriers, open-source startups such as DeepSeek (DeepSeek-AI, 2024b , c , d , a , 2025a ) and Mistral (Jiang et al., 2023 ; Mistral, 2024 ) are also striving to develop state-of-the-art models. Among them, DeepSeek has especially demonstrated that effective software-hardware co-design can enable cost-efficient training of large models, leveling the playing field for smaller teams. Building on this tradition, DeepSeek-V3 (DeepSeek-AI, 2024d ) represents a new milestone in cost-effective traini...
[原文]This paper does not aim to reiterate the detailed architectural and algorithmic specifics of DeepSeek-V3, which are extensively documented in its technical report (DeepSeek-AI, 2024d ) . Instead, it adopts a dual perspective—spanning hardware architecture and model design—to explore the intricate interplay between them in achieving cost-efficient large-scale training and inference. By examining this synergy, we aim to provide actionable insights for scaling LLMs efficiently without sacrificing performance or accessibility. Specifically, the paper focuses on: • Hardware-Driven Model Design: Ana...
[原文]The remainder of this paper is organized as follows. Section 2 explores the design principles underpinning DeepSeek-V3 model architecture, highlighting key innovations such as Multi-head Latent Attention, Mixture-of-Experts optimizations and Multi-Token Prediction Module. Section 3 illustrates how our model architecture pursues low-precision computation and communication. Section 4 includes scale-up interconnection optimizations, discusses scale-up/scale-out convergence, and explores how hardware features influence parallelism and expert selection strategies. Section 5 focuses on scale-out net...
[原文]The development of DeepSeek-V3 exemplifies a hardware-aware approach to scaling LLMs, where each design decision was carefully aligned with hardware constraints to optimize performance and cost efficiency. As shown in Figure 1 , DeepSeek-V3 employs the DeepSeekMoE (DeepSeek-AI, 2024e ) and Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c ) architectures that have been proven effective in DeepSeek-V2 (DeepSeek-AI, 2024c ) . DeepSeekMoE unlocks the potential of MoE architecture, while MLA drastically reduces memory consumption by compressing Key-Value (KV) caches. In addition, DeepSeek-V3 i...
cussion of low-precision techniques is provided in Section 3, Low-Precision Driven Design.
2.1.2. Reducing KV Cache with MLA
For LLM inference, user requests often involve multi-turn conversations. To handle these efficiently, the context from previous requests is cached in what is commonly referred to as the KV cache. The KV cache stores the Key and Value vectors of previously processed tokens, eliminating the need to recompute them for subsequent tokens. During each inference step, the model only computes the Key and Value vectors for the current token and performs a...
[原文]sing KV storage. Representative methods include GQA (Ainslie et al., 2023 ) and MQA (Shazeer, 2019 ) . • Windowed KV: For long sequences, only a sliding window of KV pairs is retained in the cache, discarding results outside the window. While this reduces storage, it compromises long-context reasoning. Representative methods include Longformer (Beltagy et al., 2020 ) and related architectures. • Quantized Compression: KV pairs are stored using low-bit representations (Hooper et al., 2024 ; Liu et al., 2024 ; Kang et al., 2024 ) , further reducing memory usage. Quantization achieves significant...
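To make the trade-offs between these KV-reduction approaches concrete, the per-token cache footprint can be estimated with simple arithmetic. The sketch below is illustrative only: the model shape (32 layers, 32 heads, head dimension 128), the latent width of 576, and the function names are hypothetical and are not taken from the paper.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: float) -> float:
    """Bytes cached per token when every layer stores one Key and one Value per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def latent_bytes_per_token(layers: int, latent_dim: int, bytes_per_elem: float) -> float:
    """Bytes cached per token when each layer stores a single shared latent vector."""
    return layers * latent_dim * bytes_per_elem


def windowed_cache_bytes(per_token_bytes: float, seq_len: int, window: int) -> float:
    """Total cache size when only the most recent `window` tokens are retained."""
    return per_token_bytes * min(seq_len, window)


# Hypothetical model shape, chosen only to make the ratios easy to read.
LAYERS, HEADS, HEAD_DIM = 32, 32, 128
BF16_BYTES = 2

sizes = {
    "MHA (32 KV heads)": kv_bytes_per_token(LAYERS, HEADS, HEAD_DIM, BF16_BYTES),
    "GQA (8 KV heads)": kv_bytes_per_token(LAYERS, 8, HEAD_DIM, BF16_BYTES),
    "MQA (1 KV head)": kv_bytes_per_token(LAYERS, 1, HEAD_DIM, BF16_BYTES),
    "GQA + 8-bit KV": kv_bytes_per_token(LAYERS, 8, HEAD_DIM, 1),
    "latent, dim 576": latent_bytes_per_token(LAYERS, 576, BF16_BYTES),
}

if __name__ == "__main__":
    for name, size in sizes.items():
        print(f"{name:<18} {size / 1024:8.1f} KiB per token")
```

Sharing, quantization, and latent compression all shrink the per-token term, whereas windowing caps the number of tokens kept, which is why it alone trades away long-context information.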
ess and sparsely activate attention keys and values, represent another attempt at overcoming the computational challenges associated with attention. We look forward to collaborative progress with the broader community toward breakthroughs in this area.
2.2. Cost-Effectiveness of MoE Models
For sparse computing, we have developed DeepSeekMoE, an advanced Mixture of Experts (MoE) architecture, which is illustrated in the lower right part of Figure 1. The advantages of MoE models are twofold.
2.2.1. Reducing Computational Requirements for Training
The primary advantage of the MoE architectu...
[原文]unique advantages in single-request scenarios. Because only a subset of parameters is activated per request, memory and computational demands are greatly reduced. For example, DeepSeek-V2 (236B parameters) activates just 21B parameters during inference. This enables PCs with AI SoC chips (Apple, 2024 ; NVIDIA, 2025 ; AMD, 2025 ) to achieve nearly 20 tokens per second (TPS), or even twice that speed, which is more than sufficient for personal use. In contrast, dense models of similar capability (e.g., 70B parameters) typically reach only single-digit TPS on similar hardware. Notably, the increa...
[原文]communication step. This pipelined approach enables seamless overlap of all-to-all communication with ongoing computation, ensuring that the GPU remains fully utilized at all times. Moreover, in production, we adopt a prefill and decode disaggregation architecture (Zhong et al., 2024 ) , assigning large batch size prefill and latency-sensitive decode requests to different expert parallelism group sizes. This strategy ultimately maximizes system throughput under real-world service conditions. Table 2. Comparison of computational costs for training MoE and dense models: Computational cost per to...
hat each device processes an equal batch size during expert parallelism, allowing the communication time to be easily calculated. For a system interconnected with CX7 400Gbps InfiniBand (IB) NICs, the time required for the two all-to-all communications in EP is calculated as follows:
$$\text{Comm. Time}=(1\text{Byte}+2\text{Bytes})\times 32\times 9\times 7\text{K}/50\text{GB/s}=120.96\mu s$$
Here, dispatch uses FP8 (1 byte), while combine uses BF16 (2 bytes), and the hidden size of each token is approximately 7K. T...
ep drops to:
$$\text{Comm. Time}=(1\text{Byte}+2\text{Bytes})\times 32\times 9\times 7\text{K}/900\text{GB/s}=6.72\mu s$$
Assuming the computation time is equal to the communication time, this reduces the total inference time significantly, giving a theoretical best-case TPOT of about 0.82 ms, or approximately 1200 tokens per second. While this figure is purely theoretical and has not been empirically validated, it vividly illustrates the transformative potential of high-bandwidth scale-up networks in accelerating...
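The same bound can be evaluated for any link bandwidth, and inverted to ask what bandwidth a target decoding speed would need. This is a minimal sketch under stated assumptions: decimal units (7K = 7000 elements, 1 GB/s = 1e9 bytes/s), computation time equal to communication time, and 61 layers. The function names are illustrative.

```python
# Sketch: how the EP all-to-all bound scales with link bandwidth.
# Assumed units: 7K = 7000 elements, 1 GB/s = 1e9 bytes/s.
PAYLOAD_BYTES = (1 + 2) * 32 * 9 * 7000   # FP8 dispatch + BF16 combine, as in the formula
LAYERS = 61                                # DeepSeek-V3 layer count


def tpot_ms(bandwidth_gb_s: float) -> float:
    """Best-case time per output token, assuming compute time equals comm time."""
    comm_s = PAYLOAD_BYTES / (bandwidth_gb_s * 1e9)
    return LAYERS * 2 * comm_s * 1e3


def required_bandwidth_gb_s(target_tps: float) -> float:
    """Invert the bound: bandwidth needed to reach a target tokens-per-second rate."""
    comm_s = 1.0 / target_tps / LAYERS / 2
    return PAYLOAD_BYTES / comm_s / 1e9


if __name__ == "__main__":
    for bw in (50, 900):
        t = tpot_ms(bw)
        print(f"{bw:>4} GB/s: TPOT = {t:.2f} ms, about {1000 / t:.0f} tokens/s")
    print(f"bandwidth for 1200 tokens/s: {required_bandwidth_gb_s(1200):.0f} GB/s")
```

Because the payload per step is fixed, the bound is inversely proportional to bandwidth, so moving from 50 GB/s to 900 GB/s shortens it by a factor of 18.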
[原文]oken, which increases the generation TPS by 1.8x compared to the scenario without the MTP module. Moreover, by predicting multiple tokens per step, MTP increases the inference batch size, which is crucial for boosting EP computational intensity and hardware utilization. Such algorithmic innovations are vital for fast and cost-effective inference in DeepSeek-V3. 2.3.4. High Inference Speed for Reasoning Models and Test-Time Scaling Test-time scaling in LLMs, exemplified by OpenAI’s o1/o3 series (OpenAI, 2024b , 2025 ) , has enabled significant advances in mathematical reasoning, programming, an...
[原文]P8 mixed-precision computation, and network co-designed MoE gate routing. Given the prohibitive cost of exhaustive ablation on full-scale models, we adopt a hierarchical and resource-efficient validation pipeline. Each technique is first validated extensively on small-scale models, followed by minimal large-scale tuning, and finally integrated in a single, comprehensive training run. For instance, we first conducted fine-grained FP8 training ablation studies on both 16B and 230B DeepSeek-V2 models before final integration. Under these controlled settings, the relative accuracy loss compared to...
LLMs generally require significant memory resources, with memory demands increasing by more than 1000% per year. In contrast, the growth rate of high-speed memory (e.g., HBM) capacity is much slower, typically less than 50% per year (Gholami et al., 2024). While multi-node parallelism is a viable solution to address memory limitations, optimizing memory usage at the source remains a crucial and effective strategy.
2.1.1. Low-Precision Models
Compared to models that utilize BF16 for weights, FP8 halves weight memory consumption, effectively alleviating the AI memory wall chall...
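The halving follows directly from bytes per parameter. As a rough sketch, the snippet below uses the 236B total parameter count quoted for DeepSeek-V2 elsewhere in this document and counts weights only; quantization scale factors, activations, optimizer state, and the KV cache are deliberately ignored, and the helper name is illustrative.

```python
def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


PARAMS = 236_000_000_000  # DeepSeek-V2 total parameters, as quoted in the text

bf16_gb = weight_memory_gb(PARAMS, 2)  # BF16: 2 bytes per weight
fp8_gb = weight_memory_gb(PARAMS, 1)   # FP8: 1 byte per weight

if __name__ == "__main__":
    print(f"BF16 weights: {bf16_gb:.0f} GB")
    print(f"FP8 weights:  {fp8_gb:.0f} GB")
```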
[原文]trained with the model. During inference, only the latent vector needs to be cached, significantly reducing memory consumption compared to storing the KV cache for all attention heads. In addition to MLA, several other approaches have been proposed to reduce the size of the KV cache. These methods are highly valuable and provide significant inspiration for advancements in memory-efficient attention mechanisms: • Shared KV (Grouped-Query Attention, GQA; Multi-Query Attention, MQA): Instead of maintaining separate KV pairs for each attention head, multiple heads share a single set of KV pairs, s...
[原文]ons and Perspectives on Resource-Efficient Techniques While reducing the size of the KV cache is a promising method for improving memory efficiency, the quadratic complexity inherent in Transformer-based autoregressive decoding remains a formidable challenge, especially for extremely long contexts. Recent research efforts, such as Mamba-2 (Dao and Gu, 2024 ) and Lightning Attention (Qin et al., 2024 ) , investigate linear-time alternatives that offer new possibilities for balancing computational cost and model performance. In addition, approaches such as sparse attention (Yuan et al., 2025 ) ,...
For sparse computing, we have developed DeepSeekMoE, an advanced Mixture of Experts (MoE) architecture, which is illustrated in the lower right part of Figure 1. The advantages of MoE models are twofold.
2.2.1. Reducing Computational Requirements for Training
The primary advantage of the MoE architecture lies in its ability to significantly reduce training costs. By selectively activating only a subset of expert parameters, MoE models allow the total parameter count to scale up dramatically while keeping computational requirements modest. For example, DeepSeek-V2 features 236B parameters...
[原文]Apple, 2024 ; NVIDIA, 2025 ; AMD, 2025 ) to achieve nearly 20 tokens per second (TPS), or even twice that speed, which is more than sufficient for personal use. In contrast, dense models of similar capability (e.g., 70B parameters) typically reach only single-digit TPS on similar hardware. Notably, the increasingly popular KTransformers (group and Approaching.AI, 2025 ) inference engine allows the complete DeepSeek-V3 model to run on a low-cost server equipped with a consumer GPU (costing approximately $10,000), while still achieving nearly 20 TPS. This efficiency makes MoE architectures suita...
[原文]2.3.1. Overlapping Computation and Communication: Maximizing Throughput Inference speed encompasses both system-wide maximum throughput and single-request latency. To maximize throughput, our model is architected from the outset to leverage dual micro-batch overlap (DeepSeek-AI, 2025d ; Zhao et al., 2025b ) , intentionally overlapping communication latency with computation. As demonstrated in our online inference system and supported by open-source profiling data (DeepSeek-AI, 2025d ) , we decouple the computation of MLA and MoE into two distinct stages. While one micro-batch executes a portio...
[原文]enhance their intelligence. For MoE models, achieving high inference speed relies on efficiently deploying expert parameters across computing devices. To achieve the fastest possible inference speed, each device should ideally perform computations for a single expert (or multiple devices should collaboratively compute a single expert if necessary). However, Expert Parallelism (EP) requires routing tokens to the appropriate devices, which involves all-to-all communication across the network. As a result, the upper limit of MoE inference speed is dictated by interconnection bandwidth. Consider a...
dual micro-batch overlap. Under this assumption, the total time per layer can be formulated as:
$$\text{Total Time Per Layer}=2\times 120.96\mu s=241.92\mu s$$
With 61 layers in DeepSeek-V3, the total inference time is:
$$\text{Total Inference Time}=61\times 241.92\mu s=14.76\text{ms}$$
Thus, the theoretical upper limit for this system is approximately 14.76 ms TPOT, equivalent to 67 tokens per second. However, in practice, factors such as communication overhead, latency, incomplete ban...
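The chain of arithmetic above can be reproduced step by step. This is a sketch under stated assumptions: decimal units (7K = 7000, 1 GB/s = 1e9 bytes/s), and the readings of the factors 32 and 9 as tokens per device and experts per token are an interpretation, since the sentence defining them is truncated in this excerpt.

```python
# Assumed units: 7K = 7000 elements, 1 GB/s = 1e9 bytes/s.
DISPATCH_BYTES = 1        # FP8 dispatch
COMBINE_BYTES = 2         # BF16 combine
TOKENS_PER_DEVICE = 32    # assumed meaning of the factor 32
EXPERTS_PER_TOKEN = 9     # assumed meaning of the factor 9
HIDDEN_SIZE = 7000        # "approximately 7K"
IB_BANDWIDTH_GB_S = 50    # one 400 Gbps NIC
NUM_LAYERS = 61


def comm_time_us(bandwidth_gb_s: float = IB_BANDWIDTH_GB_S) -> float:
    """Time for the two all-to-all transfers (dispatch + combine) of one layer."""
    payload = (DISPATCH_BYTES + COMBINE_BYTES) * TOKENS_PER_DEVICE * EXPERTS_PER_TOKEN * HIDDEN_SIZE
    return payload / (bandwidth_gb_s * 1e9) * 1e6


def layer_time_us() -> float:
    """Per-layer time when computation is assumed to take as long as communication."""
    return 2 * comm_time_us()


def tpot_ms() -> float:
    """Time per output token across all layers."""
    return NUM_LAYERS * layer_time_us() / 1e3


if __name__ == "__main__":
    print(f"comm time per layer:  {comm_time_us():.2f} us")
    print(f"total time per layer: {layer_time_us():.2f} us")
    print(f"TPOT:                 {tpot_ms():.2f} ms")
    print(f"tokens per second:    {1000 / tpot_ms():.1f}")
```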
[原文]equential bottlenecks. MTP mitigates this issue by enabling the model to generate additional candidate tokens at a lower cost and verify them in parallel, similar to previous self-drafting-based speculative decoding approaches (Cai et al., 2024 ; Li et al., 2024 ) . This framework significantly accelerates inference without compromising accuracy. As illustrated in the top part of Figure 1 , each MTP module uses a single layer, which is much more lightweight than the full model, to predict additional tokens, enabling parallel verification of multiple candidate tokens. Although slightly hurting ...
[原文]l., 2024 ) —the necessity to rapidly generate large numbers of samples makes inference throughput a critical bottleneck. Likewise, prolonged reasoning sequences can increase user wait times, reducing the practical usability of such models. As a result, optimizing inference speed through synergistic hardware and software innovations is indispensable for advancing the efficiency of reasoning models. However, effective strategies for accelerating inference and expediting RL training remain active areas of investigation, as discussed in Section 2.1.3 . We encourage the broader community to collabo...
[原文]Each acceleration technique undergoes rigorous empirical validation to evaluate its accuracy impact, including MLA, FP8 mixed-precision computation, and network co-designed MoE gate routing. Given the prohibitive cost of exhaustive ablation on full-scale models, we adopt a hierarchical and resource-efficient validation pipeline. Each technique is first validated extensively on small-scale models, followed by minimal large-scale tuning, and finally integrated in a single, comprehensive training run. For instance, we first conducted fine-grained FP8 training ablation studies on both 16B and 230B...
3.1. FP8 Mixed-Precision Training
Quantization techniques such as GPTQ (Frantar et al., 2022) and AWQ (Lin et al., 2024) have been widely used to reduce bit-widths to 8-bit, 4-bit, or even lower, significantly reducing memory requirements. However, these techniques are primarily applied during inference to save memory, rather than in the training phase. NVIDIA’s Transformer Engine has supported FP8 mixed-precision training for some time, but prior to DeepSeek-V3, there were no open-source large models leveraging FP8 for training. Through deep collaboration between our infrastructure and algori...
[原文]large dequantization overhead in transporting the partial results from Tensor Cores to CUDA Cores for scaling factor multiplication. This incurs frequent data movements, reducing computational efficiency and complicating hardware utilization. 3.1.2. Suggestions: To address the limitations of existing hardware, we have the following suggestions for future designs: • Increased Accumulation Precision: Hardware should improve the accumulation register precision to an appropriate value (e.g. FP32), or support a configurable accumulation precision, enabling a trade-off between performance and accura...
the number of bits with the leading 1 bit as the sign bit $S$. By mapping the activations from the original Linear space to the Log space, the distribution of the activations is more uniform. To be specific, given a tile of elements, $[x_{1},\cdots,x_{m}]$, which is 1x128 in our implementation, we take the absolute values and compute the logarithm of all the elements, and find the minimum $min=log(abs(x_{i}))$ and maximum $max=log(abs(x_{j}))$. The minimum is encoded as $S.00\cdots$ ...
[原文]Quantization techniques such as GPTQ (Frantar et al., 2022 ) and AWQ (Lin et al., 2024 ) have been widely used to reduce bit-widths to 8-bit, 4-bit, or even lower, significantly reducing memory requirements. However, these techniques are primarily applied during inference to save memory, rather than in the training phase. NVIDIA’s Transformer Engine has supported FP8 mixed-precision training for some time, but prior to DeepSeek-V3, there were no open-source large models leveraging FP8 for training. Through deep collaboration between our infrastructure and algorithm teams, and after extensive e...
[原文]n transporting the partial results from Tensor Cores to CUDA Cores for scaling factor multiplication. This incurs frequent data movements, reducing computational efficiency and complicating hardware utilization. 3.1.2. Suggestions: To address the limitations of existing hardware, we have the following suggestions for future designs: • Increased Accumulation Precision: Hardware should improve the accumulation register precision to an appropriate value (e.g. FP32), or support a configurable accumulation precision, enabling a trade-off between performance and accuracy for different requirements o...
[原文]In the current DeepSeek-V3 architecture, we employ low-precision compression for network communication. During EP parallelism, tokens are dispatched using fine-grained FP8 quantization, reducing communication volume by 50% compared to BF16. This significantly lowers communication time. While the combine stage still uses higher precision (e.g., BF16) due to accuracy requirements, we are actively testing FP8, custom precision formats (e.g., E5M6) and mixing FP8-BF16 for further reductions. Besides these traditional floating point formats, we also tried a new data type, named Logarithmic Floating...
ortant to round in the original Linear space, instead of the Log space, for the unbiased activation quantization. We also constrain the $min$ to be larger than $max-log(2^{32})$, which means that the maximum representation range is similar to E5, a floating-point format with 5 exponent bits. We validate our LogFMT-nBit on dense language models with around 7 billion parameters, by quantizing the output of the residual branch to simulate the combine stage in MoE models. When setting $n=8$, using the same number of bits as FP8, the LogFMT-8Bit shows superior training accu...
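A small sketch can illustrate the idea of a per-tile logarithmic format with linear-space rounding. It is not the authors' implementation: the encoding details are truncated in this excerpt, so the code layout below is an assumption (code 0 reserved for zero, code 1 for $min$, the largest code for $max$, evenly spaced log levels in between), as is clamping magnitudes that fall below the constrained $min$. Only the $max-log(2^{32})$ constraint and the rounding in linear space come from the text.

```python
import math


def logfmt_encode(tile, n_bits=8):
    """Encode a tile as (sign, k) pairs. Assumed layout: k == 0 means zero,
    k == 1 is the tile minimum, k == 2**(n_bits - 1) - 1 is the tile maximum."""
    top = 2 ** (n_bits - 1) - 1
    logs = [math.log(abs(x)) for x in tile if x != 0.0]
    if not logs:
        return [(0, 0)] * len(tile), 0.0, 0.0
    log_max = max(logs)
    # Constraint from the text: min is kept no smaller than max - log(2^32).
    log_min = max(min(logs), log_max - 32 * math.log(2.0))
    step = (log_max - log_min) / (top - 1) if top > 1 and log_max > log_min else 0.0

    def level(k):
        """Linear-space magnitude represented by code k >= 1."""
        return math.exp(log_min + step * (k - 1))

    codes = []
    for x in tile:
        if x == 0.0:
            codes.append((0, 0))
            continue
        a = abs(x)
        sign = 1 if x < 0 else 0
        if step == 0.0:
            k = 1
        else:
            pos = (math.log(a) - log_min) / step
            lo = min(max(int(math.floor(pos)) + 1, 1), top)  # clamp (assumption)
            hi = min(lo + 1, top)
            # Round to the nearer neighbouring level in linear space, not log space.
            k = lo if abs(a - level(lo)) <= abs(level(hi) - a) else hi
        codes.append((sign, k))
    return codes, log_min, step


def logfmt_decode(codes, log_min, step):
    """Invert logfmt_encode using the per-tile log_min and step."""
    out = []
    for sign, k in codes:
        if k == 0:
            out.append(0.0)
        else:
            mag = math.exp(log_min + step * (k - 1))
            out.append(-mag if sign else mag)
    return out


if __name__ == "__main__":
    tile = [0.5, -2.0, 0.0, 3.0, 8.0]
    codes, log_min, step = logfmt_encode(tile)
    print(codes)
    print(logfmt_decode(codes, log_min, step))
```

The per-tile `log_min` and `step` play the role of a scale factor: they must travel with the codes so the receiver can decode.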
[原文]4.1. Current Hardware Architecture The NVIDIA H800 GPU SXM architecture we currently use, illustrated in Figure 2 , is built on the Hopper architecture, similar to the H100 GPU. However, it features reduced FP64 computational performance and NVLink bandwidth for regulatory compliance. Specifically, the NVLink bandwidth in H800 SXM nodes is reduced from 900 GB/s to 400 GB/s. This significant reduction in intra-node scale-up bandwidth presents a challenge for high-performance workloads. To compensate, each node is equipped with eight 400G Infiniband (IB) CX7 NICs, enhancing scale-out capabilitie...
[原文]between scale-up (intra-node) and scale-out (inter-node) communication in the H800 architecture is approximately 4:1. Specifically, NVLink provides 200GB/s bandwidth (of which about 160GB/s can actually be achieved), while each 400Gbps IB NIC delivers only 50GB/s bandwidth (we consider small message size and latency influence, use 40GB/s for effective bandwidth). To balance and fully utilize the higher intra-node bandwidth, the model architecture is co-designed with hardware, particularly in the TopK Expert Selection Strategy . Consider a setup with 8 nodes (64 GPUs in total) and 256 routed ex...
[原文]bandwidth between intra-node (NVLink) and inter-node (IB) interconnects. In practice, GPU Streaming Multiprocessors (SM) threads are used for both network message handling (e.g., filling QPs and WQEs) and data forwarding over NVLink, consuming computational resources. For example, during training, up to 20 of the SMs on the H800 GPU are allocated for communication-related operations, leaving fewer resources available for actual computation. To maximize throughput in online inference, we perform EP all-to-all communication entirely through NIC RDMA, avoiding SM resource contention and improving...
[原文]. For example, node-limited routing strategies employed in DeepSeek-V3 can be further optimized with hardware support for dynamic traffic deduplication. We also recognize emerging interconnect protocols such as the Ultra Ethernet Consortium (UEC) (Consortium, 2023 , 2024 ) , Ultra Accelerator Link (UALink) (CONSORTIUM, 2025 ) , both of which are poised to drive advancements in scale-up and scale-out communication. More recently, Unified Bus (UB) (Liao et al., 2025 ) has introduced a novel approach to scale-up and scale-out convergence. Section 6 further explores several technical innovations p...
[原文]tion instructions to handle memory consistency issues or out-of-order packet arrivals at the hardware level. This would eliminate the need for software-based synchronization mechanisms like RDMA completion events, which introduce extra latency and increase programming complexity. Memory-semantic communication with an acquire/release mechanism is a promising implementation. By implementing these recommendations, future hardware designs can significantly enhance the efficiency of large-scale distributed AI systems while simplifying software development. 4.5. Bandwidth Contention and Latency 4.5....
[原文]The NVIDIA H800 GPU SXM architecture we currently use, illustrated in Figure 2 , is built on the Hopper architecture, similar to the H100 GPU. However, it features reduced FP64 computational performance and NVLink bandwidth for regulatory compliance. Specifically, the NVLink bandwidth in H800 SXM nodes is reduced from 900 GB/s to 400 GB/s. This significant reduction in intra-node scale-up bandwidth presents a challenge for high-performance workloads. To compensate, each node is equipped with eight 400G Infiniband (IB) CX7 NICs, enhancing scale-out capabilities to mitigate the bandwidth deficit...
[原文]To align with the constraints of the H800 architecture, the following parallelism strategies were considered to optimize the performance of DeepSeek-V3: • Avoidance of Tensor Parallelism (TP): Tensor Parallelism is avoided during training due to its inefficiency under limited NVLink bandwidth. However, during inference, TP can still be selectively used to reduce latency and improve TPOT performance. • Enhanced Pipeline Parallelism (PP): DualPipe (DeepSeek-AI, 2025b ) is employed to overlap attention and MoE computation with MoE communication. This also reduces pipeline bubbles and balances mem...
[原文]The bandwidth disparity between scale-up (intra-node) and scale-out (inter-node) communication in the H800 architecture is approximately 4:1. Specifically, NVLink provides 200GB/s bandwidth (of which about 160GB/s can actually be achieved), while each 400Gbps IB NIC delivers only 50GB/s bandwidth (we consider small message size and latency influence, use 40GB/s for effective bandwidth). To balance and fully utilize the higher intra-node bandwidth, the model architecture is co-designed with hardware, particularly in the TopK Expert Selection Strategy . Consider a setup with 8 nodes (64 GPUs in ...
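The expert selection strategy is only named in this excerpt, so the following is a hedged sketch of one way a node-limited top-k selection could work. Only the 8 nodes and 256 routed experts come from the text. The choice of 8 routed experts per token, the cap of 4 nodes, and ranking nodes by the sum of their two best scores are assumptions made for illustration, and the function names are hypothetical.

```python
import random

NUM_NODES = 8                                   # from the text
NUM_EXPERTS = 256                               # from the text
EXPERTS_PER_NODE = NUM_EXPERTS // NUM_NODES     # 32 experts hosted per node


def node_limited_topk(scores, top_k=8, max_nodes=4):
    """Pick top_k experts for one token while touching at most max_nodes nodes.

    top_k, max_nodes and the node-ranking rule are illustrative assumptions.
    """
    per_node = max(top_k // max_nodes, 1)
    node_scores = []
    for node in range(NUM_NODES):
        start = node * EXPERTS_PER_NODE
        local = sorted(scores[start:start + EXPERTS_PER_NODE], reverse=True)
        node_scores.append((sum(local[:per_node]), node))
    allowed = {node for _, node in sorted(node_scores, reverse=True)[:max_nodes]}
    candidates = [e for e in range(NUM_EXPERTS) if e // EXPERTS_PER_NODE in allowed]
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]


def nodes_touched(experts):
    """Set of nodes that must receive this token over the scale-out network."""
    return {e // EXPERTS_PER_NODE for e in experts}


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [rng.random() for _ in range(NUM_EXPERTS)]
    plain = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)[:8]
    limited = node_limited_topk(scores)
    print("unrestricted top-8 touches", len(nodes_touched(plain)), "nodes")
    print("node-limited top-8 touches", len(nodes_touched(limited)), "nodes")
```

Capping the number of destination nodes bounds how many copies of a token cross the slower inter-node links; the remaining fan-out to experts on the same node can use the faster intra-node links.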
[原文]4.4.1. Limitations of Current Implementations While the Node-Limited Routing strategy reduces communication bandwidth requirements, it complicates communication pipeline kernel implementations due to the disparity in bandwidth between intra-node (NVLink) and inter-node (IB) interconnects. In practice, GPU Streaming Multiprocessors (SM) threads are used for both network message handling (e.g., filling QPs and WQEs) and data forwarding over NVLink, consuming computational resources. For example, during training, up to 20 of the SMs on the H800 GPU are allocated for communication-related operatio...
framework. By incorporating dedicated co-processors for network traffic management and seamless forwarding between NVLink and IB domains, such designs can reduce software complexity and maximize bandwidth utilization. For example, the node-limited routing strategy employed in DeepSeek-V3 can be further optimized with hardware support for dynamic traffic deduplication. We also recognize emerging interconnect protocols such as those from the Ultra Ethernet Consortium (UEC) (Consortium, 2023, 2024) and Ultra Accelerator Link (UALink) (CONSORTIUM, 2025), both of which are poised to drive advancements in scal...
ntation. This would not only improve effective bandwidth but also reduce the computational complexity of network-specific operations.
(4) Hardware Synchronization Primitives: Provide fine-grained hardware synchronization instructions to handle memory consistency issues or out-of-order packet arrivals at the hardware level. This would eliminate the need for software-based synchronization mechanisms like RDMA completion events, which introduce extra latency and increase programming complexity. Memory-semantic communication with an acquire/release mechanism is a promising implementation. By imple...
5.1. Network Co-Design: Multi-Plane Fat-Tree
During the training of DeepSeek-V3, we deployed a Multi-Plane Fat-Tree (MPFT) scale-out network, as shown in Figure 3. Each node is equipped with eight GPUs and eight IB NICs, with each GPU–NIC pair assigned to a distinct network plane. Additionally, each node has a 400 Gbps Ethernet RoCE NIC connected to a separate storage network plane for accessing the 3FS (DeepSeek-AI, 2025c) distributed file system. In the scale-out network, we used 64-port 400G IB switches, enabling the topology to support, in theory, up to 16,384 GPUs while retaining the cost and latency advantages of a two-layer netw...
5. Large Scale Network Driven Design
Overall, the multi-plane architecture offers significant advantages in fault isolation, robustness, load balancing, and large-scale system scalability.
Figure 4. Ideal Multi-Plane Network: Each NIC is equipped with multiple physical ports, each connected to a distinct network plane. A single queue pair (QP) can simultaneously utilize all available ports for transmitting and receiving packets, which necessitates native support for out-of-order placement within the NIC.
5.1.1. Advantages of Multi-Plane Fat-Tree Network
• Subset of Multi-Rail Fat-Tree (MRFT): The MPFT topology constitutes a specific subset of the broader MRFT architecture. As...
very is possible.
Table 3. Network topology comparison. Cost estimates are derived from the methodology in the Slim Fly (SF) paper (Blach et al., 2025). DF denotes the canonical dragonfly topology (De Sensi et al., 2020; Kim et al., 2008; Rahman et al., 2019).
Metric                FT2      MPFT      FT3        SF        DF
Endpoints             2,048    16,384    65,536     32,928    261,632
Switches              96       768       5,120      1,568     16,352
Links                 2,048    16,384    131,072    32,928    384,272
Cost [M$]             9        72        491        146       1,522
Cost/Endpoint [k$]    4.39     4.39      7.5        4.4       5.8
It is important to note that, due to current 400G NDR InfiniBand limitations, cross-plane communication requires intra-node forwarding, which introduces additio...
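The FT2 and MPFT columns of Table 3, and the 16,384-GPU figure quoted for 64-port switches, can be cross-checked with a few lines of arithmetic. The sketch below assumes the standard non-blocking two-layer fat-tree construction (each leaf switch splits its ports evenly between endpoints and spines) and eight independent planes; the function and variable names are illustrative and do not come from the paper.

```python
# Cross-check of Table 3 (illustrative names; not the authors' tooling).

def two_layer_fat_tree(ports: int) -> dict:
    """Sizes of a non-blocking two-layer fat-tree built from `ports`-port switches.

    Assumption: every leaf uses half its ports for endpoints and half for
    uplinks, so a spine with `ports` ports can reach `ports` leaves.
    """
    leaves = ports
    spines = ports // 2
    endpoints = leaves * (ports // 2)
    return {"endpoints": endpoints, "switches": leaves + spines}


def multi_plane(ports: int, planes: int) -> dict:
    """A multi-plane network modeled as `planes` independent two-layer fat-trees."""
    one_plane = two_layer_fat_tree(ports)
    return {key: value * planes for key, value in one_plane.items()}


# (endpoints, cost in M$, cost per endpoint in k$) copied from Table 3.
TABLE3 = {
    "FT2": (2_048, 9, 4.39),
    "MPFT": (16_384, 72, 4.39),
    "FT3": (65_536, 491, 7.5),
    "SF": (32_928, 146, 4.4),
    "DF": (261_632, 1_522, 5.8),
}


def cost_per_endpoint_kusd(endpoints: int, cost_musd: float) -> float:
    """Convert a total cost in M$ into a per-endpoint cost in k$."""
    return cost_musd * 1e6 / endpoints / 1e3


ft2 = two_layer_fat_tree(64)
mpft = multi_plane(64, planes=8)

if __name__ == "__main__":
    print("FT2 :", ft2)
    print("MPFT:", mpft)
    for name, (endpoints, cost, _) in TABLE3.items():
        print(f"{name:5s} {cost_per_endpoint_kusd(endpoints, cost):.2f} k$/endpoint")
```

Under these assumptions, the eight-plane network has exactly eight times the endpoints and switches of a single two-layer fat-tree, which is why MPFT keeps the per-endpoint cost of FT2 while scaling to the endpoint count listed in the table.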
r latency, making it the preferred choice for latency-sensitive workloads such as distributed training and inference. Although IB has superior latency performance compared to RDMA over Converged Ethernet (RoCE), it comes with certain limitations:
• Cost: IB hardware is significantly more expensive than RoCE solutions, which limits its widespread adoption.
• Scalability: IB switches typically support only 64 ports per switch, compared to the 128 ports commonly found in RoCE switches. This restricts the scalability of IB-based clusters, particularly for large-scale deployments.
5.2.2. Recommenda...
gnificantly enhance network performance by dynamically spraying packets across multiple paths. While static routing, based on manually configured route tables, can avoid link conflicts for specific destinations, it lacks flexibility. For large-scale all-to-all communication, adaptive routing offers superior performance and scalability.
(3) Improved Traffic Isolation or Congestion Control Mechanisms: Current RoCE switches support only a limited number of priority queues, which are insufficient for complex AI workloads involving concurrent communication patterns such as EP’s all-to-all and DP’s al...
e GPU has prepared the data, it must notify the CPU proxy, which then populates the control information for the work request (WR) and signals the NIC via a doorbell mechanism to initiate data transmission. This process introduces additional communication overhead. IBGDA addresses this issue by allowing the GPU to directly fill the WR content and write to the RDMA doorbell MMIO address. By managing the entire control plane within the GPU, IBGDA eliminates the significant latency overhead associated with GPU-CPU communication. Moreover, when sending a large number of small packets, the control p...
xceeding 40 GB/s in a multi-plane network, providing reliable performance that meets the demands of training.
Figure 5. NCCL all-to-all performance from 32 to 128 GPUs for MRFT and MPFT networks.
2. Training Throughput for DeepSeek-V3 Model: We also compare the training metrics of the DeepSeek-V3 model between MPFT and MRFT in Table 4. MFU (Model FLOPs Utilization) is calculated based on BF16 peak performance. Causal MFU only takes into account the FLOPs of the lower triangle of the attention matrix (in line with FlashAttention (Dao et al., 2022; Dao, 2023)), while non-causal MFU includes...
In our model inference, large-scale EP relies heavily on all-to-all communication, which is highly sensitive to both bandwidth and latency. Consider the typical scenario discussed in Section 2.3.2: with a network bandwidth of 50 GB/s, the data transfer should ideally take approximately 120 μs. Intrinsic network latencies on the order of microseconds are therefore non-negligible and can critically impact system performance.
5.2.1. IB or RoCE
As shown in Table 5, IB consistently achieves lower latency, making it the preferred choice f...
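To make the 120 μs estimate from the opening of this section concrete, the sketch below works backwards from the two numbers quoted there (50 GB/s and 120 μs) to the implied transfer size, then shows what share of the total time a fixed per-transfer latency would take. The latency values in the loop are hypothetical placeholders chosen for illustration, not measurements from the paper, and the function names are likewise assumptions.

```python
# Back-of-the-envelope latency budget (illustrative; latency values are hypothetical).

def transfer_time_us(payload_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in microseconds at the given bandwidth (1 GB = 1e9 bytes)."""
    return payload_bytes / (bandwidth_gb_per_s * 1e9) * 1e6


def implied_payload_bytes(time_us: float, bandwidth_gb_per_s: float) -> float:
    """Payload size that would take `time_us` microseconds at the given bandwidth."""
    return time_us * 1e-6 * bandwidth_gb_per_s * 1e9


def latency_share(latency_us: float, transfer_us: float) -> float:
    """Fraction of the total time spent on fixed latency rather than data transfer."""
    return latency_us / (latency_us + transfer_us)


payload = implied_payload_bytes(120, 50)   # bytes moved in 120 us at 50 GB/s
ideal_us = transfer_time_us(payload, 50)   # recovers the 120 us figure

if __name__ == "__main__":
    print(f"implied payload: {payload / 1e6:.1f} MB, ideal time: {ideal_us:.1f} us")
    for latency_us in (1, 3, 5):           # hypothetical end-to-end latencies
        share = latency_share(latency_us, ideal_us)
        print(f"{latency_us} us latency -> {share:.1%} of total time")
```

Because the ideal transfer is itself only on the order of a hundred microseconds, every additional microsecond of network latency takes a visible percentage of the communication step, which is the sensitivity the text describes.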
are looking forward to continuing innovation in this direction.
(2) Optimized Route Policy: As shown in Figure 8, the default Equal-Cost Multi-Path (ECMP) routing policy in RoCE struggles to distribute traffic efficiently across interconnects, leading to severe congestion and performance degradation in NCCL collective communication tests. LLM training traffic, such as in DP (Data Parallelism), tends to lack randomness, causing multiple flows to converge on the same interconnect link. In contrast, Adaptive Routing (AR) (Geoffray and Hoefler, 2008) can significantly enhance network performance by ...
6. Discussion and Insights for Future Hardware Architecture Design
Building on the previous sections, we summarize key architectural insights and outline future directions for hardware design tailored to large-scale AI workloads. Section 2.3.2 highlighted the importance of large-scale scale-up networks for accelerating model inference. Section 3 discussed the necessity of efficient support for low-precision computation and communication. Section 4 explored the convergence of scale-up and scale-out architectures, along with several proposed enhancements. Section 5 focused on multi-plane network topologies and identified key improvements needed for Ethernet-bas...
nsufficient for ensuring system-wide robustness.
6.1.2. Suggestions for Advanced Error Detection and Correction
To mitigate risks associated with silent corruption, hardware must incorporate advanced error detection mechanisms beyond traditional ECC. Techniques such as checksum-based validation or hardware-accelerated redundancy checks can provide higher reliability for large-scale deployments. Furthermore, hardware vendors should deliver comprehensive diagnostic toolkits to end users, empowering them to rigorously verify the integrity of their systems and proactively identify any latent silen...
ove 4 GHz. Furthermore, modern AI workloads require sufficient CPU cores per GPU to prevent control-side bottlenecks. For chiplet-based architectures, additional cores are needed to support cache-aware workload partitioning and isolation.
6.3. Toward Intelligent Networks for AI
To meet the demands of latency-sensitive workloads, future interconnects must prioritize both low latency and intelligent networks:
• Co-Packaged Optics: Incorporating silicon photonics enables scalable, higher bandwidth and enhanced energy efficiency, both of which are critical for large-scale distributed systems.
• ...
traffic prioritization. For example, inference tasks should be isolated from training traffic in unified clusters, ensuring responsiveness for latency-sensitive applications.
emantic operations but also message-semantic RDMA primitives, thus broadening its practical applicability.
6.5. In-Network Computation and Compression
EP involves two critical all-to-all stages, dispatch and combine, that present significant opportunities for in-network optimization. The dispatch stage resembles a small-scale multicast operation, where a single message must be forwarded to multiple target devices. A hardware-level protocol enabling automatic packet replication and forwarding to multiple destinations could drastically reduce communication overhead and improve efficiency. The co...
cal bottleneck. Architectures such as SeDRAM (Wang et al., 2023) exemplify the potential of this approach, delivering unprecedented performance for memory-bound workloads.
• System-on-Wafer (SoW): Wafer-scale integration (Lie, 2022) can maximize computational density and memory bandwidth, addressing the needs of ultra-large-scale models.
6.1.1. Limitations:
• Interconnect Failures: High-performance interconnects (e.g., IB and NVLink) are prone to intermittent disconnections, which can disrupt node-to-node communication. This is especially harmful in communication-heavy workloads like EP, where even brief interruptions may lead to significant performance drops or job failures.
• Single Hardware Failures: Node crashes, GPU failures, or ECC (Error-Correcting Code) memory errors can compromise long-running training jobs, often requiring costly restarts. The impact of such failures escalates in large-scale deployments, where the pr...
While accelerator design often takes center stage, CPUs remain essential for coordinating computation, managing I/O, and sustaining system throughput. However, current architectures face several critical bottlenecks: First, as discussed in Section 4.5, the PCIe interface between CPUs and GPUs often becomes a bandwidth bottleneck, particularly during large-scale parameter, gradient, or KV cache transfers. To mitigate this, future systems should adopt direct CPU–GPU interconnects (such as NVLink or Infinity Fabric) or integrate both CPUs and GPUs into the scale-up domain, thereby eliminating intr...
To meet the demands of latency-sensitive workloads, future interconnects must prioritize both low latency and intelligent networks:
• Co-Packaged Optics: Incorporating silicon photonics enables scalable, higher bandwidth and enhanced energy efficiency, both of which are critical for large-scale distributed systems.
• Lossless Network: Credit-Based Flow Control (CBFC) mechanisms ensure lossless data transmission, yet naively triggering flow control can induce severe head-of-line blocking. Therefore, it is imperative to deploy advanced, endpoint-driven congestion control (CC) algorithms that...
6.4. Discussion on Memory-Semantic Communication and Ordering Issue
Inter-node communication using load/store memory semantics is efficient and programmer-friendly, but current implementations are hampered by memory ordering challenges. For example, after writing data, the sender must issue an explicit memory barrier (fence) before updating a flag to notify the receiver, ensuring data consistency. This strict ordering introduces additional round-trip time (RTT) latency and can stall the issuing thread, impeding inflight stores and reducing throughput. Similar out-of-order synchronization issues arise in message-semantic RDMA; for instance, performing RDMA atom...
EP involves two critical all-to-all stages, dispatch and combine, that present significant opportunities for in-network optimization. The dispatch stage resembles a small-scale multicast operation, where a single message must be forwarded to multiple target devices. A hardware-level protocol enabling automatic packet replication and forwarding to multiple destinations could drastically reduce communication overhead and improve efficiency. The combine stage, acting as a small-scale reduction operation, could benefit from in-network aggregation techniques. However, due to the small reduction sco...
6.6.1. Limitations of Memory Bandwidth
The exponential growth in model sizes has outpaced advancements in high-bandwidth memory (HBM) technology. This disparity creates a memory bottleneck, particularly in attention-heavy architectures like Transformers.
6.6.2. Suggestions:
• DRAM-Stacked Accelerators: Leveraging advanced 3D stacking technologies, DRAM dies can be vertically integrated atop a logic die, thereby enabling exceptionally high memory bandwidth, ultra-low latency, and a practical memory capacity (though stack-limited). This architectural paradigm proves remarkably advantageous for u...
DeepSeek-V3 exemplifies the transformative potential of hardware-software co-design in advancing the scalability, efficiency, and robustness of large-scale AI systems. By addressing the limitations of current hardware architectures and proposing actionable recommendations, this paper provides a roadmap for the next generation of AI-optimized hardware. These innovations will be critical as AI workloads continue to grow in complexity and scale, driving the future of intelligent systems.