Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y.X. Wei
DeepSeek-AI, Beijing, China (2025)

Abstract. The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale.
International Symposium on Computer Architecture (ISCA '25), June 21–25, 2025, Tokyo, Japan
doi: 10.1145/3695053.3731412
isbn: 979-8-4007-1261-6/2025/06
CCS: Computer systems organization, Architectures
Authors are listed in alphabetical order of their first names. Yuqing Wang and Liyue Zhang are the corresponding authors of this paper (e-mail: research@deepseek.com).
1. Introduction
1.1. Background
Large Language Models (LLMs) have undergone rapid evolution in recent years, driven by iterative advancements in model design, computational power, and data availability. In 2024, groundbreaking models such as GPT4o (OpenAI, 2024a), LLaMa-3 (AI@Meta, 2024a), Claude 3.5 Sonnet (Anthropic, 2024), Grok-2 (xAI, 2024a), Qwen2.5 (Yang et al., 2024), Gemini-2 (Google, 2024), and our DeepSeek-V3 (DeepSeek-AI, 2024d) have showcased remarkable progress, further narrowing the gap towards Artificial General Intelligence (AGI). As the Scaling Laws (Kaplan et al., 2020) show, increasing model size, training data, and computational resources yields substantial gains in model capability. To meet these challenges, industry leaders such as Alibaba, ByteDance, Google, xAI, and Meta have deployed colossal training clusters (Jouppi et al., 2023; Mudigere et al., 2023; Gangidi et al., 2024; Jiang et al., 2024; Qian et al., 2024; xAI, 2024b), featuring tens or even hundreds of thousands of GPUs or TPUs.
While such massive infrastructures have enabled the development of state-of-the-art models, their exorbitant costs present significant barriers for smaller research teams and organizations. Despite these barriers, open-source startups such as DeepSeek (DeepSeek-AI, 2024b, c, d, a, 2025a) and Mistral (Jiang et al., 2023; Mistral, 2024) are also striving to develop state-of-the-art models. Among them, DeepSeek has especially demonstrated that effective software-hardware co-design can enable cost-efficient training of large models, leveling the playing field for smaller teams. Building on this tradition, DeepSeek-V3 (DeepSeek-AI, 2024d) represents a new milestone in cost-effective training, having been trained on just 2,048 NVIDIA H800 GPUs.
1.2. Objectives
This paper does not aim to reiterate the detailed architectural and algorithmic specifics of DeepSeek-V3, which are extensively documented in its technical report (DeepSeek-AI, 2024d). Instead, it adopts a dual perspective, spanning hardware architecture and model design, to explore the intricate interplay between them in achieving cost-efficient large-scale training and inference. By examining this synergy, we aim to provide actionable insights for scaling LLMs efficiently without sacrificing performance or accessibility. Specifically, the paper focuses on:
• Hardware-Driven Model Design: Analyze how hardware features, such as FP8 low-precision computation and scale-up/scale-out network properties, informed the architectural choices in DeepSeek-V3.
• Mutual Dependencies Between Hardware and Models: Investigate how hardware capabilities shape model innovation and how the evolving demands of LLMs drive the need for next-generation hardware.
• Future Directions for Hardware Development: Derive actionable insights from DeepSeek-V3 to guide the co-design of future hardware and model architectures, paving the way for scalable, cost-efficient AI systems.
1.3. Structure of this Paper
The remainder of this paper is organized as follows. Section 2 explores the design principles underpinning the DeepSeek-V3 model architecture, highlighting key innovations such as Multi-head Latent Attention, Mixture-of-Experts optimizations, and the Multi-Token Prediction module. Section 3 illustrates how our model architecture pursues low-precision computation and communication. Section 4 covers scale-up interconnection optimizations, discusses scale-up/scale-out convergence, and explores how hardware features influence parallelism and expert-selection strategies. Section 5 focuses on scale-out networking, covering the multi-plane network topology and low-latency interconnects. Section 6 distills these experiences into insights and future directions for hardware architecture design.
2. Design Principles for DeepSeek Models
The development of DeepSeek-V3 exemplifies a hardware-aware approach to scaling LLMs, where each design decision was carefully aligned with hardware constraints to optimize performance and cost efficiency. As shown in Figure 1, DeepSeek-V3 employs the DeepSeekMoE (DeepSeek-AI, 2024e) and Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) architectures, both proven effective in DeepSeek-V2 (DeepSeek-AI, 2024c). DeepSeekMoE unlocks the potential of the MoE architecture, while MLA drastically reduces memory consumption by compressing Key-Value (KV) caches. In addition, DeepSeek-V3 incorporates a Multi-Token Prediction (MTP) module (Section 2.3.3) to accelerate inference. This section examines each of these design principles in turn.
2.1. Memory Efficiency
LLMs generally require significant memory resources, with memory demands increasing by more than 1000% per year. In contrast, the growth rate of high-speed memory (e.g., HBM) capacity is much slower, typically less than 50% per year (Gholami et al., 2024). While multi-node parallelism is a viable solution to address memory limitations, optimizing memory usage at the source remains a crucial and effective strategy.

2.1.1. Low-Precision Models
Compared to models that store weights in BF16, FP8 halves memory consumption, effectively alleviating the AI memory wall challenge. A detailed discussion of low-precision techniques is provided in Section 3, Low-Precision Driven Design.

2.1.2. Reducing KV Cache with MLA
For LLM inference, user requests often involve multi-turn conversations. To handle these efficiently, the context from previous requests is kept in what is commonly referred to as the KV cache, which stores the Key and Value vectors of previously processed tokens and eliminates the need to recompute them for subsequent tokens. During each inference step, the model then computes Key and Value vectors only for the current token and attends over the cached history. MLA shrinks this cache by compressing the KV representations of all attention heads into a smaller latent vector via a projection matrix trained with the model. During inference, only the latent vector needs to be cached, significantly reducing memory consumption compared to storing the KV cache for all attention heads.

In addition to MLA, several other approaches have been proposed to reduce the size of the KV cache. These methods are highly valuable and provide significant inspiration for advancements in memory-efficient attention mechanisms:
• Shared KV: Instead of maintaining separate KV pairs for each attention head, multiple heads share a single set of KV pairs, significantly compressing KV storage. Representative methods include GQA (Ainslie et al., 2023) and MQA (Shazeer, 2019).
• Windowed KV: For long sequences, only a sliding window of KV pairs is retained in the cache, and older entries are discarded. While this reduces storage, it compromises long-context reasoning. Representative methods include Longformer (Beltagy et al., 2020) and related architectures.
• Quantized Compression: KV pairs are stored using low-bit representations (Hooper et al., 2024; Liu et al., 2024; Kang et al., 2024), further reducing memory usage at a modest cost in precision.
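To put rough numbers on the latent-caching argument above, the sketch below compares per-token KV-cache sizes for standard multi-head attention against caching a single compressed latent vector, as MLA does. The head count, head dimension, and latent size are illustrative assumptions for this sketch, not DeepSeek-V3's actual configuration.

    # Rough per-token KV-cache sizing: per-head KV vs. one compressed latent.
    # All parameters below are illustrative assumptions, not real model config.

    def kv_bytes_per_token_mha(n_heads: int, head_dim: int, elem_bytes: int) -> int:
        # Standard attention caches a Key and a Value vector for every head.
        return 2 * n_heads * head_dim * elem_bytes

    def kv_bytes_per_token_latent(latent_dim: int, elem_bytes: int) -> int:
        # MLA-style caching stores only one compressed latent vector per token.
        return latent_dim * elem_bytes

    if __name__ == "__main__":
        mha = kv_bytes_per_token_mha(n_heads=128, head_dim=128, elem_bytes=2)  # BF16
        mla = kv_bytes_per_token_latent(latent_dim=512, elem_bytes=2)
        print(f"per-head KV cache: {mha / 1024:.1f} KiB/token")
        print(f"latent cache     : {mla / 1024:.1f} KiB/token ({mha / mla:.0f}x smaller)")

Under these assumed dimensions the latent cache is 64x smaller per token; the real savings depend on the model's actual head and latent sizes.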
2.1.3. Future Directions and Perspectives on Resource-Efficient Techniques
While reducing the size of the KV cache is a promising route to memory efficiency, the quadratic complexity inherent in Transformer-based autoregressive decoding remains a formidable challenge, especially for extremely long contexts. Recent research efforts such as Mamba-2 (Dao and Gu, 2024) and Lightning Attention (Qin et al., 2024) investigate linear-time alternatives that offer new possibilities for balancing computational cost and model performance. In addition, approaches such as sparse attention (Yuan et al., 2025), which seek to compress and sparsely activate attention keys and values, represent another attempt at overcoming the computational challenges associated with attention. We look forward to collaborative progress with the broader community toward breakthroughs in this area.
2.2. Cost-Effectiveness of MoE Models
For sparse computation, we developed DeepSeekMoE, an advanced Mixture of Experts (MoE) architecture, illustrated in the lower right part of Figure 1. The advantages of MoE models are twofold.

2.2.1. Reducing Computational Requirements for Training
The primary advantage of the MoE architecture lies in its ability to significantly reduce training costs. By selectively activating only a subset of expert parameters, MoE models allow the total parameter count to scale up dramatically while keeping computational requirements modest. For example, DeepSeek-V2 features 236B parameters, yet only 21B are activated per token; DeepSeek-V3 likewise expands to 671B total parameters while activating just 37B per token. This keeps the per-token training cost of such large MoE models well below that of dense models of comparable capability, as Table 2 summarizes.

Table 2. Comparison of computational costs for training MoE and dense models (computational cost per token).
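As a rough illustration of why activating only a subset of parameters cuts training cost, the sketch below applies the common estimate of about 6 FLOPs per activated parameter per training token. Both the constant 6 and the dense comparison size are assumptions for illustration, not the accounting used in Table 2.

    # Rule-of-thumb training cost: ~6 FLOPs per *activated* parameter per token.
    # The constant 6 and the dense comparison size are illustrative assumptions.

    def train_gflops_per_token(activated_params_billions: float) -> float:
        # Parameters are given in billions, so the result is in GFLOPs.
        return 6 * activated_params_billions

    for name, active_b in [("DeepSeek-V2 (21B active of 236B)", 21),
                           ("DeepSeek-V3 (37B active of 671B)", 37),
                           ("Hypothetical 405B dense model   ", 405)]:
        print(f"{name}: ~{train_gflops_per_token(active_b):6.0f} GFLOPs/token")

The estimate makes the scaling behavior visible: the MoE models' per-token cost tracks the activated parameters, roughly an order of magnitude below a dense model of similar total size.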
2.2.2. Advantages in Single-Request Scenarios
MoE models also offer unique advantages in single-request scenarios. Because only a subset of parameters is activated per request, memory and computational demands are greatly reduced. For example, DeepSeek-V2 (236B parameters) activates just 21B parameters during inference. This enables PCs with AI SoC chips (Apple, 2024; NVIDIA, 2025; AMD, 2025) to achieve nearly 20 tokens per second (TPS), or even twice that speed, which is more than sufficient for personal use. In contrast, dense models of similar capability (e.g., 70B parameters) typically reach only single-digit TPS on similar hardware. Notably, the increasingly popular KTransformers (group and Approaching.AI, 2025) inference engine allows the complete DeepSeek-V3 model to run on a low-cost server equipped with a consumer GPU (costing approximately $10,000) while still achieving nearly 20 TPS. This efficiency makes MoE architectures well suited to local deployment and individual users.
2.3. Increasing Inference Speed
2.3.1. Overlapping Computation and Communication: Maximizing Throughput
Inference speed encompasses both system-wide maximum throughput and single-request latency. To maximize throughput, the model is architected from the outset to leverage dual micro-batch overlap (DeepSeek-AI, 2025d; Zhao et al., 2025b), intentionally hiding communication latency behind computation. As demonstrated in our online inference system and supported by open-source profiling data (DeepSeek-AI, 2025d), we decouple the computation of MLA and MoE into two distinct stages: while one micro-batch executes a portion of the MLA or MoE computation, the other simultaneously performs the corresponding communication step. This pipelined approach enables seamless overlap of all-to-all communication with ongoing computation, ensuring that the GPU remains fully utilized at all times. Moreover, in production we adopt a prefill and decode disaggregation architecture (Zhong et al., 2024), assigning large-batch prefill and latency-sensitive decode requests to different expert-parallelism group sizes. This strategy ultimately maximizes system throughput under real-world service conditions.
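As a toy illustration of why two phase-shifted micro-batches keep the GPU busy, the sketch below compares a serial schedule against an overlapped one. The stage count and the equal compute/communication durations are arbitrary assumptions, not measured DeepSeek-V3 numbers.

    # Toy timeline for dual micro-batch overlap. Durations are arbitrary;
    # the point is the scheduling pattern, in which one micro-batch
    # communicates while the other computes.
    COMPUTE_MS, COMM_MS, STAGES = 1.0, 1.0, 6

    def serial_two_batches() -> float:
        # Each stage runs compute then communication, batches back to back.
        return 2 * STAGES * (COMPUTE_MS + COMM_MS)

    def overlapped_two_batches() -> float:
        # Phase-shifted by one slot: at any moment one batch computes while
        # the other communicates, so each slot costs max(compute, comm).
        slot = max(COMPUTE_MS, COMM_MS)
        return (2 * STAGES + 1) * slot  # +1 slot to fill the pipeline

    print(f"serial    : {serial_two_batches():.1f} ms")
    print(f"overlapped: {overlapped_two_batches():.1f} ms")  # approaches 2x better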
2.3.2. Inference Speed Limits Imposed by Interconnection Bandwidth
For MoE models, achieving high inference speed relies on efficiently deploying expert parameters across computing devices. To achieve the fastest possible inference speed, each device should ideally perform the computations for a single expert (or multiple devices should collaboratively compute a single expert if necessary). However, Expert Parallelism (EP) requires routing tokens to the appropriate devices, which involves all-to-all communication across the network. As a result, the upper limit of MoE inference speed is dictated by interconnection bandwidth. Consider an idealized setting in which each device processes an equal batch size during expert parallelism, allowing the communication time to be easily calculated. For a system interconnected with CX7 400 Gbps InfiniBand (IB) NICs, the time required for the two all-to-all communications (dispatch and combine) in EP is:

Comm. Time = (1 Byte + 2 Bytes) × 32 × 9 × 7K / 50 GB/s = 120.96 μs

Here, dispatch uses FP8 (1 byte per element) while combine uses BF16 (2 bytes per element), and the hidden size of each token is approximately 7K.
Assuming computation is fully hidden behind this communication via dual micro-batch overlap, the total time per layer can be formulated as:

Total Time Per Layer = 2 × 120.96 μs = 241.92 μs

With 61 layers in DeepSeek-V3, the total inference time is:

Total Inference Time = 61 × 241.92 μs = 14.76 ms

Thus, the theoretical upper limit for this system is approximately 14.76 ms TPOT (time per output token), equivalent to 67 tokens per second. In practice, factors such as communication overhead, latency, and incomplete bandwidth utilization reduce this figure further. In contrast, if the full 900 GB/s of NVLink-class scale-up bandwidth (the unrestricted Hopper figure; see Section 4.1) could be applied to EP communication, the communication time per EP step drops to:

Comm. Time = (1 Byte + 2 Bytes) × 32 × 9 × 7K / 900 GB/s = 6.72 μs

Assuming the computation time is equal to the communication time, this reduces the total inference time significantly, enabling a theoretical upper limit of roughly 0.82 ms TPOT, approximately 1200 tokens per second. While this figure is purely theoretical and has not been empirically validated, it vividly illustrates the transformative potential of high-bandwidth scale-up networks for accelerating inference.
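The sketch below reproduces the arithmetic above end to end. The interpretation of the constants (32 tokens per device, 9 target experts per token) is our reading of the formula, not something spelled out in the source.

    # Reproduces the EP all-to-all timing estimates above. The meaning of the
    # constants (32 tokens/device, 9 experts/token) is an assumed reading.

    def ep_comm_time_us(bw_gb_s: float, tokens=32, experts=9, hidden=7_000,
                        dispatch_bytes=1, combine_bytes=2) -> float:
        total_bytes = (dispatch_bytes + combine_bytes) * tokens * experts * hidden
        return total_bytes / (bw_gb_s * 1e9) * 1e6  # seconds -> microseconds

    def tpot_ms(comm_us: float, layers=61) -> float:
        # Dual micro-batch overlap with compute assumed equal to comm:
        # each layer costs two communication intervals.
        return layers * 2 * comm_us / 1000

    for bw in (50, 900):  # GB/s: effective IB NIC vs. NVLink-class bandwidth
        t = ep_comm_time_us(bw)
        print(f"{bw:3d} GB/s: comm {t:7.2f} us, TPOT {tpot_ms(t):5.2f} ms, "
              f"~{int(1000 // tpot_ms(t))} tok/s")

Running it prints 120.96 μs / 14.76 ms / ~67 tok/s for the IB case and 6.72 μs / 0.82 ms / ~1219 tok/s for the 900 GB/s case, matching the figures in the text.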
2.3.3. Multi-Token Prediction
Autoregressive decoding produces one token per step, creating sequential bottlenecks. MTP mitigates this issue by enabling the model to generate additional candidate tokens at a lower cost and verify them in parallel, similar to previous self-drafting-based speculative decoding approaches (Cai et al., 2024; Li et al., 2024). This framework significantly accelerates inference without compromising accuracy. As illustrated in the top part of Figure 1, each MTP module uses a single layer, much more lightweight than the full model, to predict additional tokens, enabling parallel verification of multiple candidate tokens. Although the extra predictions are slightly less reliable than the main model's output, in practice they are accepted often enough that generation TPS increases by 1.8x compared to the scenario without the MTP module.
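The sketch below shows the generic draft-and-verify loop that MTP-style speculative decoding follows. Here draft_next and verify_batch are hypothetical stand-ins for a lightweight MTP head and the full model, not DeepSeek-V3 APIs.

    # Schematic draft-and-verify decoding step (MTP-style speculation).
    # draft_next() and verify_batch() are hypothetical stand-ins, not real APIs.
    from typing import Callable, List

    def speculative_step(ctx: List[int],
                         draft_next: Callable[[List[int]], int],
                         verify_batch: Callable[[List[int], List[int]], List[int]],
                         n_draft: int = 1) -> List[int]:
        # 1) A lightweight module drafts extra candidate tokens.
        draft: List[int] = []
        for _ in range(n_draft):
            draft.append(draft_next(ctx + draft))
        # 2) The full model scores context + draft in ONE parallel pass,
        #    returning its own prediction for each of the n_draft+1 positions.
        preds = verify_batch(ctx, draft)
        # 3) Accept drafted tokens while they match the full model; the first
        #    full-model prediction after the matching prefix is always kept.
        accepted: List[int] = []
        for i, tok in enumerate(draft):
            if preds[i] == tok:
                accepted.append(tok)
            else:
                break
        accepted.append(preds[len(accepted)])
        return ctx + accepted  # always >= 1 new token, up to n_draft + 1

Each step emits at least one token (the full model's own prediction), so accuracy is preserved; accepted drafts are pure speedup.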
Moreover, by predicting multiple tokens per step, MTP increases the effective inference batch size, which is crucial for boosting EP computational intensity and hardware utilization. Such algorithmic innovations are vital for fast and cost-effective inference in DeepSeek-V3.

2.3.4. High Inference Speed for Reasoning Models and Test-Time Scaling
Test-time scaling in LLMs, exemplified by OpenAI's o1/o3 series (OpenAI, 2024b, 2025), has enabled significant advances in mathematical reasoning, programming, and related domains by spending more computation at inference time. For reasoning models, and for the reinforcement-learning workflows used to train them, the necessity to rapidly generate large numbers of samples makes inference throughput a critical bottleneck. Likewise, prolonged reasoning sequences increase user wait times, reducing the practical usability of such models. As a result, optimizing inference speed through synergistic hardware and software innovations is indispensable for advancing the efficiency of reasoning models. Effective strategies for accelerating inference and expediting RL training remain active areas of investigation, as discussed in Section 2.1.3, and we encourage the broader community to collaborate on them.
2.4. Technique Validation Methodology
Each acceleration technique, including MLA, FP8 mixed-precision computation, and network co-designed MoE gate routing, underwent rigorous empirical validation of its accuracy impact. Given the prohibitive cost of exhaustive ablation on full-scale models, we adopted a hierarchical and resource-efficient validation pipeline: each technique is first validated extensively on small-scale models, followed by minimal large-scale tuning, and finally integrated in a single, comprehensive training run. For instance, we first conducted fine-grained FP8 training ablation studies on both 16B and 230B DeepSeek-V2 models before final integration; under these controlled settings, the relative accuracy loss compared to the BF16 baseline remained within an acceptable range.
3. Low-Precision Driven Design
3.1. FP8 Mixed-Precision Training
Quantization techniques such as GPTQ (Frantar et al., 2022) and AWQ (Lin et al., 2024) have been widely used to reduce bit-widths to 8-bit, 4-bit, or even lower, significantly reducing memory requirements. However, these techniques are primarily applied during inference to save memory, rather than during training. NVIDIA's Transformer Engine has supported FP8 mixed-precision training for some time, but prior to DeepSeek-V3 there were no open-source large models leveraging FP8 for training. Through deep collaboration between our infrastructure and algorithm teams, and after extensive experimentation, FP8 mixed-precision training was successfully applied to the DeepSeek-V3 training run.
3.1.1. Limitations
The limited accumulation precision of FP8 tensor cores, combined with fine-grained scaling, introduces a large dequantization overhead: partial results must be transported from Tensor Cores to CUDA Cores for scaling-factor multiplication. This incurs frequent data movements, reducing computational efficiency and complicating hardware utilization.

3.1.2. Suggestions
To address the limitations of existing hardware, we have the following suggestions for future designs:
• Increased Accumulation Precision: Hardware should improve the accumulation register precision to an appropriate value (e.g., FP32), or support a configurable accumulation precision, enabling a trade-off between performance and accuracy for the different requirements of training and inference workloads.
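To make the dequantization overhead concrete, this sketch emulates fine-grained quantization over 1×128 tiles with per-tile scaling factors. The "FP8" format is idealized (integer-like rounding clipped to the E4M3 maximum of 448, nonzero tiles assumed), so it is a schematic of the data flow rather than of Hopper's actual tensor-core pipeline.

    # Schematic of fine-grained (1x128 tile) quantization with per-tile scales.
    # The FP8 cast is idealized via rounding and a clip to the E4M3 max (448);
    # on real hardware, the scale multiplication in dequantize() is exactly
    # the step that currently moves partial results out to CUDA Cores.
    import numpy as np

    TILE = 128
    FP8_MAX = 448.0  # largest finite E4M3 value

    def quantize_tiles(x: np.ndarray):
        x = x.reshape(-1, TILE)
        scales = np.abs(x).max(axis=1, keepdims=True) / FP8_MAX  # nonzero tiles assumed
        q = np.clip(np.round(x / scales), -FP8_MAX, FP8_MAX)     # stand-in for FP8 cast
        return q, scales

    def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
        # Scale multiplication: the per-tile work that ideally would happen
        # inside the tensor core rather than via extra data movement.
        return (q * scales).reshape(-1)

    if __name__ == "__main__":
        x = np.random.randn(4 * TILE).astype(np.float32)
        q, s = quantize_tiles(x)
        print(f"max abs quantization error: {np.abs(dequantize(q, s) - x).max():.4f}")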
3.2. LogFMT: Communication Compression
In the current DeepSeek-V3 architecture, we employ low-precision compression for network communication. During EP, tokens are dispatched using fine-grained FP8 quantization, reducing communication volume by 50% compared to BF16 and significantly lowering communication time. While the combine stage still uses higher precision (e.g., BF16) due to accuracy requirements, we are actively testing FP8, custom precision formats (e.g., E5M6), and mixed FP8/BF16 schemes for further reductions.

Besides these traditional floating-point formats, we also tried a new data type named Logarithmic Floating-Point Format (LogFMT). LogFMT-nBit represents each value with n bits, the leading bit serving as the sign bit S. By mapping activations from the original linear space to the log space, the distribution of the activations becomes more uniform. Specifically, given a tile of elements [x1, ⋯, xm] (1×128 in our implementation), we take absolute values, compute the logarithm of every element, and find the minimum min = log(abs(x_i)) and maximum max = log(abs(x_j)). The minimum is encoded as S.00⋯0 and the maximum as S.11⋯1, with the remaining codes spaced uniformly in log space between these two endpoints.
When quantizing, it is important to round in the original linear space, instead of the log space, to obtain unbiased activation quantization. We also constrain min to be larger than max − log(2^32), which means the maximum representation range is similar to E5, a floating-point format with 5 exponent bits. We validated LogFMT-nBit on dense language models with around 7 billion parameters, quantizing the output of the residual branch to simulate the combine stage in MoE models. With n = 8, sharing the same bit-width as FP8, LogFMT-8Bit shows superior training accuracy to FP8 at equal width; increasing n to 10 bits makes it behave similarly to a BF16 combine stage.

3.2.1. Limitations
The initial purpose of LogFMT was to apply it to activations during transmission or near activation functions, as it offers higher precision than FP8 at the same bit-width. However, subsequent computations require reconversion to BF16 or FP8 to accommodate the data types of the Hopper GPU's tensor cores. Due to insufficient GPU bandwidth for log/exp operations and excessive register pressure during encode/decode, the overhead can be substantial if the encode and decode operations are fused with all-to-all communication.
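The scheme described above can be sketched compactly. This codec follows the stated recipe (sign bit, an (n-1)-bit uniform grid in log space between the tile's min and max log-magnitudes, rounding decided in linear space, and the max − log(2^32) range clamp), but edge-case handling is simplified and the exact bit packing is an assumption.

    # Schematic LogFMT-nBit codec for one tile, per the description above.
    # Bit packing and zero handling are simplified assumptions.
    import numpy as np

    def logfmt_encode(x: np.ndarray, n_bits: int = 8):
        levels = 2 ** (n_bits - 1) - 1          # codes per sign (n-1 bits)
        sign = np.sign(x)
        logs = np.log(np.abs(x) + 1e-30)        # avoid log(0)
        lo, hi = logs.min(), logs.max()
        lo = max(lo, hi - np.log(2.0 ** 32))    # clamp range, similar to E5
        step = (hi - lo) / levels
        # Candidate codes come from log space...
        k = np.clip((logs - lo) / step, 0, levels)
        k_dn, k_up = np.floor(k), np.ceil(k)
        # ...but the rounding choice is made by distance in LINEAR space,
        # which is what makes the quantization unbiased.
        lin_dn, lin_up = np.exp(lo + k_dn * step), np.exp(lo + k_up * step)
        codes = np.where(np.abs(x) - lin_dn <= lin_up - np.abs(x), k_dn, k_up)
        return sign, codes.astype(np.int32), lo, step

    def logfmt_decode(sign, codes, lo, step):
        return sign * np.exp(lo + codes * step)

    if __name__ == "__main__":
        tile = np.random.randn(128)
        rel = np.abs(logfmt_decode(*logfmt_encode(tile)) - tile) / np.abs(tile)
        print(f"median relative error: {np.median(rel):.4f}")

The log/exp calls in encode and decode are precisely the operations whose limited GPU throughput makes fusing this codec into all-to-all communication expensive, as noted in the limitations above.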
4. Interconnection Driven Design
4.1. Current Hardware Architecture
The NVIDIA H800 GPU SXM architecture we currently use, illustrated in Figure 2, is built on the Hopper architecture, similar to the H100 GPU. However, it features reduced FP64 computational performance and NVLink bandwidth for regulatory compliance. Specifically, the NVLink bandwidth in H800 SXM nodes is reduced from 900 GB/s to 400 GB/s. This significant reduction in intra-node scale-up bandwidth presents a challenge for high-performance workloads. To compensate, each node is equipped with eight 400G InfiniBand (IB) CX7 NICs, enhancing scale-out capabilities to mitigate the bandwidth deficit.
4.2. Hardware-Aware Parallelism
To align with the constraints of the H800 architecture, the following parallelism strategies were adopted to optimize DeepSeek-V3's performance:
• Avoidance of Tensor Parallelism (TP): TP is avoided during training due to its inefficiency under limited NVLink bandwidth. During inference, however, TP can still be used selectively to reduce latency and improve TPOT.
• Enhanced Pipeline Parallelism (PP): DualPipe (DeepSeek-AI, 2025b) is employed to overlap attention and MoE computation with MoE communication. This also reduces pipeline bubbles and balances memory usage across pipeline stages.
4.3. Model Co-Design: Node-Limited Routing
The bandwidth disparity between scale-up (intra-node) and scale-out (inter-node) communication in the H800 architecture is approximately 4:1. Specifically, NVLink provides 200 GB/s (of which about 160 GB/s is achievable in practice), while each 400 Gbps IB NIC delivers only 50 GB/s (we use 40 GB/s as the effective bandwidth, accounting for small message sizes and latency effects). To balance and fully utilize the higher intra-node bandwidth, the model architecture is co-designed with the hardware, particularly in the TopK Expert Selection Strategy. Consider a setup with 8 nodes (64 GPUs in total) and 256 routed experts, i.e., 32 experts per node: node-limited routing restricts the experts selected for each token to a small number of nodes (at most 4 in DeepSeek-V3), so that a token crosses the slower IB fabric at most once per target node and is then forwarded over the faster NVLink within that node.
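A minimal sketch of node-limited expert selection is shown below: expert scores are first aggregated per node, the top M nodes are chosen, and the top-k experts are then picked only within those nodes. The group-then-select shape follows the description above; the scoring rule and tensor layout are simplified, illustrative assumptions.

    # Minimal node-limited Top-K expert selection: restrict each token's
    # routed experts to at most M nodes, then take top-k among those nodes'
    # experts. Scoring and layout are simplified assumptions.
    import numpy as np

    def node_limited_topk(scores: np.ndarray, experts_per_node: int,
                          max_nodes: int = 4, k: int = 8) -> np.ndarray:
        # scores: (n_experts,) router affinities for one token.
        n_nodes = scores.size // experts_per_node
        per_node = scores.reshape(n_nodes, experts_per_node)
        # Rank nodes by the sum of their two best experts (one simple choice).
        node_score = np.sort(per_node, axis=1)[:, -2:].sum(axis=1)
        keep = np.argsort(node_score)[-max_nodes:]
        # Mask out experts on non-selected nodes, then take the global top-k.
        masked = np.full_like(scores, -np.inf)
        for node in keep:
            lo = node * experts_per_node
            masked[lo:lo + experts_per_node] = scores[lo:lo + experts_per_node]
        return np.argsort(masked)[-k:]  # indices of the k selected experts

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        s = rng.random(256)                          # 256 routed experts
        chosen = node_limited_topk(s, experts_per_node=32)
        print(sorted(set(chosen // 32)))             # <= 4 distinct nodes touched

Because every token's experts land on a bounded set of nodes, the token's payload needs to traverse IB only once per selected node, which is what lets the routing exploit the 4:1 NVLink/IB bandwidth ratio.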
4.4. Scale-Up and Scale-Out Convergence
4.4.1. Limitations of Current Implementations
While the node-limited routing strategy reduces communication bandwidth requirements, it complicates communication-pipeline kernel implementations due to the bandwidth disparity between intra-node (NVLink) and inter-node (IB) interconnects. In practice, GPU Streaming Multiprocessor (SM) threads are used both for network message handling (e.g., filling QPs and WQEs) and for data forwarding over NVLink, consuming computational resources. For example, during training, up to 20 of the SMs on the H800 GPU are allocated to communication-related operations, leaving fewer resources available for actual computation. To maximize throughput in online inference, we instead perform EP all-to-all communication entirely through NIC RDMA, avoiding SM resource contention and improving compute utilization.
4.4.2. Suggestions
We recommend converging scale-up and scale-out communication into a unified framework. By incorporating dedicated co-processors for network traffic management and seamless forwarding between NVLink and IB domains, such designs can reduce software complexity and maximize bandwidth utilization. For example, the node-limited routing strategies employed in DeepSeek-V3 could be further optimized with hardware support for dynamic traffic deduplication. We also recognize emerging interconnect protocols such as Ultra Ethernet Consortium (UEC) (Consortium, 2023, 2024) and Ultra Accelerator Link (UALink) (CONSORTIUM, 2025), both poised to drive advancements in scale-up and scale-out communication; more recently, Unified Bus (UB) (Liao et al., 2025) has introduced a novel approach to scale-up and scale-out convergence. Section 6 further explores several technical innovations proposed in this direction.
Such hardware support would not only improve effective bandwidth but also reduce the computational complexity of network-specific operations.
• Hardware Synchronization Primitives: Provide fine-grained hardware synchronization instructions to handle memory-consistency issues or out-of-order packet arrivals at the hardware level. This would eliminate the need for software-based synchronization mechanisms such as RDMA completion events, which introduce extra latency and increase programming complexity. Memory-semantic communication with an acquire/release mechanism is a promising implementation.

By implementing these recommendations, future hardware designs can significantly enhance the efficiency of large-scale distributed AI systems while simplifying software development.
4.5. Bandwidth Contention and Latency
4.5.1. Limitations
Current hardware lacks the flexibility to dynamically allocate bandwidth between different types of traffic on NVLink and PCIe. For example, during inference, transferring KV cache data from CPU memory to the GPU can consume tens of GB/s, saturating PCIe bandwidth. If the GPU simultaneously uses IB for EP communication, contention between KV-cache transfers and EP communication can degrade overall performance and cause latency spikes.

4.5.2. Suggestions
• Dynamic NVLink/PCIe Traffic Prioritization: Hardware should support dynamic prioritization of traffic based on its type, so that, for example, latency-sensitive EP communication and bulk KV-cache transfers receive appropriate priorities. Such an approach can significantly improve scenarios such as offloading parameters or KV cache between GPU and CPU memory during training and inference.
5. Large Scale Network Driven Design
5.1. Network Co-Design: Multi-Plane Fat-Tree
During the training of DeepSeek-V3, we deployed a Multi-Plane Fat-Tree (MPFT) scale-out network, as shown in Figure 3. Each node is equipped with eight GPUs and eight IB NICs, with each GPU–NIC pair assigned to a distinct network plane. Additionally, each node has a 400 Gbps Ethernet RoCE NIC connected to a separate storage network plane for accessing the 3FS (DeepSeek-AI, 2025c) distributed file system. In the scale-out network, we used 64-port 400G IB switches, so the topology theoretically supports up to 16,384 GPUs while retaining the cost and latency advantages of a two-layer network.
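The 16,384-GPU figure follows directly from the switch radix and the plane count. The sketch below spells out that arithmetic for a two-layer fat-tree, assuming the conventional split of half the leaf ports down to NICs and half up to spines.

    # Endpoint capacity of a two-layer (leaf-spine) fat-tree per plane,
    # assuming the usual half-down/half-up port split on each leaf switch.
    def two_layer_fat_tree_endpoints(radix: int) -> int:
        down = radix // 2      # leaf ports facing NICs
        # With radix//2 uplinks per leaf, there are radix//2 spines, and each
        # spine (radix ports) can reach up to `radix` leaves.
        leaves = radix
        return leaves * down   # endpoints in one plane

    radix = 64                 # 64-port 400G IB switches
    per_plane = two_layer_fat_tree_endpoints(radix)
    print(f"per plane: {per_plane}")       # 64 * 32 = 2,048 (matches FT2 in Table 3)
    print(f"8 planes : {8 * per_plane}")   # 16,384 GPUs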
Overall, the multi-plane architecture offers significant advantages in fault isolation, robustness, load balancing, and large-scale system scalability.

Figure 4. Ideal Multi-Plane Network: each NIC is equipped with multiple physical ports, each connected to a distinct network plane. A single queue pair (QP) can then simultaneously utilize all available ports for transmitting and receiving packets, which necessitates native support for out-of-order placement within the NIC.

5.1.1. Advantages of Multi-Plane Fat-Tree Network
• Subset of Multi-Rail Fat-Tree (MRFT): The MPFT topology constitutes a specific subset of the broader MRFT architecture, so existing optimizations developed by NVIDIA and NCCL for multi-rail networks carry over seamlessly to multi-plane deployments.
Table 3. Network topology comparison. Cost estimates follow the methodology of the Slim Fly (SF) paper (Blach et al., 2025). DF denotes the canonical dragonfly topology (De Sensi et al., 2020; Kim et al., 2008; Rahman et al., 2019).

Metric               FT2     MPFT     FT3      SF       DF
Endpoints            2,048   16,384   65,536   32,928   261,632
Switches             96      768      5,120    1,568    16,352
Links                2,048   16,384   131,072  32,928   384,272
Cost [M$]            9       72       491      146      1,522
Cost/Endpoint [k$]   4.39    4.39     7.5      4.4      5.8

It is important to note that, due to current 400G NDR InfiniBand limitations, cross-plane communication requires intra-node forwarding, which introduces additional overhead.
1. NCCL All-to-All Performance: As shown in Figure 7, each GPU achieves a bandwidth exceeding 40 GB/s in a multi-plane network, providing reliable performance that meets the demands of training.

Figure 5. NCCL all-to-all performance from 32 to 128 GPUs for MRFT and MPFT networks.

2. Training Throughput for the DeepSeek-V3 Model: We also compare the training metrics of the DeepSeek-V3 model between MPFT and MRFT in Table 4. MFU (Model FLOPs Utilization) is calculated based on BF16 peak performance. Causal MFU counts only the FLOPs of the lower triangle of the attention matrix (in line with FlashAttention (Dao et al., 2022; Dao, 2023)), while non-causal MFU includes the full attention matrix. As Table 4 shows, the two topologies deliver nearly identical training throughput.
5.2. Low Latency Networks
In our model inference, large-scale EP relies heavily on all-to-all communication, which is highly sensitive to both bandwidth and latency. In the typical scenario discussed in Section 2.3.2, with a network bandwidth of 50 GB/s, the data transfer should ideally take approximately 120 μs. Therefore, intrinsic network latencies on the order of microseconds can critically impact system performance, making their effects non-negligible.

5.2.1. IB or RoCE
As shown in Table 5, IB consistently achieves lower latency, making it the preferred choice for latency-sensitive workloads such as distributed training and inference. Although IB outperforms RDMA over Converged Ethernet (RoCE) in latency, it comes with certain limitations:
• Cost: IB hardware is significantly more expensive than RoCE solutions, which limits its widespread adoption.
• Scalability: IB switches typically support only 64 ports per switch, compared to the 128 ports commonly found in RoCE switches. This restricts the scalability of IB-based clusters, particularly for large-scale deployments.
5.2.2. Recommendations
(1) Lower-Latency RoCE Switches: Reducing switch-induced latency would narrow RoCE's gap with IB, and we look forward to continued innovation in this direction.

(2) Optimized Route Policy: As shown in Figure 8, the default Equal-Cost Multi-Path (ECMP) routing policy in RoCE struggles to distribute traffic efficiently across interconnects, leading to severe congestion and performance degradation in NCCL collective-communication tests. LLM training traffic, such as in DP (Data Parallelism), tends to lack randomness, causing multiple flows to converge on the same link. In contrast, Adaptive Routing (AR) (Geoffray and Hoefler, 2008) can significantly enhance network performance by dynamically spraying packets across multiple paths. While static routing based on manually configured route tables can avoid link conflicts for specific destinations, it lacks flexibility. For large-scale all-to-all communication, adaptive routing offers superior performance and scalability.

(3) Improved Traffic Isolation or Congestion Control Mechanisms: Current RoCE switches support only a limited number of priority queues, which is insufficient for complex AI workloads involving concurrent communication patterns such as EP all-to-all and DP all-reduce traffic.
5.2. Low Latency Networks
Table 4 (excerpt). Training metrics of DeepSeek-V3 on MPFT vs. MRFT networks:

Metric                 MPFT     MRFT
…                      0.29     0.31
TFLOPS (non-causal)    432      432
TFLOPS (causal)        385      385
MFU (non-causal)       43.73%   43.68%
MFU (causal)           38.94%   38.90%

Table 5. CPU-side end-to-end latency comparison between IB, RoCE, and intra-node NVLink for 64B data transmission.

Link Layer    Same Leaf   Cross Leaf
RoCE          3.6 μs      5.6 μs
InfiniBand    2.8 μs      3.7 μs
NVLink        3.33 μs     –

5.2.3. InfiniBand GPUDirect Async (IBGDA)
We utilize IBGDA (NVIDIA, 2022; Agostini et al., 2018) to reduce latency in network communications. Traditionally, network communication involves the creation of a CPU proxy thread: once the GPU has prepared the data, it must notify the CPU proxy, which then populates the work request (WR) and rings the NIC's doorbell to initiate data transmission.
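As a rough intuition for why eliminating the CPU proxy helps, consider a simple additive latency model (all component costs below are invented placeholders, not measurements):

```python
# Rough additive latency model: CPU-proxy send vs. GPU-initiated IBGDA send.
# All component costs are invented placeholders, not measurements.
def proxy_send_us(gpu_notify=1.5, cpu_wr_setup=1.0, doorbell=0.5, wire=2.8):
    # GPU signals a CPU proxy thread, which builds the WR and rings the NIC.
    return gpu_notify + cpu_wr_setup + doorbell + wire

def ibgda_send_us(gpu_wr_setup=0.5, doorbell=0.5, wire=2.8):
    # With IBGDA the GPU builds the WR and rings the NIC doorbell itself.
    return gpu_wr_setup + doorbell + wire

print(f"proxy: {proxy_send_us():.1f} us,  IBGDA: {ibgda_send_us():.1f} us")
```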
6. Discussion and Insights for Future Hardware Architecture Design
Building on the previous sections, we summarize key architectural insights and outline future directions for hardware design tailored to large-scale AI workloads. Section 2.3.2 highlighted the importance of large-scale scale-up networks for accelerating model inference. Section 3 discussed the necessity of efficient support for low-precision computation and communication. Section 4 explored the convergence of scale-up and scale-out architectures, along with several proposed enhancements. Section 5 focused on multi-plane network topologies and identified key improvements needed for Ethernet-based interconnects.
6.1. Robustness Challenges
6.1.1. Limitations:
• Interconnect Failures: High-performance interconnects (e.g., IB and NVLink) are prone to intermittent disconnections, which can disrupt node-to-node communication. This is especially harmful in communication-heavy workloads like EP, where even brief interruptions may lead to significant performance drops or job failures.
• Single Hardware Failures: Node crashes, GPU failures, or ECC (Error-Correcting Code) memory errors can compromise long-running training jobs, often requiring costly restarts. The impact of such failures escalates in large-scale deployments, where the probability of a failure occurring somewhere in the system grows with its size.
• Silent Data Corruption: Errors that escape ECC detection can silently degrade model quality; application-level heuristics alone are insufficient for ensuring system-wide robustness.

6.1.2. Suggestions for Advanced Error Detection and Correction
To mitigate risks associated with silent corruption, hardware must incorporate advanced error detection mechanisms beyond traditional ECC. Techniques such as checksum-based validation or hardware-accelerated redundancy checks can provide higher reliability for large-scale deployments. Furthermore, hardware vendors should deliver comprehensive diagnostic toolkits to end users, empowering them to rigorously verify the integrity of their systems and proactively identify any latent silent data corruption.
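A software-level stand-in for the checksum-based validation suggested above might look like the following minimal sketch (our illustration; production systems would use hardware-accelerated checks on live buffers):

```python
# Minimal sketch: checksum-based detection of silent corruption on a buffer.
# Illustrative only; real deployments would use hardware-accelerated checks.
import hashlib

def fingerprint(buf: bytes) -> str:
    return hashlib.sha256(buf).hexdigest()

params = bytes(range(256)) * 1024          # stand-in for a parameter shard
sent = fingerprint(params)

received = bytearray(params)
received[12345] ^= 0x04                    # simulate a single silent bit flip

assert fingerprint(bytes(received)) != sent  # corruption is detected
print("bit flip detected by checksum mismatch")
```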
6.2. CPU Bottlenecks and Interconnects
While accelerator design often takes center stage, CPUs remain essential for coordinating computation, managing I/O, and sustaining system throughput. However, current architectures face several critical bottlenecks.

First, as discussed in Section 4.5, the PCIe interface between CPUs and GPUs often becomes a bandwidth bottleneck, particularly during large-scale parameter, gradient, or KV cache transfers. To mitigate this, future systems should adopt direct CPU–GPU interconnects, such as NVLink or Infinity Fabric, or integrate both CPUs and GPUs into the scale-up domain, thereby eliminating intra-node bottlenecks.

Second, host-side control work such as kernel launch and communication-library processing demands high single-core CPU performance, typically requiring base frequencies above 4 GHz. Furthermore, modern AI workloads require sufficient CPU cores per GPU to prevent control-side bottlenecks. For chiplet-based architectures, additional cores are needed to support cache-aware workload partitioning and isolation.
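The PCIe bottleneck is easy to quantify with round numbers (assumed bandwidths, e.g. ~64 GB/s for PCIe 5.0 x16 versus several hundred GB/s per direction for an NVLink-class link; the KV cache size is hypothetical):

```python
# Illustrative transfer times under assumed bandwidths (round numbers).
def transfer_ms(gigabytes: float, bandwidth_gb_per_s: float) -> float:
    return gigabytes / bandwidth_gb_per_s * 1e3

kv_cache_gb = 10.0   # hypothetical KV cache shard to move between CPU and GPU
print(f"PCIe 5.0 x16 (~64 GB/s):   {transfer_ms(kv_cache_gb, 64):6.1f} ms")
print(f"NVLink-class (~450 GB/s):  {transfer_ms(kv_cache_gb, 450):6.1f} ms")
```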
6.3. Toward Intelligent Networks for AI
To meet the demands of latency-sensitive workloads, future interconnects must prioritize both low latency and intelligent networking:
• Co-Packaged Optics: Incorporating silicon photonics enables scalable bandwidth and improved energy efficiency, both of which are critical for large-scale distributed systems.
• Lossless Network: Credit-Based Flow Control (CBFC) mechanisms ensure lossless data transmission, yet naively triggering flow control can induce severe head-of-line blocking. It is therefore imperative to deploy advanced, endpoint-driven congestion control (CC) algorithms that proactively regulate injection rates and avoid pathological congestion; a toy model of the credit mechanism follows this list.
• Traffic Prioritization: Networks should support dynamic traffic prioritization; for example, inference tasks in unified clusters should be isolated from training traffic, ensuring responsiveness for latency-sensitive applications.
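The credit mechanism and its head-of-line hazard can be illustrated with a toy model (our own sketch; credit and packet counts are arbitrary): the sender can never overflow the receiver, but it stalls whenever credits run out:

```python
# Toy credit-based flow control (CBFC) loop: the sender may only inject a
# packet while it holds a credit, so the receiver buffer can never overflow.
from collections import deque

credits = 4                      # receiver advertises 4 buffer slots
rx_buffer: deque[int] = deque()
sent = stalls = 0

for packet in range(16):
    while credits == 0:          # no credit -> sender stalls (head-of-line!)
        stalls += 1
        rx_buffer.popleft()      # receiver drains one slot...
        credits += 1             # ...and returns a credit
    rx_buffer.append(packet)
    credits -= 1
    sent += 1

print(f"sent={sent}, stall events={stalls}, buffer never overflowed")
```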
6.4. Discussion on Memory-Semantic Communication and Ordering Issue
Inter-node communication using load/store memory semantics is efficient and programmer-friendly, but current implementations are hampered by memory ordering challenges. For example, after writing data, the sender must issue an explicit memory barrier (fence) before updating a flag to notify the receiver, ensuring data consistency. This strict ordering introduces additional round-trip time (RTT) latency and can stall the issuing thread, impeding in-flight stores and reducing throughput. Similar out-of-order synchronization issues arise in message-semantic RDMA; for instance, performing RDMA atomic operations immediately after RDMA writes imposes comparable ordering constraints. Ideally, hardware-enforced ordering would cover not only memory-semantic operations but also message-semantic RDMA primitives, thus broadening its practical applicability.
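The data-then-flag handshake at issue looks roughly like the sketch below (illustrative Python; fence() is a hypothetical stand-in for a hardware store barrier, since Python's GIL hides the reordering problem):

```python
# Illustrative producer/consumer handshake over shared memory. fence() is a
# hypothetical stand-in for a hardware store barrier; real load/store fabrics
# need it so the flag can never become visible before the payload.
import threading

shared = {"payload": None, "flag": False}

def fence() -> None:
    # Placeholder: on real hardware this barrier drains in-flight stores,
    # which is exactly the stall described in the text above.
    pass

def sender() -> None:
    shared["payload"] = [1, 2, 3]   # 1. write the data
    fence()                          # 2. barrier: data before flag
    shared["flag"] = True            # 3. publish the notification flag

def receiver() -> None:
    while not shared["flag"]:        # 4. poll the flag
        pass
    assert shared["payload"] == [1, 2, 3]  # safe only because of the fence

t = threading.Thread(target=sender); t.start()
receiver(); t.join()
print("handshake completed in order")
```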
6.5. In-Network Computation and Compression
EP involves two critical all-to-all stages, dispatch and combine, that present significant opportunities for in-network optimization. The dispatch stage resembles a small-scale multicast operation, where a single message must be forwarded to multiple target devices; a hardware-level protocol enabling automatic packet replication and forwarding to multiple destinations could drastically reduce communication overhead and improve efficiency. The combine stage, acting as a small-scale reduction operation, could benefit from in-network aggregation techniques, although the small reduction scope constrains the achievable gains.
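The dispatch/combine structure can be mimicked in a few lines (a single-process toy of our own; no real networking or expert computation), which makes clear why dispatch is multicast-shaped and combine is a small weighted reduction:

```python
# Toy EP dispatch/combine on one host (illustrative; no real networking).
tokens = {"t0": [("e0", 0.7), ("e2", 0.3)],   # token -> (expert, gate weight)
          "t1": [("e1", 0.5), ("e2", 0.5)]}

# Dispatch: one message per (token, expert) pair -- replication that a
# multicast-capable fabric could perform in-network instead of at the sender.
expert_inbox: dict[str, list[tuple[str, float]]] = {}
for tok, routes in tokens.items():
    for expert, w in routes:
        expert_inbox.setdefault(expert, []).append((tok, w))

# Each expert "processes" its tokens (1.0 is a dummy stand-in for the FFN).
expert_out = {e: [(tok, w, 1.0) for tok, w in box]
              for e, box in expert_inbox.items()}

# Combine: weighted reduction per token -- a small-scale in-network reduce.
combined = {tok: 0.0 for tok in tokens}
for outs in expert_out.values():
    for tok, w, y in outs:
        combined[tok] += w * y

print(combined)   # {'t0': 1.0, 't1': 1.0} -- gate weights sum to 1
```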
6.6. Memory-Centric Innovations
6.6.1. Limitations of Memory Bandwidth
The exponential growth in model sizes has outpaced advancements in high-bandwidth memory (HBM) technology. This disparity creates a memory bottleneck, particularly in attention-heavy architectures like Transformers.

6.6.2. Suggestions:
• DRAM-Stacked Accelerators: Leveraging advanced 3D stacking technologies, DRAM dies can be vertically integrated atop a logic die, thereby enabling exceptionally high memory bandwidth, ultra-low latency, and a practical memory capacity (though stack-limited). This architectural paradigm proves remarkably advantageous for memory-bound inference workloads, where memory access is the critical bottleneck. Architectures such as SeDRAM (Wang et al., 2023) exemplify the potential of this approach, delivering unprecedented performance for memory-bound workloads.
• System-on-Wafer (SoW): Wafer-scale integration (Lie, 2022) can maximize computational density and memory bandwidth, addressing the needs of ultra-large-scale models.
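The bandwidth ceiling follows from roofline-style arithmetic (a sketch with assumed round numbers, not figures from this paper): for memory-bound decoding, per-GPU tokens/s is at most HBM bandwidth divided by the bytes streamed per token:

```python
# Roofline-style upper bound for memory-bound decode (assumed numbers).
def max_decode_tokens_per_s(hbm_gb_per_s: float, gb_per_token: float) -> float:
    # Each generated token must stream the active weights (and KV cache) once.
    return hbm_gb_per_s / gb_per_token

hbm_bandwidth = 3350.0   # GB/s, HBM3-class (assumed)
active_bytes = 37.0      # GB per token, e.g. ~37B active parameters in FP8 (assumed)
print(f"<= {max_decode_tokens_per_s(hbm_bandwidth, active_bytes):.0f} tokens/s per GPU")
```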
7. Conclusion
DeepSeek-V3 exemplifies the transformative potential of hardware-software co-design in advancing the scalability, efficiency, and robustness of large-scale AI systems. By addressing the limitations of current hardware architectures and proposing actionable recommendations, this paper provides a roadmap for the next generation of AI-optimized hardware. These innovations will be critical as AI workloads continue to grow in complexity and scale, driving the future of intelligent systems.