DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

📅 2024-05-07 | 👤 DeepSeek Team | 📄 arXiv: 2405.04434 | 📊 Difficulty: Medium

MoE · Multi-head Latent Attention · Foundation Model

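Summary

DeepSeek-V2 is a large-scale Mixture-of-Experts (MoE) language model with 236B total parameters, of which only 21B are activated per token. It combines two architectural innovations, Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA compresses the key-value (KV) cache into a latent vector, which makes inference faster and cheaper; DeepSeekMoE relies on sparse computation to keep training costs low. Auxiliary load-balancing losses on the router mitigate uneven expert load. Compared with the dense DeepSeek 67B model, the paper reports 42.5% lower training cost, a 93.3% smaller KV cache, and up to 5.76× higher maximum generation throughput.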

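Key Contributions

A large-scale MoE architecture: 236B total parameters, only 21B activated per token
Multi-head Latent Attention (MLA), which compresses the KV cache into a latent vector for efficient inference
Auxiliary load-balancing losses that mitigate uneven expert load in MoE routing
Faster, cheaper inference: 93.3% smaller KV cache and up to 5.76× maximum generation throughput relative to DeepSeek 67B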
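
To make the load-balancing point concrete, here is a minimal sketch of top-k expert routing with an auxiliary balance loss. It is a generic formulation (the product of each expert's share of routed tokens and its mean router probability, summed over experts), not the exact loss defined in the paper. The function names, the `alpha` coefficient, and the toy logits are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return router probabilities and, per token, its k most probable experts."""
    probs = [softmax(row) for row in router_logits]
    choices = [
        sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
        for p in probs
    ]
    return probs, choices


def load_balance_loss(probs, choices, alpha=0.01):
    """Generic auxiliary loss: alpha * N * sum_i f_i * p_i (hypothetical alpha)."""
    num_tokens = len(probs)
    num_experts = len(probs[0])
    k = len(choices[0])

    # f[i]: fraction of all routing slots that went to expert i.
    counts = [0] * num_experts
    for chosen in choices:
        for i in chosen:
            counts[i] += 1
    f = [c / (num_tokens * k) for c in counts]

    # p[i]: mean router probability assigned to expert i.
    p = [sum(row[i] for row in probs) / num_tokens for i in range(num_experts)]

    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Toy data: 4 tokens, 4 experts, top-1 routing.
balanced_logits = [[5.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
collapsed_logits = [[5.0, 0.0, 0.0, 0.0] for _ in range(4)]

balanced = load_balance_loss(*route_top_k(balanced_logits, k=1))
collapsed = load_balance_loss(*route_top_k(collapsed_logits, k=1))
print(balanced, collapsed)
```

When every expert receives the same share of tokens, the loss sits at its minimum of `alpha`; when all tokens collapse onto one expert it grows, so adding it to the training objective nudges the router toward an even load.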

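Technical Details

Architecture	236B-parameter MoE (21B activated per token) + Multi-head Latent Attention + auxiliary load-balancing losses
Core innovations	Multi-head Latent Attention (MLA) + DeepSeekMoE
Performance	vs. DeepSeek 67B: 42.5% lower training cost, 93.3% smaller KV cache, up to 5.76× maximum generation throughput
Efficiency	Activated parameters are only ~9% of the total (21B / 236B)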
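
The ~9% figure follows directly from the two parameter counts. A quick check, with variable names chosen here for illustration:

```python
TOTAL_PARAMS_B = 236   # total parameters, in billions
ACTIVE_PARAMS_B = 21   # parameters activated per token, in billions


def activation_ratio(active, total):
    """Fraction of the model's parameters used for each token."""
    return active / total


ratio = activation_ratio(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"{ratio:.1%}")  # 8.9%
```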

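💡 Reading Notes

A key paper in the DeepSeek series. Focus on three core techniques: Multi-head Latent Attention, sparse expert activation in DeepSeekMoE, and auxiliary load-balancing losses.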