🚀 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 is a powerful Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. Its Multi-head Latent Attention (MLA) mechanism has sparked widespread interest, and the model provides a cost-effective, high-performance solution for natural language processing tasks.
Model Download | Evaluation Results | Model Architecture | API Platform | License | Citation | Paper Link 👁️
🚀 Quick Start
Last week, the release of DeepSeek-V2 generated significant interest in MLA (Multi-head Latent Attention). In response to community requests, DeepSeek-V2-Lite is now available:
- It has 16B total parameters, 2.4B active parameters, and was trained from scratch with 5.7T tokens.
- It outperforms 7B dense and 16B MoE models on many English and Chinese benchmarks.
- It can be deployed on a single 40G GPU and fine-tuned on 8x80G GPUs (a rough memory estimate follows this list).
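As a rough sanity check on the single-GPU claim, the snippet below estimates the weight memory of a 16B-parameter model in bfloat16. The byte count per parameter and the headroom left for activations and the KV cache are assumptions for illustration, not measured figures.

```python
# Back-of-the-envelope weight-memory estimate (assumes bfloat16 weights, 2 bytes per parameter).
total_params = 16e9            # ~16B total parameters (DeepSeek-V2-Lite)
bytes_per_param = 2            # bfloat16
weight_gib = total_params * bytes_per_param / 1024**3
print(f"Approximate weight memory: {weight_gib:.1f} GiB")  # ~30 GiB, leaving headroom on a 40G GPU
```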
✨ Features
- Innovative Architectures: DeepSeek-V2 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA compresses the Key-Value (KV) cache into a latent vector for efficient inference (a minimal sketch of the idea follows this list), while DeepSeekMoE enables cost-effective training through sparse computation.
- High Performance: Demonstrates superior performance on various benchmarks in both English and Chinese, as well as in code and math tasks.
- Scalability and Efficiency: Can be deployed on a single 40G GPU and fine-tuned on multi-GPU setups.
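To make the MLA idea concrete, here is a minimal, self-contained sketch of the KV-compression step: the hidden state is projected down to a small latent vector, which is the only tensor that needs to be cached, and per-head keys and values are reconstructed from it at attention time. The dimensions match the DeepSeek-V2-Lite description later in this README, but the module and its names are illustrative, not the model's actual implementation (which also includes a decoupled rotary-embedding path not shown here).

```python
import torch
import torch.nn as nn

class LatentKVCompression(nn.Module):
    """Illustrative sketch of MLA-style KV compression (not the official implementation)."""

    def __init__(self, hidden_dim=2048, n_heads=16, head_dim=128, kv_latent_dim=512):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        # Down-projection: only this small latent vector is cached per token.
        self.to_latent = nn.Linear(hidden_dim, kv_latent_dim, bias=False)
        # Up-projections: per-head keys and values are reconstructed from the cached latent.
        self.latent_to_k = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.latent_to_v = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)

    def forward(self, hidden_states):  # (batch, seq, hidden_dim)
        latent = self.to_latent(hidden_states)                  # (batch, seq, 512) -> cached
        b, s, _ = hidden_states.shape
        k = self.latent_to_k(latent).view(b, s, self.n_heads, self.head_dim)
        v = self.latent_to_v(latent).view(b, s, self.n_heads, self.head_dim)
        return latent, k, v
```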
📦 Installation
The models are open-sourced on Hugging Face under the deepseek-ai organization; the repository names used in the examples below (deepseek-ai/DeepSeek-V2-Lite and deepseek-ai/DeepSeek-V2-Lite-Chat) can be used to download them.
Note that, due to Hugging Face constraints, the open-source code currently runs more slowly on GPUs than the internal codebase. A dedicated vLLM solution is provided for better performance; a minimal sketch is shown below.
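For reference, here is a minimal vLLM-based inference sketch. The specific arguments (for example `max_model_len` and the sampling settings) are assumptions for illustration rather than an official recipe; consult the vLLM documentation for the recommended configuration.

```python
from vllm import LLM, SamplingParams

# Illustrative vLLM setup; the flags below are assumptions, not an official recipe.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",
    trust_remote_code=True,   # the model relies on custom modeling code
    max_model_len=4096,       # assumed context length for this sketch
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Write a piece of quicksort code in C++"], sampling_params)
print(outputs[0].outputs[0].text)
```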
💻 Usage Examples
Basic Usage
Text Completion
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Lite"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
Chat Completion
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```
The complete chat template can be found in tokenizer_config.json in the Hugging Face model repository. An example of the chat template is:

```
<|begin▁of▁sentence|>User: {user_message_1}
Assistant: {assistant_message_1}<|end▁of▁sentence|>
```
📚 Documentation
Evaluation Results
Base Model
Standard Benchmark
| Benchmark | Domain | DeepSeek 7B (Dense) | DeepSeekMoE 16B | DeepSeek-V2-Lite (MoE-16B) |
| --- | --- | --- | --- | --- |
| Architecture | - | MHA+Dense | MHA+MoE | MLA+MoE |
| MMLU | English | 48.2 | 45.0 | 58.3 |
| BBH | English | 39.5 | 38.9 | 44.1 |
| C-Eval | Chinese | 45.0 | 40.6 | 60.3 |
| CMMLU | Chinese | 47.2 | 42.5 | 64.3 |
| HumanEval | Code | 26.2 | 26.8 | 29.9 |
| MBPP | Code | 39.0 | 39.2 | 43.2 |
| GSM8K | Math | 17.4 | 18.8 | 41.1 |
| Math | Math | 3.3 | 4.3 | 17.1 |
For more evaluation details, such as few-shot settings and prompts, please check the paper.
Chat Model
Standard Benchmark
| Benchmark | Domain | DeepSeek 7B Chat (SFT) | DeepSeekMoE 16B Chat (SFT) | DeepSeek-V2-Lite 16B Chat (SFT) |
| --- | --- | --- | --- | --- |
| MMLU | English | 49.7 | 47.2 | 55.7 |
| BBH | English | 43.1 | 42.2 | 48.1 |
| C-Eval | Chinese | 44.7 | 40.0 | 60.1 |
| CMMLU | Chinese | 51.2 | 49.3 | 62.5 |
| HumanEval | Code | 45.1 | 45.7 | 57.3 |
| MBPP | Code | 39.0 | 46.2 | 45.8 |
| GSM8K | Math | 62.6 | 62.2 | 72.0 |
| Math | Math | 14.7 | 15.2 | 27.9 |
Model Architecture
DeepSeek-V2 uses innovative architectures for cost-effective training and efficient inference:
- Multi-head Latent Attention (MLA): Compresses the Key-Value (KV) cache into a latent vector, eliminating the bottleneck of the inference-time KV cache.
- DeepSeekMoE: A high-performance MoE architecture that enables the training of stronger models at lower cost.
DeepSeek-V2-Lite has 27 layers, a hidden dimension of 2048, and 16 attention heads with a head dimension of 128. Its KV compression dimension is 512; queries are not compressed, and the per-head dimension for the decoupled queries and keys is 64. All FFNs except that of the first layer are replaced with MoE layers, each consisting of 2 shared experts and 64 routed experts with an intermediate hidden dimension of 1408 per expert; 6 experts are activated for each token. The snippet after this paragraph puts these numbers side by side.
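To put these numbers side by side, the snippet below collects the stated hyper-parameters and compares the per-token KV-cache size of standard multi-head attention with the MLA latent. The assumption that only the 512-dimensional latent plus the 64-dimensional decoupled key are cached per layer follows the MLA design described in the paper; treat the result as an illustration, not a measurement.

```python
# DeepSeek-V2-Lite hyper-parameters as stated above.
n_layers, hidden_dim = 27, 2048
n_heads, head_dim = 16, 128
kv_latent_dim, decoupled_dim = 512, 64

# Per-token cache elements under standard MHA: full keys and values for every head in every layer.
mha_cache = 2 * n_heads * head_dim * n_layers              # 110,592 values per token

# Per-token cache elements under MLA (assumption: latent + one shared decoupled key per layer).
mla_cache = (kv_latent_dim + decoupled_dim) * n_layers     # 15,552 values per token

print(f"MHA cache / MLA cache = {mha_cache / mla_cache:.1f}x")  # roughly a 7x reduction
```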
Training Details
DeepSeek-V2-Lite was trained from scratch on the same pre-training corpus as DeepSeek-V2, without any SFT data pollution. It uses the AdamW optimizer with specific hyperparameters, and the learning rate is scheduled with a warm-up and step-decay strategy. Training used a constant batch size of 4608 sequences, a maximum sequence length of 4K, and 5.7T tokens. Pipeline parallelism was used for deployment; after pre-training, long-context extension and SFT were performed to obtain the chat model, DeepSeek-V2-Lite Chat.
🔧 Technical Details
- Model Training: The model was trained from scratch with 5.7T tokens, using the AdamW optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and $\mathrm{weight\_decay}=0.1$. The learning rate was scheduled with a warm-up and step-decay strategy (a schedule sketch follows this list).
- Inference: MLA compresses the KV cache, reducing memory requirements and enabling efficient inference.
- Parallelism: Pipeline parallelism was used for model deployment, with a small expert-level balance loss of $\alpha_{1}=0.001$.
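The fragment below sketches how such an AdamW configuration with a warm-up-then-step-decay schedule might look in PyTorch. Only the betas and the weight decay come from the description above; the peak learning rate, warm-up length, decay milestones, and decay factor are placeholders, not the values used to train DeepSeek-V2-Lite.

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the actual model

# Betas and weight decay as stated above; the learning rate itself is a placeholder.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

warmup_steps = 2_000       # placeholder warm-up length
total_steps = 100_000      # placeholder total number of training steps
milestones = (0.8, 0.9)    # placeholder fractions of training at which the LR is stepped down
decay_factor = 0.316       # placeholder multiplier applied at each milestone

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps                                  # linear warm-up
    frac = step / total_steps
    return decay_factor ** sum(frac >= m for m in milestones)       # step decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```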
📄 License
The code is licensed under the MIT License, and the use of the model is governed by the Model License; see the license files in the repository for details.
📚 Citation
If you use this work, please cite our paper: Paper Link