🚀 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 is a powerful Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. Its Multi-head Latent Attention (MLA) mechanism has sparked widespread interest, and the model provides a cost-effective, high-performance solution for natural language processing tasks.
Model Download | Evaluation Results | Model Architecture | API Platform | License | Citation | Paper Link 👁️
🚀 Quick Start
Last week, the release of DeepSeek-V2 generated significant interest in MLA (Multi-head Latent Attention). In response to community requests, DeepSeek-V2-Lite is now available:
- It has 16B total parameters, 2.4B active parameters, and was trained from scratch with 5.7T tokens.
- It outperforms 7B dense and 16B MoE models on many English and Chinese benchmarks.
- It can be deployed on a single 40G GPU and fine-tuned on 8x80G GPUs (a rough memory estimate follows this list).
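As a rough sanity check on the single-GPU claim, the snippet below estimates the weight memory of a 16B-parameter model in bfloat16. The byte count per parameter and the headroom left for activations and the KV cache are assumptions for illustration, not measured figures.

```python
# Back-of-the-envelope weight-memory estimate (assumes bfloat16 weights, 2 bytes per parameter).
total_params = 16e9            # ~16B total parameters (DeepSeek-V2-Lite)
bytes_per_param = 2            # bfloat16
weight_gib = total_params * bytes_per_param / 1024**3
print(f"Approximate weight memory: {weight_gib:.1f} GiB")  # ~30 GiB, leaving headroom on a 40G GPU
```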
✨ Features
- Innovative Architectures: DeepSeek-V2 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA compresses the Key-Value (KV) cache into a latent vector for efficient inference (a minimal sketch of the idea follows this list), while DeepSeekMoE enables cost-effective training through sparse computation.
- High Performance: Demonstrates superior performance on various benchmarks in both English and Chinese, as well as in code and math tasks.
- Scalability and Efficiency: Can be deployed on a single 40G GPU and fine-tuned on multi-GPU setups.
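To make the MLA idea concrete, here is a minimal, self-contained sketch of the KV-compression step: the hidden state is projected down to a small latent vector, which is the only tensor that needs to be cached, and per-head keys and values are reconstructed from it at attention time. The dimensions match the DeepSeek-V2-Lite description later in this README, but the module and its names are illustrative, not the model's actual implementation (which also includes a decoupled rotary-embedding path not shown here).

```python
import torch
import torch.nn as nn

class LatentKVCompression(nn.Module):
    """Illustrative sketch of MLA-style KV compression (not the official implementation)."""

    def __init__(self, hidden_dim=2048, n_heads=16, head_dim=128, kv_latent_dim=512):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        # Down-projection: only this small latent vector is cached per token.
        self.to_latent = nn.Linear(hidden_dim, kv_latent_dim, bias=False)
        # Up-projections: per-head keys and values are reconstructed from the cached latent.
        self.latent_to_k = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.latent_to_v = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)

    def forward(self, hidden_states):  # (batch, seq, hidden_dim)
        latent = self.to_latent(hidden_states)                  # (batch, seq, 512) -> cached
        b, s, _ = hidden_states.shape
        k = self.latent_to_k(latent).view(b, s, self.n_heads, self.head_dim)
        v = self.latent_to_v(latent).view(b, s, self.n_heads, self.head_dim)
        return latent, k, v
```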
📦 Installation
The models are open-sourced on Hugging Face under the deepseek-ai organization; the repository names used in the examples below (deepseek-ai/DeepSeek-V2-Lite and deepseek-ai/DeepSeek-V2-Lite-Chat) can be used to download them.
Note that, due to Hugging Face constraints, the open-source code currently runs more slowly on GPUs than the internal codebase. A dedicated vLLM solution is provided for better performance; a minimal sketch is shown below.
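For reference, here is a minimal vLLM-based inference sketch. The specific arguments (for example `max_model_len` and the sampling settings) are assumptions for illustration rather than an official recipe; consult the vLLM documentation for the recommended configuration.

```python
from vllm import LLM, SamplingParams

# Illustrative vLLM setup; the flags below are assumptions, not an official recipe.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",
    trust_remote_code=True,   # the model relies on custom modeling code
    max_model_len=4096,       # assumed context length for this sketch
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Write a piece of quicksort code in C++"], sampling_params)
print(outputs[0].outputs[0].text)
```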
💻 Usage Examples
Basic Usage
Text Completion
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Lite"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
Chat Completion
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```
The complete chat template can be found in tokenizer_config.json in the Hugging Face model repository. An example of the chat template is:

```
<|begin▁of▁sentence|>User: {user_message_1}
Assistant: {assistant_message_1}<|end▁of▁sentence|>
```
📚 Documentation
Evaluation Results
Base Model
Standard Benchmark
| Benchmark | Domain | DeepSeek 7B (Dense) | DeepSeekMoE 16B | DeepSeek-V2-Lite (MoE-16B) |
| --- | --- | --- | --- | --- |
| Architecture | - | MHA+Dense | MHA+MoE | MLA+MoE |
| MMLU | English | 48.2 | 45.0 | 58.3 |
| BBH | English | 39.5 | 38.9 | 44.1 |
| C-Eval | Chinese | 45.0 | 40.6 | 60.3 |
| CMMLU | Chinese | 47.2 | 42.5 | 64.3 |
| HumanEval | Code | 26.2 | 26.8 | 29.9 |
| MBPP | Code | 39.0 | 39.2 | 43.2 |
| GSM8K | Math | 17.4 | 18.8 | 41.1 |
| Math | Math | 3.3 | 4.3 | 17.1 |
For more evaluation details, such as few-shot settings and prompts, please check the paper.
Chat Model
Standard Benchmark
| Benchmark | Domain | DeepSeek 7B Chat (SFT) | DeepSeekMoE 16B Chat (SFT) | DeepSeek-V2-Lite 16B Chat (SFT) |
| --- | --- | --- | --- | --- |
| MMLU | English | 49.7 | 47.2 | 55.7 |
| BBH | English | 43.1 | 42.2 | 48.1 |
| C-Eval | Chinese | 44.7 | 40.0 | 60.1 |
| CMMLU | Chinese | 51.2 | 49.3 | 62.5 |
| HumanEval | Code | 45.1 | 45.7 | 57.3 |
| MBPP | Code | 39.0 | 46.2 | 45.8 |
| GSM8K | Math | 62.6 | 62.2 | 72.0 |
| Math | Math | 14.7 | 15.2 | 27.9 |
Model Architecture
DeepSeek-V2 uses innovative architectures for cost-effective training and efficient inference:
- Multi-head Latent Attention (MLA): Compresses the Key-Value (KV) cache into a latent vector, eliminating the bottleneck of the inference-time KV cache.
- DeepSeekMoE: A high-performance MoE architecture that enables the training of stronger models at lower cost.
DeepSeek-V2-Lite has 27 layers, a hidden dimension of 2048, and 16 attention heads with a head dimension of 128. Its KV compression dimension is 512; queries are not compressed, and the per-head dimension for the decoupled queries and keys is 64. All FFNs except that of the first layer are replaced with MoE layers, each consisting of 2 shared experts and 64 routed experts with an intermediate hidden dimension of 1408 per expert; 6 experts are activated for each token. The snippet after this paragraph puts these numbers side by side.
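To put these numbers side by side, the snippet below collects the stated hyper-parameters and compares the per-token KV-cache size of standard multi-head attention with the MLA latent. The assumption that only the 512-dimensional latent plus the 64-dimensional decoupled key are cached per layer follows the MLA design described in the paper; treat the result as an illustration, not a measurement.

```python
# DeepSeek-V2-Lite hyper-parameters as stated above.
n_layers, hidden_dim = 27, 2048
n_heads, head_dim = 16, 128
kv_latent_dim, decoupled_dim = 512, 64

# Per-token cache elements under standard MHA: full keys and values for every head in every layer.
mha_cache = 2 * n_heads * head_dim * n_layers              # 110,592 values per token

# Per-token cache elements under MLA (assumption: latent + one shared decoupled key per layer).
mla_cache = (kv_latent_dim + decoupled_dim) * n_layers     # 15,552 values per token

print(f"MHA cache / MLA cache = {mha_cache / mla_cache:.1f}x")  # roughly a 7x reduction
```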
Training Details
DeepSeek-V2-Lite was trained from scratch on the same pre-training corpus as DeepSeek-V2, without any SFT data pollution. It uses the AdamW optimizer with specific hyperparameters, and the learning rate is scheduled with a warm-up and step-decay strategy. Training used a constant batch size of 4608 sequences, a maximum sequence length of 4K, and 5.7T tokens. Pipeline parallelism was used for deployment; after pre-training, long-context extension and SFT were performed to obtain the chat model, DeepSeek-V2-Lite Chat.
🔧 Technical Details
- Model Training: The model was trained from scratch with 5.7T tokens, using the AdamW optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and $\mathrm{weight\_decay}=0.1$. The learning rate was scheduled with a warm-up and step-decay strategy (a schedule sketch follows this list).
- Inference: MLA compresses the KV cache, reducing memory requirements and enabling efficient inference.
- Parallelism: Pipeline parallelism was used for model deployment, with a small expert-level balance loss of $\alpha_{1}=0.001$.
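The fragment below sketches how such an AdamW configuration with a warm-up-then-step-decay schedule might look in PyTorch. Only the betas and the weight decay come from the description above; the peak learning rate, warm-up length, decay milestones, and decay factor are placeholders, not the values used to train DeepSeek-V2-Lite.

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the actual model

# Betas and weight decay as stated above; the learning rate itself is a placeholder.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

warmup_steps = 2_000       # placeholder warm-up length
total_steps = 100_000      # placeholder total number of training steps
milestones = (0.8, 0.9)    # placeholder fractions of training at which the LR is stepped down
decay_factor = 0.316       # placeholder multiplier applied at each milestone

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps                                  # linear warm-up
    frac = step / total_steps
    return decay_factor ** sum(frac >= m for m in milestones)       # step decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```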
📄 License
The code is licensed under the MIT License, and the use of the model is governed by the Model License; see the license files in the repository for details.
📚 Citation
If you use this work, please cite our paper: Paper Link