DeepSeek-V2-Lite open-source language model: cost-efficient, with 32k context length

Deepseek V2 Lite

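Developed by ZZichen

DeepSeek-V2-Lite is a cost-efficient Mixture-of-Experts (MoE) language model with 16B total parameters and 2.4B activated parameters, supporting a 32k context length.

Large language model

Transformers

#Mixture-of-Experts architecture #Efficient inference optimization #Chinese-English bilingual model

Downloads: 20

Released: 5/31/2024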

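Model overview

DeepSeek-V2-Lite is a strong Mixture-of-Experts (MoE) language model that adopts Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture to provide economical training and efficient inference.

Model features

Multi-head Latent Attention (MLA)

Low-rank joint compression of keys and values removes the inference-time key-value cache bottleneck, enabling efficient inference.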
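
To make the cache saving concrete, here is a rough back-of-the-envelope sketch that compares the per-sequence KV cache of standard multi-head attention with a cache that stores a single compressed latent vector per token. All dimensions below (27 layers, 16 heads, head size 128, latent width 512) are illustrative assumptions rather than figures from this page, and the sketch ignores any extra per-token state a real MLA implementation may cache.

```python
def mha_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size for standard multi-head attention."""
    per_token = 2 * heads * head_dim  # one key and one value vector per head
    return layers * seq_len * per_token * bytes_per_value


def latent_cache_bytes(layers, latent_dim, seq_len, bytes_per_value=2):
    """Cache size when only one jointly compressed KV latent is stored per token."""
    return layers * seq_len * latent_dim * bytes_per_value


# Hypothetical dimensions chosen only for illustration.
LAYERS, HEADS, HEAD_DIM, LATENT_DIM = 27, 16, 128, 512
SEQ_LEN = 32 * 1024  # a full 32k-token context

mha = mha_cache_bytes(LAYERS, HEADS, HEAD_DIM, SEQ_LEN)
latent = latent_cache_bytes(LAYERS, LATENT_DIM, SEQ_LEN)
print(f"MHA cache:    {mha / 2**30:.2f} GiB")
print(f"Latent cache: {latent / 2**30:.2f} GiB")
print(f"Reduction:    {mha / latent:.1f}x")
```

With these assumed sizes the reduction factor is simply the ratio of the per-token widths, `2 * heads * head_dim` to `latent_dim`.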

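DeepSeekMoE architecture

A high-performance MoE architecture that allows stronger models to be trained at lower cost.

Economical training and efficient inference

With 16B total parameters and 2.4B activated parameters, the model can be deployed on a single 40G GPU.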
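
The single-GPU claim can be sanity-checked with simple arithmetic. BF16 stores each parameter in 2 bytes, and all 16B parameters must be resident in memory even though only 2.4B are activated per token. The sketch below uses the nominal 16B figure; reading "40G" as 40 × 10^9 bytes is an assumption.

```python
def weight_bytes(num_params, bytes_per_param=2):
    """Memory needed for the weights alone (BF16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param


TOTAL_PARAMS = 16_000_000_000  # nominal total parameter count (16B)
GPU_BYTES = 40 * 10**9         # assumption: "40G" read as 40e9 bytes

weights = weight_bytes(TOTAL_PARAMS)
headroom = GPU_BYTES - weights
print(f"Weights:  {weights / 1e9:.0f} GB")
print(f"Headroom: {headroom / 1e9:.0f} GB for activations and KV cache")
```

Whatever remains after the weights has to cover activations and the KV cache, so the usable context length in practice depends on that headroom.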

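Model capabilities

Text generation

Dialogue systems

Code generation

Mathematical reasoning

Chinese language processing

English language processing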

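Use cases

Natural language processing

Text completion

Generates coherent text completions for writing assistance, content generation, and similar scenarios.

Dialogue systems

Builds conversational assistants that support multi-turn dialogue and complex question answering.

Code generation

Code completion

Generates high-quality code snippets in multiple programming languages.

Scores 29.9 on HumanEval.

Mathematical reasoning

Math problem solving

Solves complex math problems, including algebra and geometry.

Scores 41.1 on GSM8K.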

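🚀 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-V2 is a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It adopts innovative architectures, including Multi-head Latent Attention (MLA) and DeepSeekMoE, bringing new advances to the field of natural language processing.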

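🚀 Quick start

Model download: DeepSeek-V2 is released in two sizes, each with a base model and a chat model.

Model	Total parameters	Activated parameters	Context length	Download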
DeepSeek-V2-Lite	16B	2.4B	32k	🤗 HuggingFace
DeepSeek-V2-Lite-Chat (SFT)	16B	2.4B	32k	🤗 HuggingFace
DeepSeek-V2	236B	21B	128k	🤗 HuggingFace
DeepSeek-V2-Chat (RL)	236B	21B	128k	🤗 HuggingFace

Running locally: inference with DeepSeek-V2-Lite in BF16 format requires a single 40GB GPU.
- Inference with Huggingface's Transformers
  - Text completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

# Load the base model in BF16; trust_remote_code=True lets Transformers run the model code shipped in the repository.
model_name = "deepseek-ai/DeepSeek-V2-Lite"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
model.generation_config = GenerationConfig.from_pretrained(model_name)
# Reuse the EOS token as the padding token.
model.generation_config.pad_token_id = model.generation_config.eos_token_id

# Continue a plain-text prompt with up to 100 new tokens.
text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

  - **Chat completion**

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

# Load the chat model in BF16.
model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
model.generation_config = GenerationConfig.from_pretrained(model_name)
# Reuse the EOS token as the padding token.
model.generation_config.pad_token_id = model.generation_config.eos_token_id

# Format the conversation with the model's chat template.
messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

# Decode only the newly generated tokens, skipping the prompt.
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)

- **Inference with vLLM (recommended)**

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Limit the context to 8192 tokens and use a single GPU (tensor parallel size 1).
max_model_len, tp_size = 8192, 1
model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
# Stop generation when the EOS token is produced.
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

# One single-turn conversation per prompt.
messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "Translate the following content into Chinese directly: DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

# Apply the chat template to turn each conversation into prompt token IDs.
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)

- **LangChain support**

from langchain_openai import ChatOpenAI

# Point the OpenAI-compatible client at the DeepSeek API endpoint.
llm = ChatOpenAI(
    model='deepseek-chat',
    openai_api_key='<your-deepseek-api-key>',  # replace with your own key
    openai_api_base='https://api.deepseek.com/v1',
    temperature=0.85,
    max_tokens=8000)

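✨ Key features

Parameter scale and training data: DeepSeek-V2-Lite has 16B total parameters and 2.4B activated parameters, and is trained from scratch on 5.7T tokens.
Performance: outperforms 7B dense and 16B MoE models on many English and Chinese benchmarks.
Deployment flexibility: deployable on a single 40G GPU and fine-tunable on 8x80G GPUs.
Innovative architecture: adopts Multi-head Latent Attention (MLA) and DeepSeekMoE for economical training and efficient inference.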

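📦 Installation guide

Due to constraints of HuggingFace, the open-source code currently runs slower on GPUs with Huggingface than the internal codebase. To run the model efficiently, a dedicated vLLM solution is provided.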

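💻 Usage examples

Basic usage

The text completion, chat completion, vLLM inference, and LangChain code examples above show the basic usage of the model.

Advanced usage

Adjust generation hyperparameters such as temperature and the maximum number of generated tokens to obtain outputs of different styles and lengths.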
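
As a rough illustration of what the temperature setting does, the sketch below applies a temperature-scaled softmax to a made-up set of logits. The logit values are hypothetical; the point is only that lower temperatures concentrate probability on the top token, while higher temperatures flatten the distribution.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits
for t in (0.3, 0.85, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The values 0.3 and 0.85 match the temperatures used in the vLLM and LangChain examples; 1.5 is added only for contrast.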

📚 Detailed documentation

Model download

Download links are provided for base and chat models of different sizes.

Evaluation results

Base model

| Benchmark | Domain | DeepSeek 7B (Dense) | DeepSeekMoE 16B | DeepSeek-V2-Lite (MoE-16B) |
|:-------------:|:----------:|:--------------:|:-----------------:|:--------------------------:|
| Architecture | - | MHA+Dense | MHA+MoE | MLA+MoE |
| MMLU | English | 48.2 | 45.0 | 58.3 |
| BBH | English | 39.5 | 38.9 | 44.1 |
| C-Eval | Chinese | 45.0 | 40.6 | 60.3 |
| CMMLU | Chinese | 47.2 | 42.5 | 64.3 |
| HumanEval | Code | 26.2 | 26.8 | 29.9 |
| MBPP | Code | 39.0 | 39.2 | 43.2 |
| GSM8K | Math | 17.4 | 18.8 | 41.1 |
| Math | Math | 3.3 | 4.3 | 17.1 |

Chat model

| Benchmark | Domain | DeepSeek 7B Chat (SFT) | DeepSeekMoE 16B Chat (SFT) | DeepSeek-V2-Lite 16B Chat (SFT) |
|:-----------:|:----------------:|:------------------:|:---------------:|:---------------------:|
| MMLU | English | 49.7 | 47.2 | 55.7 |
| BBH | English | 43.1 | 42.2 | 48.1 |
| C-Eval | Chinese | 44.7 | 40.0 | 60.1 |
| CMMLU | Chinese | 51.2 | 49.3 | 62.5 |
| HumanEval | Code | 45.1 | 45.7 | 57.3 |
| MBPP | Code | 39.0 | 46.2 | 45.8 |
| GSM8K | Math | 62.6 | 62.2 | 72.0 |
| Math | Math | 14.7 | 15.2 | 27.9 |
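
The chat-model results above can also be checked programmatically. The sketch below copies the chat-model scores and counts the benchmarks on which DeepSeek-V2-Lite Chat scores higher than both baselines; the dictionary layout and variable names are just one convenient choice.

```python
# Scores copied from the chat-model results, in the order
# (DeepSeek 7B Chat, DeepSeekMoE 16B Chat, DeepSeek-V2-Lite 16B Chat).
CHAT_SCORES = {
    "MMLU": (49.7, 47.2, 55.7),
    "BBH": (43.1, 42.2, 48.1),
    "C-Eval": (44.7, 40.0, 60.1),
    "CMMLU": (51.2, 49.3, 62.5),
    "HumanEval": (45.1, 45.7, 57.3),
    "MBPP": (39.0, 46.2, 45.8),
    "GSM8K": (62.6, 62.2, 72.0),
    "Math": (14.7, 15.2, 27.9),
}

# A "win" means V2-Lite Chat beats the better of the two baselines.
wins = [name for name, (dense, moe, lite) in CHAT_SCORES.items() if lite > max(dense, moe)]
exceptions = [name for name in CHAT_SCORES if name not in wins]
print(f"V2-Lite Chat leads on {len(wins)} of {len(CHAT_SCORES)} benchmarks")
print("Exceptions:", exceptions)
```

On these numbers MBPP is the only chat benchmark where a baseline (DeepSeekMoE 16B Chat, 46.2 versus 45.8) stays ahead.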

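🔧 Technical details

Model architecture

Attention: MLA (Multi-head Latent Attention) significantly compresses the key-value (KV) cache into a latent vector, which guarantees efficient inference.
Feed-forward networks (FFNs): the DeepSeekMoE architecture uses sparse computation to train strong models at an economical cost.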
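
The sparse computation behind an MoE feed-forward layer can be sketched as top-k routing: a gate scores every expert for a token, only the k highest-scoring experts run, and their outputs are mixed by the normalized gate weights. This is a generic toy illustration, not the DeepSeekMoE implementation; the expert functions, the gate scores, and k = 2 are all hypothetical.

```python
import math


def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected experts only, shifted by the top score for stability.
    exps = [math.exp(gate_logits[i] - gate_logits[top[0]]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, gate_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    routes = top_k_route(gate_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, routes


# Four toy "experts", each a simple scalar function standing in for an FFN.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
gate_logits = [0.1, 2.0, 1.0, -1.0]  # hypothetical gate scores for one token

output, routes = moe_forward(3.0, experts, gate_logits)
print("active experts:", [i for i, _ in routes])
print("output:", round(output, 3))
```

Only two of the four experts are evaluated for this token, which is the sense in which the activated parameter count is much smaller than the total.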

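Training details

DeepSeek-V2-Lite is trained from scratch on the same pre-training corpus as DeepSeek-V2, which is not polluted by any SFT data. It uses the AdamW optimizer with a warmup-and-step-decay learning rate schedule. Training uses a maximum sequence length of 4K and covers 5.7T tokens. After pre-training, long-context extension and SFT are applied to obtain the chat model.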
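
A warmup-and-step-decay schedule of the kind mentioned above could look like the following sketch. The peak learning rate, warmup length, decay milestones, and decay factor are placeholder values chosen for illustration; this page does not state the actual settings.

```python
def learning_rate(step, total_steps, peak_lr=1e-3, warmup_steps=100,
                  milestones=(0.6, 0.9), decay=0.5):
    """Linear warmup to peak_lr, then multiply by `decay` at each milestone.

    Milestones are fractions of total_steps. All defaults are hypothetical.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    passed = sum(1 for m in milestones if step >= m * total_steps)
    return peak_lr * decay ** passed


TOTAL = 1000
for s in (0, 50, 100, 650, 950):
    print(s, learning_rate(s, TOTAL))
```

The rate rises linearly during warmup, stays at the peak, and then drops in discrete steps rather than decaying continuously.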

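📄 License

The code repository is licensed under the MIT License. Use of the DeepSeek-V2 base/chat models is subject to the Model License, which permits commercial use.

Citation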

@misc{deepseekv2,
      title={DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model}, 
      author={DeepSeek-AI},
      year={2024},
      eprint={2405.04434},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}