DeepSeek-V2-Lite open-source language model: economical and efficient, with a 32k context length


Deepseek V2 Lite

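Developed by ZZichen

DeepSeek-V2-Lite is an economical and efficient Mixture-of-Experts (MoE) language model with 16B total parameters, 2.4B activated parameters, and a 32k context length.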

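Large Language Model

Transformers

#Mixture-of-Experts architecture #Efficient inference optimization #Chinese-English bilingual model

Downloads: 20

Release date: 5/31/2024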

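Model Overview

DeepSeek-V2-Lite is a strong Mixture-of-Experts (MoE) language model that adopts Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture to make training economical and inference efficient.

Model Features

Multi-head Latent Attention (MLA)

Low-rank joint compression of keys and values removes the key-value cache bottleneck at inference time, enabling efficient inference.

DeepSeekMoE architecture

A high-performance MoE architecture that makes it possible to train stronger models at lower cost.

Economical training and efficient inference

16B total parameters and 2.4B activated parameters; deployable on a single 40G GPU.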
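The single-GPU claim can be sanity-checked with simple arithmetic. The sketch below is a rough estimate only: it assumes the rounded 16B parameter count, counts weights alone (no KV cache, activations, or framework overhead), and treats the 40G figure as 40 GiB. Note that all 16B parameters must be resident in memory even though only 2.4B are activated per token.

```python
# Back-of-the-envelope estimate of weight memory (weights only).
# Bytes per parameter for each dtype are standard widths.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}

def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the memory needed to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

total_params = 16e9   # total parameters (16B, rounded)
gpu_memory_gib = 40   # single 40G GPU, treated here as 40 GiB (assumption)

for dtype in BYTES_PER_PARAM:
    need = weight_memory_gib(total_params, dtype)
    fits = "fits" if need < gpu_memory_gib else "does not fit"
    print(f"{dtype}: {need:.1f} GiB of weights, {fits} in {gpu_memory_gib} GiB")
```

Under these assumptions, BF16 weights take roughly 30 GiB, which leaves some headroom on a 40G card, while FP32 weights would not fit.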

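Model Capabilities

Text generation

Dialogue systems

Code generation

Mathematical reasoning

Chinese language processing

English language processing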

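Use Cases

Natural language processing

Text completion

Generates coherent text completions for writing assistance, content generation, and similar scenarios.

Dialogue systems

Builds conversational assistants that support multi-turn dialogue and complex question answering.

Code generation

Code completion

Generates code snippets in multiple programming languages.

The base model scores 29.9 on HumanEval.

Mathematical reasoning

Math problem solving

Solves math problems, including algebra and geometry.

The base model scores 41.1 on GSM8K.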

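🚀 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-V2 is a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It adopts innovative architectures, including Multi-head Latent Attention (MLA) and DeepSeekMoE.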

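🚀 Quick Start

Model downloads: DeepSeek-V2 is released in two sizes, each with a base model and a chat model.

| Model | Total parameters | Activated parameters | Context length | Download |
|:---:|:---:|:---:|:---:|:---:|
| DeepSeek-V2-Lite | 16B | 2.4B | 32k | 🤗 HuggingFace |
| DeepSeek-V2-Lite-Chat (SFT) | 16B | 2.4B | 32k | 🤗 HuggingFace |
| DeepSeek-V2 | 236B | 21B | 128k | 🤗 HuggingFace |
| DeepSeek-V2-Chat (RL) | 236B | 21B | 128k | 🤗 HuggingFace |

Running locally: BF16 inference with DeepSeek-V2-Lite requires one 40GB GPU.
- Inference with Huggingface's Transformers
  - Text completion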

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Lite"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

  - Chat completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)

- Inference with vLLM (recommended)

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8192, 1
model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "Translate the following content into Chinese directly: DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)

- LangChain support

from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model='deepseek-chat',
    openai_api_key="<your-deepseek-api-key>",  # replace with your own API key
    openai_api_base='https://api.deepseek.com/v1',
    temperature=0.85,
    max_tokens=8000)

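✨ Key Features

Parameter scale and training data: DeepSeek-V2-Lite has 16B total parameters and 2.4B activated parameters, and is trained from scratch on 5.7T tokens.
Performance: outperforms 7B dense and 16B MoE models on many English and Chinese benchmarks.
Deployment flexibility: deployable on a single 40G GPU and fine-tunable on 8x80G GPUs.
Innovative architecture: Multi-head Latent Attention (MLA) and DeepSeekMoE enable economical training and efficient inference.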
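The gap between total and activated parameters comes from sparse expert routing: each token is sent to only a few experts, so only those experts' weights take part in the computation. The sketch below shows a generic top-k router, not DeepSeekMoE's actual routing, whose details this page does not give. The expert count, the value of k, and the renormalization of gate weights are all assumptions made for illustration.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

# Hypothetical sizes chosen for illustration only.
NUM_EXPERTS, TOP_K = 8, 2

random.seed(0)
# Stand-in for the router's per-expert scores for one token.
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
gates = route_top_k(logits, TOP_K)

print("selected experts and gate weights:", gates)
print(f"fraction of experts run for this token: {TOP_K / NUM_EXPERTS:.2f}")
```

In this toy setting only 2 of 8 experts run per token, which is the same kind of saving that lets a 16B-parameter model activate 2.4B parameters per token.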

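📦 Installation Guide

Due to the constraints of HuggingFace, the open-source code currently runs slower on GPUs with Huggingface than the internal codebase does. A dedicated vLLM solution is provided to run the model efficiently.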

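💻 Usage Examples

Basic usage

The text completion, chat completion, vLLM inference, and LangChain code examples above show the basic ways to use the model.

Advanced usage

Adjust generation hyperparameters such as the temperature and the maximum number of generated tokens to obtain outputs of different style and length.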
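To make the effect of temperature concrete, the following standalone sketch applies temperature scaling to a made-up set of next-token logits. It does not call the model; the logits are hypothetical, and the temperatures 0.3 and 0.85 are simply the values used in the vLLM and LangChain examples above.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; T < 1 sharpens, T > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits for a three-token vocabulary.
logits = [2.0, 1.0, 0.0]

for t in (0.3, 0.85, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Lower temperatures concentrate probability on the top-scoring token, giving more deterministic output; higher temperatures spread it out, giving more varied output. The maximum number of generated tokens (`max_new_tokens` in Transformers, `max_tokens` in vLLM and LangChain) only caps the output length.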

📚 Detailed Documentation

Model downloads

Download links are provided for base and chat models of different sizes.

Evaluation results

Base model

| Benchmark | Domain | DeepSeek 7B (Dense) | DeepSeekMoE 16B | DeepSeek-V2-Lite (MoE-16B) |
|:---:|:---:|:---:|:---:|:---:|
| Architecture | - | MHA+Dense | MHA+MoE | MLA+MoE |
| MMLU | English | 48.2 | 45.0 | 58.3 |
| BBH | English | 39.5 | 38.9 | 44.1 |
| C-Eval | Chinese | 45.0 | 40.6 | 60.3 |
| CMMLU | Chinese | 47.2 | 42.5 | 64.3 |
| HumanEval | Code | 26.2 | 26.8 | 29.9 |
| MBPP | Code | 39.0 | 39.2 | 43.2 |
| GSM8K | Math | 17.4 | 18.8 | 41.1 |
| Math | Math | 3.3 | 4.3 | 17.1 |

Chat model

| Benchmark | Domain | DeepSeek 7B Chat (SFT) | DeepSeekMoE 16B Chat (SFT) | DeepSeek-V2-Lite 16B Chat (SFT) |
|:---:|:---:|:---:|:---:|:---:|
| MMLU | English | 49.7 | 47.2 | 55.7 |
| BBH | English | 43.1 | 42.2 | 48.1 |
| C-Eval | Chinese | 44.7 | 40.0 | 60.1 |
| CMMLU | Chinese | 51.2 | 49.3 | 62.5 |
| HumanEval | Code | 45.1 | 45.7 | 57.3 |
| MBPP | Code | 39.0 | 46.2 | 45.8 |
| GSM8K | Math | 62.6 | 62.2 | 72.0 |
| Math | Math | 14.7 | 15.2 | 27.9 |

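🔧 Technical Details

Model architecture

Attention: MLA (Multi-head Latent Attention) compresses the key-value (KV) cache into a latent vector, significantly reducing its size and keeping inference efficient.
Feed-forward networks (FFNs): the DeepSeekMoE architecture uses sparse computation to train strong models at an economical cost.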
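The idea behind compressing the KV cache into a latent vector can be sketched as follows. This is a deliberately simplified, single-head illustration in plain Python, not the model's implementation: it omits multiple heads and positional encoding, and the dimensions, weight names (`w_down`, `w_up_k`, `w_up_v`), and random weights are all hypothetical.

```python
import random

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of small random weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy sizes: hidden size 16, latent size 4.
D_MODEL, D_LATENT = 16, 4
rng = random.Random(0)

w_down = random_matrix(D_LATENT, D_MODEL, rng)  # shared down-projection
w_up_k = random_matrix(D_MODEL, D_LATENT, rng)  # latent -> key
w_up_v = random_matrix(D_MODEL, D_LATENT, rng)  # latent -> value

kv_cache = []  # stores one small latent vector per token
for _ in range(3):  # three toy tokens
    hidden = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]
    kv_cache.append(matvec(w_down, hidden))

# Keys and values are rebuilt from the cached latents when attention needs them.
keys = [matvec(w_up_k, c) for c in kv_cache]
values = [matvec(w_up_v, c) for c in kv_cache]

standard_floats = 2 * D_MODEL * len(kv_cache)  # separate K and V per token
latent_floats = D_LATENT * len(kv_cache)       # one shared latent per token
print(f"cached floats: standard={standard_floats}, latent={latent_floats}")
```

With these toy sizes the cache shrinks from 96 floats to 12, because keys and values share one low-rank latent instead of being stored separately at full width.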

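Training details

DeepSeek-V2-Lite is trained from scratch on the same pre-training corpus as DeepSeek-V2, which is not polluted by any SFT data. Training uses the AdamW optimizer with a warmup-and-step-decay learning rate schedule, a maximum sequence length of 4K, and 5.7T tokens. After pre-training, long-context extension and SFT produce the chat model.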
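A warmup-and-step-decay schedule of the kind mentioned above can be sketched like this. The peak rate, warmup length, decay points, and decay factor are illustrative placeholders, not the settings used to train DeepSeek-V2-Lite, which this page does not state.

```python
def warmup_step_decay_lr(step, max_lr, warmup_steps, decay_points, decay_factor):
    """Linear warmup to max_lr, then multiply by decay_factor at each decay point."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    num_decays = sum(1 for point in decay_points if step >= point)
    return max_lr * decay_factor ** num_decays

# All values below are illustrative placeholders, not the model's settings.
MAX_LR = 1e-3
WARMUP_STEPS = 100
DECAY_POINTS = (800, 900)  # steps at which the rate drops
DECAY_FACTOR = 0.5

for step in (0, 50, 99, 500, 800, 950):
    lr = warmup_step_decay_lr(step, MAX_LR, WARMUP_STEPS, DECAY_POINTS, DECAY_FACTOR)
    print(f"step {step:>4}: lr = {lr:.6f}")
```

The rate climbs linearly during warmup, holds at its peak, and then drops in discrete steps rather than decaying continuously.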

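📄 License

The code repository is licensed under the MIT License. Use of the DeepSeek-V2 Base/Chat models is subject to the Model License, which permits commercial use.

Citation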

@misc{deepseekv2,
      title={DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model}, 
      author={DeepSeek-AI},
      year={2024},
      eprint={2405.04434},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}