DeepSeek LLM 7B Base AWQ: open-source large language model for free deployment, efficient inference, and question answering

DeepSeek LLM 7B Base AWQ

Quantized by TheBloke

DeepSeek LLM 7B Base is a 7B-parameter base large language model. This release applies AWQ quantization to make inference more efficient.

Large language model

Transformers

License: Other #4-bit quantized inference #Efficient Transformer #Long-text processing

Downloads: 1,863

Released: 11/29/2023

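Model Overview

A 7B-parameter base language model developed by DeepSeek. This build supports efficient 4-bit quantized inference and suits a range of text-generation tasks.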

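Model Features

Efficient quantization

Uses 4-bit AWQ quantization to speed up inference while largely preserving model quality.

Long-context support

Supports a context length of up to 4096 tokens.

Multi-platform compatibility

Works with text-generation-webui, vLLM, Hugging Face TGI, and other inference platforms.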

Model Capabilities

Text generation

Question answering

Content creation

Code generation

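Use Cases

Content creation

Story writing

Generate coherent short stories or novel chapters

Can produce narrative content that is logically consistent and stylistically uniform

Question answering

Knowledge Q&A

Answer general knowledge questions posed by users

Aims to give accurate, context-relevant answers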

🚀 DeepSeek LLM 7B Base - AWQ

This repo contains AWQ model files for DeepSeek's DeepSeek LLM 7B Base. These files were quantized using hardware kindly provided by Massed Compute.

🚀 Quick Start

Model Information

Property	Details
Model creator	DeepSeek
Original model	DeepSeek LLM 7B Base

Repositories available

Prompt template

{prompt}
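
Because this is a base model, the template is just the raw prompt with no system or role markers. A minimal sketch of applying it to a batch of prompts follows; `PROMPT_TEMPLATE` and `apply_template` are hypothetical names used only for illustration, and the trailing newline mirrors the code samples later in this document.

```python
# Hypothetical helper: the template mirrors the "{prompt}" shown above.
# It is a plain string (not an f-string) so {prompt} stays a placeholder
# until str.format fills it in.
PROMPT_TEMPLATE = "{prompt}\n"


def apply_template(prompts):
    """Fill the template once per user prompt."""
    return [PROMPT_TEMPLATE.format(prompt=p) for p in prompts]


formatted = apply_template(["Tell me about AI", "Write a story about llamas"])
print(formatted)
```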

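✨ Key Features

About AWQ

AWQ is an efficient, accurate, and very fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference with equivalent or better quality than the most commonly used GPTQ settings.

It is supported by:

Text Generation Webui - using Loader: AutoAWQ
vLLM - Llama and Mistral models only
Hugging Face Text Generation Inference (TGI)
Transformers version 4.35.0 and later, from any code or client that supports Transformers
AutoAWQ - for use from Python code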

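📦 Installation Guide

Provided files and AWQ parameters

Only 128g GEMM models are currently released. Adding group size 32 models and GEMV kernel models is being actively considered.

Models are released as sharded safetensors files.

Branch	Bits	Group size	AWQ dataset	Sequence length	Size
main	4	128	VMware Open Instruct	4096	4.83 GB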
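
To make the "Bits" and "Group size" columns concrete, here is a minimal sketch of group-wise 4-bit quantization: every run of 128 weights gets its own scale and zero point. This is plain round-to-nearest quantization written for illustration, not the AWQ algorithm itself (AWQ additionally rescales weights based on activation statistics before quantizing), and all function names are hypothetical.

```python
import random


def quantize_group(values, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    qmax = 2**bits - 1                      # 15 levels above zero for 4-bit
    lo, hi = min(values), max(values)
    scale = max(hi - lo, 1e-8) / qmax       # step between adjacent levels
    zero = round(-lo / scale)               # integer level that represents 0.0
    q = [min(max(round(v / scale) + zero, 0), qmax) for v in values]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map integer levels back to approximate float weights."""
    return [(level - zero) * scale for level in q]


def quantize_groupwise(weights, bits=4, group_size=128):
    """Split a flat weight list into groups, each with its own scale and zero point."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(512)]   # 4 groups of 128
packed = quantize_groupwise(weights)
restored = [x for q, s, z in packed for x in dequantize_group(q, s, z)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
max_scale = max(s for _, s, _ in packed)
print(len(packed), max_err, max_scale)
```

A smaller group size means more scales and zero points to store, which costs some file size but lets each group track its local weight range more closely.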

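How to easily download and use this model in text-generation-webui

Please make sure you are using the latest version of text-generation-webui. It is strongly recommended to use the text-generation-webui one-click installers unless you are sure you know how to do a manual install.

Click the Model tab.
Under Download custom model or LoRA, enter TheBloke/deepseek-llm-7B-base-AWQ.
Click Download.
The model will start downloading. Once it has finished it will say "Done".
In the top left, click the refresh icon next to Model.
In the Model dropdown, choose the model you just downloaded: deepseek-llm-7B-base-AWQ
Select Loader: AutoAWQ.
Click Load, and the model will load and be ready for use.
If you want any custom settings, set them, then click Save settings for this model followed by Reload the Model in the top right.
Once you are ready, click the Text Generation tab and enter a prompt to get started!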

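Multi-user inference server: vLLM

See the vLLM documentation for installation and usage instructions.

Please make sure you are using vLLM version 0.2 or later.
When using vLLM as a server, pass the --quantization awq parameter.

For example: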

python3 -m vllm.entrypoints.api_server --model TheBloke/deepseek-llm-7B-base-AWQ --quantization awq --dtype auto
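
Once the server is running, a client can send it prompts over HTTP. The sketch below uses only the standard library and assumes the demo `api_server` is listening on `http://localhost:8000`, exposes a `/generate` route that accepts a JSON body with `prompt` plus sampling fields, and returns a JSON object with a `text` list. Those details are assumptions to verify against your installed vLLM version, and both function names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://localhost:8000",
                           max_tokens=128, temperature=0.8, top_p=0.95):
    """Build (but do not send) a POST request for the server's /generate route."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request; assumes the response JSON carries a "text" list."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["text"]


# Building the request needs no running server, so it is safe to inspect.
request = build_generate_request("Tell me about AI")
print(request.full_url, request.data.decode("utf-8"))
```

With a server running, `generate("Tell me about AI")` would perform the actual call.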

Using vLLM from Python code

from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]
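# Plain (non-f) string: {prompt} must stay a placeholder for str.format below.
# An f-string here would raise NameError because `prompt` is not defined yet.
prompt_template = '''{prompt}
'''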

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/deepseek-llm-7B-base-AWQ", quantization="awq", dtype="auto")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Multi-user inference server: Hugging Face Text Generation Inference (TGI)

Use TGI version 1.1.0 or later. The official Docker container is: ghcr.io/huggingface/text-generation-inference:1.1.0

Example Docker parameters:

--model-id TheBloke/deepseek-llm-7B-base-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
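
In these parameters, `--max-input-length` caps the prompt and `--max-total-tokens` caps prompt plus generated tokens combined, so the room left for generation depends on how long the prompt is. The hypothetical helper below works through that arithmetic for the values shown above.

```python
def generation_budget(max_input_length, max_total_tokens, prompt_tokens):
    """Return how many new tokens fit after a prompt of the given length."""
    if prompt_tokens > max_input_length:
        raise ValueError(
            f"prompt of {prompt_tokens} tokens exceeds max input length {max_input_length}"
        )
    return max_total_tokens - prompt_tokens


# A prompt at the input limit leaves 4096 - 3696 = 400 tokens for the reply.
print(generation_budget(3696, 4096, prompt_tokens=3696))  # 400
# A shorter prompt leaves correspondingly more room.
print(generation_budget(3696, 4096, prompt_tokens=1000))  # 3096
```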

Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later):

pip3 install huggingface-hub

from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
prompt_template=f'''{prompt}
'''

client = InferenceClient(endpoint_url)
response = client.text_generation(prompt_template,
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print("Model output: ", response)

Inference from Python code using Transformers

Install the necessary packages

Requires Transformers 4.35.0 or later.
Requires AutoAWQ 0.1.6 or later.

pip3 install --upgrade "autoawq>=0.1.6" "transformers>=4.35.0"

Note: if you are using PyTorch 2.0.1, the AutoAWQ command above will automatically upgrade you to PyTorch 2.1.0.

If you are using CUDA 11.8 and want to stay on PyTorch 2.0.1, run this command instead:

pip3 install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl

If you have problems installing AutoAWQ using the pre-built wheels, install it from source instead:

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
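
After installing, it can help to confirm that the installed versions meet the minimums stated above. The following is a minimal sketch using only the standard library. `parse_version` is a deliberately simple, hypothetical parser that keeps the leading numeric components and ignores local tags such as `+cu118`; it is not a full implementation of Python's version specification.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the requirements listed in this section.
MINIMUMS = {"transformers": "4.35.0", "autoawq": "0.1.6"}


def parse_version(text):
    """Turn '0.1.6+cu118' into (0, 1, 6); pads short versions with zeros."""
    public = text.split("+", 1)[0]          # drop a local tag such as +cu118
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if not match:                       # stop at suffixes like 'dev0'
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_minimums(minimums=MINIMUMS):
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, parse_version(installed) >= parse_version(minimum))
    return report


for name, (installed, ok) in check_minimums().items():
    print(f"{name}: installed={installed} ok={ok}")
```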

Transformers example code (requires Transformers 4.35.0 or later)

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name_or_path = "TheBloke/deepseek-llm-7B-base-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    low_cpu_mem_usage=True,
    device_map="cuda:0"
)

# Using the text streamer to stream output one token at a time
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "Tell me about AI"
prompt_template=f'''{prompt}
'''

# Convert prompt to tokens
tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

generation_params = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1
}

# Generate streamed output, visible one token at a time
generation_output = model.generate(
    tokens,
    streamer=streamer,
    **generation_params
)

# Generation without a streamer, which will include the prompt in the output
generation_output = model.generate(
    tokens,
    **generation_params
)

# Get the tokens from the output, decode them, print them
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("model.generate output: ", text_output)

# Inference is also possible via Transformers' pipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_params
)

pipe_output = pipe(prompt_template)[0]['generated_text']
print("pipeline output: ", pipe_output)
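
As the comment in the code above notes, output generated without a streamer includes the prompt. For a decoder-only model the returned ids are the prompt ids followed by the new ids, so slicing by prompt length isolates the continuation. A minimal sketch with a hypothetical helper and made-up token ids:

```python
def strip_prompt(output_ids, prompt_length):
    """Drop the first prompt_length ids, leaving only newly generated tokens."""
    return output_ids[prompt_length:]


# Made-up ids standing in for a tokenized prompt and a model's full output.
prompt_ids = [101, 7592, 2088]
output_ids = prompt_ids + [2023, 2003, 102]
print(strip_prompt(output_ids, len(prompt_ids)))  # [2023, 2003, 102]
```

Applied to the example above, the same slice would be `generation_output[0][tokens.shape[-1]:]`, which can then be passed to `tokenizer.decode`.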

💻 Usage Examples

Basic usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/deepseek-llm-7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

📚 Detailed Documentation

Compatibility

The files provided are tested to work with:

text-generation-webui using Loader: AutoAWQ
vLLM version 0.2.0 and later
Hugging Face Text Generation Inference (TGI) version 1.1.0 and later
Transformers version 4.35.0 and later
AutoAWQ version 0.1.1 and later

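📄 License

This code repository is licensed under the MIT License. Use of the DeepSeek LLM models is subject to the Model License. DeepSeek LLM supports commercial use.

See LICENSE-MODEL for more details.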

🔗 Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.

💬 Discord

For further support, and for discussion of these models and AI in general, join TheBloke AI's Discord server.

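🙏 Thanks, and How to Contribute

Thanks to the chirper.ai team! Thanks to Clay from gpus.llm-utils.org!

Many people have asked whether they can contribute. I enjoy providing models and helping people, and would love to be able to spend more time doing it, as well as expanding into new projects such as fine-tuning and training.

If you are able and willing to contribute, it will be most gratefully received and will help me keep providing more models and start work on new AI projects.

Donors get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, and other benefits.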

Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Special thanks to: Aemon Algiz.

Patreon special mentions: Brandon Frisco, LangChain4j, Spiking Neurons AB, transmissions 11, Joseph William Delisle, Nitin Borwankar, Willem Michiel, Michael Dempsey, vamX, Jeffrey Morgan, zynix, jjj, Omer Bin Jawed, Sean Connelly, jinyuan sun, Jeromy Smith, Shadi, Pawan Osman, Chadd, Elijah Stavena, Illia Dulskyi, Sebastain Graf, Stephen Murray, terasurfer, Edmond Seymore, Celu Ramasamy, Mandus, Alex, biorpg, Ajan Kanaga, Clay Pascal, Raven Klaugh, 阿明, K, ya boyyy, usrbinkat, Alicia Loh, John Villwock, ReadyPlayerEmma, Chris Smitley, Cap'n Zoog, fincy, GodLy, S_X, sidney chen, Cory Kujawski, OG, Mano Prime, AzureBlack, Pieter, Kalila, Spencer Kim, Tom X Nguyen, Stanislav Ovsiannikov, Michael Levine, Andrey, Trailburnt, Vadim, Enrico Ros, Talal Aujan, Brandon Phillips, Jack West, Eugene Pentland, Michael Davis, Will Dee, webtim, Jonathan Leane, Alps Aficionado, Rooh Singh, Tiffany J. Kim, theTransient, Luke @flexchar, Elle, Caitlyn Gatomon, Ari Malik, subjectnull, Johann-Peter Hartmann, Trenton Dambrowitz, Imad Khwaja, Asp the Wyvern, Emad Mostaque, Rainer Wilmers, Alexandros Triantafyllidis, Nicholas, Pedro Madruga, SuperWojo, Harry Royden McLaughlin, James Bentley, Olakabola, David Ziegler, Ai Maven, Jeff Scroggin, Nikolai Manek, Deo Leter, Matthew Berman, Fen Risland, Ken Nordquist, Manuel Alberto Morcote, Luke Pendergrass, TL, Fred von Graf, Randy H, Dan Guido, NimbleBox.ai, Vitor Caleffi, Gabriel Tamborski, knownsqashed, Lone Striker, Erik Bjäreholt, John Detwiler, Leonard Tan, Iucharbius

Thank you to all the generous patrons and donors! And thank you again to a16z for their generous grant.