MistralLite open-source language model: free to deploy, with support for long-context retrieval and question answering

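Developed by Amazon

MistralLite is a language model fine-tuned from Mistral-7B-v0.1 with enhanced capabilities for processing long contexts (up to 32K tokens). It is suited to scenarios such as long-context retrieval and question answering.

Large language model

Transformers

License: Apache-2.0 #long-context-processing #32K-token-support #enhanced-QA

Downloads: 20.24k

Released: 10/16/2023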

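Model overview

MistralLite is a language model optimized for long-context processing. An adapted rotary position embedding and a larger sliding window significantly improve its performance on long-context tasks while keeping the simple structure of the original model.

Model features

Long-context processing

Supports context lengths of up to 32K tokens; long-context handling is improved through an adapted rotary position embedding and a larger sliding window.

Performance improvements

Markedly better results on several long-context retrieval and question-answering tasks, such as topic retrieval and line retrieval.

Resource efficiency

Can be deployed on a single AWS g5.2x instance, which suits resource-constrained environments that still need high performance.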

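Model capabilities

Long-context text generation

Topic retrieval

Line retrieval

Passkey retrieval

Long-text question answering

Summarization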

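Use cases

Information retrieval

Topic retrieval

Retrieve information about a specific topic from a long document

In tests with inputs of 2851 to 13780 tokens, accuracy ranged from 98% to 100%

Line retrieval

Retrieve the content of a specific line from a long document

In tests with inputs of 3818 to 12657 tokens, accuracy ranged from 60% to 98%

Question-answering systems

Long-text question answering

Answer complex questions based on the content of long documents

Test-set accuracy 64.4%; hard-subset accuracy 56.2%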

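🚀 MistralLite model

MistralLite is a fine-tuned Mistral-7B-v0.1 language model with enhanced capabilities for processing long contexts (up to 32K tokens). By using an adapted Rotary Embedding and a sliding window during fine-tuning, MistralLite performs significantly better on several long-context retrieval and question-answering tasks while keeping the simple model structure of the original model. MistralLite is useful for applications such as long-context line and topic retrieval, summarization, and question answering.

It can be deployed on a single AWS g5.2x instance with a Sagemaker Huggingface Text Generation Inference (TGI) endpoint, which makes it suitable for applications that need high performance in resource-constrained environments. You can also serve MistralLite directly with TGI Docker containers. MistralLite also supports other serving options such as vLLM, and you can use it from Python with the HuggingFace transformers and FlashAttention-2 libraries.

✨ Key features

Strong long-context handling: processes contexts of up to 32K tokens.
Significant performance gains: outperforms Mistral-7B-Instruct-v0.1 on several long-context tasks.
Flexible deployment: supports several deployment options, such as a Sagemaker endpoint on an AWS instance, a TGI Docker container, or a vLLM server.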

📦 Installation guide

Using Python code (HuggingFace transformers)

The following packages are required:

transformers 4.34.0 or later
flash-attn 2.3.1.post1 or later
accelerate 0.23.0 or later

pip install transformers==4.34.0
pip install flash-attn==2.3.1.post1 --no-build-isolation
pip install accelerate==0.23.0

💻 Usage examples

Basic usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
import torch

model_id = "amazon/MistralLite"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.bfloat16,
                                             use_flash_attention_2=True,
                                             device_map="auto",)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
prompt = "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"

sequences = pipeline(
    prompt,
    max_new_tokens=400,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"{seq['generated_text']}")

Important: when using MistralLite, use the following prompt template:

<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>

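The template wraps the user's question between a `<|prompter|>` marker and `</s><|assistant|>`. A minimal sketch of a helper that applies it is shown here; the function name `build_prompt` is hypothetical and not part of the model's tooling.

```python
def build_prompt(question: str) -> str:
    """Wrap a single user question in the MistralLite prompt template.

    `build_prompt` is an illustrative helper name, not an official API.
    """
    return f"<|prompter|>{question}</s><|assistant|>"


example = build_prompt("What are the main challenges to support a long context for LLM?")
print(example)
```

The resulting string can be passed as `prompt` to the pipeline call in the basic usage example.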
📚 Detailed documentation

Comparison with Mistral-7B-Instruct-v0.1

MistralLite is similar to Mistral-7B-Instruct-v0.1; their similarities and differences are summarized below:

Attribute	Details
Model type	Fine-tuned version of Mistral-7B-v0.1
Training data	SLidingEncoder and Decoder (SLED), (Long) Natural Questions (NQ), OpenAssistant Conversations Dataset (OASST1)
Max context length	32K
Rotary embedding adaptation	Mistral-7B-Instruct-v0.1: rope_theta = 10000; MistralLite: rope_theta = 1000000
Sliding window size	Mistral-7B-Instruct-v0.1: 4096; MistralLite: 16384
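To give some intuition for the `rope_theta` change, the sketch below computes the wavelength of each rotary frequency pair under the standard RoPE formulation, where the i-th inverse frequency is `theta ** (-2i / head_dim)`. The head dimension of 128 is an assumption based on the Mistral 7B architecture, and `rope_wavelengths` is a hypothetical helper name. A larger base stretches the slowest-rotating dimensions over many more tokens, which is one plausible reason the adaptation helps at long range.

```python
import math


def rope_wavelengths(theta, head_dim=128):
    """Wavelength, in tokens, of each rotary frequency pair for a given base.

    Standard RoPE uses inv_freq_i = theta ** (-2i / head_dim) for
    i = 0 .. head_dim / 2 - 1, so the wavelength is 2 * pi / inv_freq_i.
    head_dim=128 is an assumption, not a value stated in this document.
    """
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]


for name, theta in [("Mistral-7B-Instruct-v0.1", 10_000.0), ("MistralLite", 1_000_000.0)]:
    longest = rope_wavelengths(theta)[-1]
    print(f"{name}: longest rotary wavelength is about {longest:,.0f} tokens")
```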

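Motivation for developing MistralLite

Since its release, Mistral-7B-Instruct-v0.1 has become increasingly popular because of its strong performance on a wide range of benchmarks. Most of those benchmarks, however, are evaluated on short contexts, and little has been investigated about its performance on long-context tasks. We therefore evaluated Mistral-7B-Instruct-v0.1 against benchmarks designed specifically to assess how well LLMs handle longer contexts. Although the model was fairly competitive on contexts shorter than 4096 tokens, its performance was limited on longer contexts. To improve long-context performance, we fine-tuned the Mistral 7B model and produced MistralLite, which delivers a significant improvement in long-context handling over Mistral-7B-Instruct-v0.1. The detailed long-context evaluation results are as follows: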

Topic retrieval

| Model name | Input length 2851 | Input length 5568 | Input length 8313 | Input length 11044 | Input length 13780 |
|----------|-------------:|-------------:|------------:|-----------:|-----------:|
| Mistral-7B-Instruct-v0.1 | 100% | 50% | 2% | 0% | 0% |
| MistralLite | 100% | 100% | 100% | 100% | 98% |

Line retrieval

| Model name | Input length 3818 | Input length 5661 | Input length 7505 | Input length 9354 | Input length 11188 | Input length 12657 |
|----------|-------------:|-------------:|------------:|-----------:|-----------:|-----------:|
| Mistral-7B-Instruct-v0.1 | 98% | 62% | 42% | 42% | 32% | 30% |
| MistralLite | 98% | 92% | 88% | 76% | 70% | 60% |

Passkey retrieval

| Model name | Input length 3264 | Input length 5396 | Input length 8329 | Input length 10197 |
|----------|-------------:|-------------:|------------:|-----------:|
| Mistral-7B-Instruct-v0.1 | 100% | 50% | 20% | 30% |
| MistralLite | 100% | 100% | 100% | 100% |

Question answering with long input texts

| Model name | Test-set accuracy | Hard-subset accuracy |
|----------|-------------:|-------------:|
| Mistral-7B-Instruct-v0.1 | 44.3% | 39.7% |
| MistralLite | 64.4% | 56.2% |
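As a quick way to summarize the line-retrieval figures reported above, the sketch below averages each model's accuracy over the six input lengths. The unweighted mean is only an illustrative summary and is not a metric reported by the authors; the helper name `mean_accuracy` is hypothetical.

```python
# Line-retrieval accuracy (%) at input lengths 3818, 5661, 7505, 9354, 11188, 12657,
# copied from the evaluation results in this document.
line_retrieval = {
    "Mistral-7B-Instruct-v0.1": [98, 62, 42, 42, 32, 30],
    "MistralLite": [98, 92, 88, 76, 70, 60],
}


def mean_accuracy(scores):
    """Unweighted mean of a list of accuracy percentages."""
    return sum(scores) / len(scores)


for model, scores in line_retrieval.items():
    print(f"{model}: mean line-retrieval accuracy {mean_accuracy(scores):.1f}%")
```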

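Example test on a long context of 13400 tokens

The context is from the Amazon Aurora FAQs
Question: please tell me how pgvector helps with Generative AI, and give me some examples.
MistralLite's answer:

pgvector is an open-source extension for PostgreSQL supported by Amazon Aurora PostgreSQL-Compatible Edition.

You can use pgvector to store, search, index, and query billions of embeddings generated by machine learning (ML) and artificial intelligence (AI) models in your database, such as those from Amazon Bedrock (limited preview) or Amazon SageMaker. A vector embedding is a numerical representation of the semantic meaning of content such as text, images, and video.

With pgvector, you can query embeddings in your Aurora PostgreSQL database to perform efficient semantic similarity searches of these data types, represented as vectors, combined with other tabular data in Aurora. This enables the use of generative AI and other AI/ML systems for new types of applications, such as personalized recommendations based on similar text descriptions or images, candidate matching based on interview notes, customer service next-best-action recommendations based on successful transcripts or chat session dialogs, and more.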

MistralLite LM-Eval results

Methodology

See https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
Revision: 4ececff
Note: we used --model hf-causal-experimental instead of --model hf-causal

Results

Average	hellaswag	arc_challenge	truthful_qa (mc2)	MMLU (acc)
0.57221	0.81617	0.58874	0.38275	0.5012
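The reported average appears to be the unweighted mean of the four task scores. A short check, using only the numbers in the results row above:

```python
# Task scores copied from the LM-Eval results row in this document.
scores = {
    "hellaswag": 0.81617,
    "arc_challenge": 0.58874,
    "truthful_qa (mc2)": 0.38275,
    "MMLU (acc)": 0.5012,
}

# Unweighted mean across the four tasks.
average = sum(scores.values()) / len(scores)
print(f"mean of the four task scores: {average:.6f}")
```

The result agrees with the reported 0.57221 to within rounding.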

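🔧 Technical details

Developed by: AWS contributors
Model type: Mistral-7B-v0.1
Language: English
Fine-tuned from weights: Mistral-7B-v0.1
Supported serving frameworks: HuggingFace Text Generation Inference (TGI), the TGI container on Amazon SageMaker, vLLM, and HuggingFace transformers, as described in this document
Model license: Apache 2.0
Contact: GitHub issues
Inference code: GitHub repo

📄 License

This model is released under the Apache 2.0 license.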

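Serving options

Serving MistralLite on TGI

Important:

For an end-to-end example Jupyter notebook that uses the native TGI container, refer to this link.
If the input context length is greater than 12K tokens, a custom TGI container is recommended; refer to this link.

Start the TGI server

Use TGI version 1.1.0 or later. The official Docker container is: ghcr.io/huggingface/text-generation-inference:1.1.0

Example Docker parameters: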

docker run -d --gpus all --shm-size 1g -p 443:80 -v $(pwd)/models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 \
      --model-id amazon/MistralLite \
      --max-input-length 16000 \
      --max-total-tokens 16384 \
      --max-batch-prefill-tokens 16384 \
      --trust-remote-code
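With these settings, a request is accepted only if the prompt fits within `--max-input-length` and the prompt plus `max_new_tokens` fits within `--max-total-tokens`. The sketch below works out the remaining generation budget for a given prompt size; the helper name `max_new_tokens_budget` is hypothetical, and the prompt length is assumed to be measured with the model's own tokenizer.

```python
# Limits taken from the example Docker parameters in this document.
MAX_INPUT_LENGTH = 16000
MAX_TOTAL_TOKENS = 16384


def max_new_tokens_budget(prompt_tokens,
                          max_input_length=MAX_INPUT_LENGTH,
                          max_total_tokens=MAX_TOTAL_TOKENS):
    """Return the largest max_new_tokens that still fits the total-token limit."""
    if prompt_tokens > max_input_length:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, above the input limit of {max_input_length}"
        )
    return max_total_tokens - prompt_tokens


print(max_new_tokens_budget(13400))  # budget for a prompt the size of the 13400-token example
```

Note that a prompt that fills the whole 16000-token input limit leaves room for only 384 new tokens, so a request for 400 new tokens would not fit in that case.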

Run inference

Example Python code for inference with TGI (requires text_generation 0.6.1 or later):

pip install text_generation==0.6.1

from text_generation import Client

SERVER_PORT = 443
SERVER_HOST = "localhost"
SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
tgi_client = Client(f"http://{SERVER_URL}", timeout=60)

def invoke_tgi(prompt,
               random_seed=1,
               max_new_tokens=400,
               print_stream=True,
               assist_role=True):
    if assist_role:
        prompt = f"<|prompter|>{prompt}</s><|assistant|>"
    output = ""
    for response in tgi_client.generate_stream(
        prompt,
        do_sample=False,
        max_new_tokens=max_new_tokens,
        return_full_text=False,
        #temperature=None,
        #truncate=None,
        #seed=random_seed,
        #typical_p=0.2,
    ):
        if hasattr(response, "token"):
            if not response.token.special:
                snippet = response.token.text
                output += snippet
                if (print_stream):
                    print(snippet, end='', flush=True)
    return output

prompt = "What are the main challenges to support a long context for LLM?"
result = invoke_tgi(prompt)

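Important: the first inference with MistralLite may need a brief warm-up period, which can take more than 10 seconds. Subsequent inferences should be faster and return results more promptly. This warm-up is normal and does not affect overall system performance once initialization has completed.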
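When streaming is not needed, the same request can be made with the client's non-streaming `generate` method, which returns a response object with a `generated_text` field. This is a minimal sketch: `ask_tgi` is a hypothetical helper name, and the client is passed in so the function works with any object that exposes a compatible `generate` method.

```python
def ask_tgi(client, question, max_new_tokens=400):
    """Send one question to a TGI server and return the full generated text.

    `client` is expected to be a text_generation.Client (or a compatible object).
    """
    # Apply the MistralLite prompt template before sending the request.
    prompt = f"<|prompter|>{question}</s><|assistant|>"
    response = client.generate(
        prompt,
        do_sample=False,
        max_new_tokens=max_new_tokens,
        return_full_text=False,
    )
    return response.generated_text


# Usage, assuming the TGI server from the Docker example is running locally:
#   from text_generation import Client
#   client = Client("http://localhost:443", timeout=60)
#   print(ask_tgi(client, "What are the main challenges to support a long context for LLM?"))
```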

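Deploying MistralLite on Amazon SageMaker

Important:

For an end-to-end example Jupyter notebook that uses the SageMaker built-in container, refer to this link.
If the input context length is greater than 12K tokens, a custom Docker container is recommended; refer to this link.

Install the required packages

sagemaker 2.192.1 or later is required.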

pip install sagemaker==2.192.1

Deploy the model as a SageMaker endpoint

To deploy MistralLite on a SageMaker endpoint, follow the example code below:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

image_uri = get_huggingface_llm_image_uri(
    backend="huggingface",  # or lmi
    region=region,
    version="1.1.0",
)

model_name = "MistralLite-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    'HF_MODEL_ID':'amazon/MistralLite',
    'HF_TASK':'text-generation',
    'SM_NUM_GPUS':'1',
    "MAX_INPUT_LENGTH": '16000',
    "MAX_TOTAL_TOKENS": '16384',
    "MAX_BATCH_PREFILL_TOKENS": '16384',
    "MAX_BATCH_TOTAL_TOKENS":  '16384',
}

model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,
    image_uri=image_uri
)
predictor = model.deploy(
  initial_instance_count=1,
  instance_type="ml.g5.2xlarge",
  endpoint_name=model_name,
    
)
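The container reads its settings from environment variables, so every value in the `hub` dictionary has to be a string, and the input limit has to stay below the total-token limit. The sketch below builds the same dictionary from integers and checks that constraint; `build_tgi_env` is a hypothetical helper name, not part of the SageMaker SDK.

```python
def build_tgi_env(model_id, max_input_length, max_total_tokens, num_gpus=1):
    """Build the environment dictionary for the TGI container, with string values."""
    if max_input_length >= max_total_tokens:
        raise ValueError("max_input_length must be smaller than max_total_tokens")
    return {
        "HF_MODEL_ID": model_id,
        "HF_TASK": "text-generation",
        "SM_NUM_GPUS": str(num_gpus),
        "MAX_INPUT_LENGTH": str(max_input_length),
        "MAX_TOTAL_TOKENS": str(max_total_tokens),
        # The example in this document sets both batch limits to the total-token limit.
        "MAX_BATCH_PREFILL_TOKENS": str(max_total_tokens),
        "MAX_BATCH_TOTAL_TOKENS": str(max_total_tokens),
    }


hub_env = build_tgi_env("amazon/MistralLite", 16000, 16384)
print(hub_env)
```

The resulting dictionary can be passed as `env` to `HuggingFaceModel` in place of the hand-written `hub`.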

Run inference

To call the endpoint, follow the example code below:

input_data = {
  "inputs": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
  "parameters": {
    "do_sample": False,
    "max_new_tokens": 400,
    "return_full_text": False,
    #"typical_p": 0.2,
    #"temperature":None,
    #"truncate":None,
    #"seed": 1,
  }
}
result = predictor.predict(input_data)[0]["generated_text"]
print(result)

Alternatively, call the endpoint through boto3, as in the following example:

import boto3
import json
def call_endpoint(client, prompt, endpoint_name, parameters):
    # Use the client passed in by the caller; send the prompt and generation parameters as JSON.
    payload = {"inputs": prompt,
               "parameters": parameters}
    response = client.invoke_endpoint(EndpointName=endpoint_name,
                                      Body=json.dumps(payload), 
                                      ContentType="application/json")
    output = json.loads(response["Body"].read().decode())
    result = output[0]["generated_text"]
    return result

client = boto3.client("sagemaker-runtime")
parameters = {
    "do_sample": False,
    "max_new_tokens": 400,
    "return_full_text": False,
    #"typical_p": 0.2,
    #"temperature":None,
    #"truncate":None,
    #"seed": 1,
}
endpoint_name = predictor.endpoint_name
prompt = "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"
result = call_endpoint(client, prompt, endpoint_name, parameters)
print(result)

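Serving MistralLite on vLLM

Documentation on installing and using vLLM can be found here. Important: for an end-to-end example Jupyter notebook, refer to this link.

Using vLLM as a server

When using vLLM as a server, pass the --model amazon/MistralLite parameter, for example: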

python3 -m vllm.entrypoints.api_server --model amazon/MistralLite
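Once the server is running, it can be queried over HTTP. The sketch below uses only the Python standard library and assumes the demo API server's defaults at the time of this model's release: port 8000, a `/generate` endpoint that accepts a JSON body with `prompt` plus sampling parameters, and a JSON response with a `text` list. These details are assumptions and should be checked against the installed vLLM version; the function names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(question, host="localhost", port=8000, max_tokens=100):
    """Build a POST request for the (assumed) /generate endpoint of the vLLM demo server."""
    payload = {
        "prompt": f"<|prompter|>{question}</s><|assistant|>",  # MistralLite prompt template
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(question, **kwargs):
    """Send the request and return the first generated text (requires a running server)."""
    with urllib.request.urlopen(build_generate_request(question, **kwargs)) as resp:
        return json.loads(resp.read())["text"][0]


# Usage, with the server started as shown above:
#   print(generate("What are the main challenges to support a long context for LLM?"))
```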

Using vLLM from Python code

When using vLLM from Python code, refer to the following example:

from vllm import LLM, SamplingParams

prompts = [
   "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
]
sampling_params = SamplingParams(temperature=0, max_tokens=100)

llm = LLM(model="amazon/MistralLite",)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")