MistralLite open-source language model: free to deploy, a helpful tool for long-context processing and retrieval-based question answering

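Developed by AWS

MistralLite is a language model fine-tuned from Mistral-7B-v0.1 with enhanced long-context processing (up to 32K tokens). It suits scenarios such as long-context retrieval, topic summarization, and QA.

Large language model

Transformers

Open-source license: Apache-2.0 #long-context processing #32K-token support #efficient retrieval QA

Downloads: 61.78k

Release date: 10/16/2023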

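Model overview

MistralLite is an optimized language model. An adapted rotary position embedding and a larger sliding window substantially improve its long-context handling, which makes it a fit for resource-constrained scenarios that still need high performance.

Model features

Long-context processing

Supports context lengths of up to 32K tokens and substantially improves performance on long-text tasks.

Adapted rotary position embedding

Uses rope_theta = 1000000 to optimize long-sequence processing.

Large sliding window

Expands the sliding window size to 16384 to better capture long-range dependencies.

Efficient deployment

Can be deployed on a single AWS g5.2xlarge instance, which suits resource-constrained scenarios.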
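
The sliding-window feature listed above can be illustrated with a small mask builder. This is a minimal sketch, not code from the MistralLite repository: `sliding_window_mask` is a hypothetical helper, and the boundary convention (the window of size W counts the current token) is an assumption, since real implementations may differ by one position.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumption: causal attention where each position sees itself and at most
    window - 1 preceding positions.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Tiny example; MistralLite itself uses a window of 16384.
mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a window of 16384, any input shorter than the window is covered by full causal attention in every layer, which is one plausible reading of why a larger window helps long-range retrieval.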

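Model capabilities

Long-context retrieval

Topic summarization

QA systems

Text generation

Use cases

Information retrieval

Topic retrieval in long documents

Maintains 98% topic-retrieval accuracy at an input length of 13780 tokens.

This far exceeds the original model, Mistral-7B-Instruct-v0.1, which drops to 2% at 8313 tokens.

Line retrieval

Accurately locates a specific line in a long document.

60% accuracy at 12657 tokens (original model: 30%).

QA systems

Long-text QA

Handles QA tasks that involve complex contexts.

Test-set accuracy improves from 44.3% to 64.4%.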

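🚀 MistralLite model

MistralLite is a language model fine-tuned from Mistral-7B-v0.1, with an enhanced ability to process long contexts (up to 32K tokens). By using an adapted rotary embedding and a sliding window during fine-tuning, MistralLite performs significantly better on several long-context retrieval and answering tasks while keeping the simple structure of the original model. MistralLite is useful for applications such as long-context line and topic retrieval, summarization, and question answering.

MistralLite can be deployed on a single AWS g5.2xlarge instance with a SageMaker Hugging Face Text Generation Inference (TGI) endpoint, which makes it suitable for applications that need high performance in resource-constrained environments. You can also serve the model directly with a TGI Docker container. In addition, MistralLite supports other serving options such as vLLM, and it can be used from Python with the HuggingFace transformers and FlashAttention-2 libraries.

MistralLite is similar to Mistral-7B-Instruct-v0.1. Their similarities and differences are summarized below.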

Model	Fine-tuning on long contexts	Max context length	Rotary embedding adaptation	Sliding window size
Mistral-7B-Instruct-v0.1	up to 8K tokens	32K	rope_theta = 10000	4096
MistralLite	up to 16K tokens	32K	rope_theta = 1000000	16384
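
To give a feel for what the rope_theta change in the table does, the sketch below computes the wavelength of each rotary frequency pair for both base values. It uses the standard RoPE frequency formula rather than any code from MistralLite; `rope_wavelengths` is a hypothetical helper, and the head dimension of 128 is an assumption based on Mistral-7B's 4096 hidden size split across 32 attention heads.

```python
import math

HEAD_DIM = 128  # assumption: 4096 hidden size / 32 attention heads


def rope_wavelengths(theta: float, head_dim: int = HEAD_DIM) -> list[float]:
    """Wavelength, in token positions, of each rotary frequency pair.

    Standard RoPE uses inv_freq_i = theta ** (-2i / d) for i = 0 .. d/2 - 1,
    so the wavelength of pair i is 2 * pi * theta ** (2i / d).
    """
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]


for theta in (10_000, 1_000_000):
    longest = rope_wavelengths(theta)[-1]
    print(f"rope_theta={theta}: longest wavelength is about {longest:,.0f} positions")
```

A larger base stretches the slowest-rotating pairs over many more positions, so distant tokens are less likely to receive near-identical rotations. This is a common motivation for raising rope_theta when extending context length.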

⚠️ Important

Use the following prompt template with MistralLite.

<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>
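
The template above can be wrapped in a small helper so that every request is formatted the same way. This is a minimal sketch; `build_prompt` is a hypothetical name and not part of any MistralLite package.

```python
def build_prompt(question: str) -> str:
    """Wrap a user question in the MistralLite prompt template."""
    return f"<|prompter|>{question}</s><|assistant|>"


print(build_prompt("What are the main challenges to support a long context for LLM?"))
```

Keeping the template in one place avoids subtle mismatches, such as a missing `</s>`, between the different serving paths.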

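✨ Main features

Motivation for developing MistralLite

Since its release, Mistral-7B-Instruct-v0.1 has become increasingly popular thanks to its strong performance across a wide range of benchmarks. Most of those benchmarks, however, are evaluated on short contexts, and performance on long-context tasks has received little investigation. We therefore evaluated Mistral-7B-Instruct-v0.1 on benchmarks designed specifically to assess an LLM's ability to handle long contexts. The model was fairly competitive on contexts shorter than 4096 tokens, but its performance on longer contexts showed some limitations. Motivated by this gap, we fine-tuned the Mistral 7B model to create MistralLite, which handles long contexts significantly better than Mistral-7B-Instruct-v0.1. The detailed long-context evaluation results are below.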

Topic retrieval

Model name / input length (tokens)	2851	5568	8313	11044	13780
Mistral-7B-Instruct-v0.1	100%	50%	2%	0%	0%
MistralLite	100%	100%	100%	100%	98%

Line retrieval

Model name / input length (tokens)	3818	5661	7505	9354	11188	12657
Mistral-7B-Instruct-v0.1	98%	62%	42%	42%	32%	30%
MistralLite	98%	92%	88%	76%	70%	60%

Passkey retrieval

Model name / input length (tokens)	3264	5396	8329	10197
Mistral-7B-Instruct-v0.1	100%	50%	20%	30%
MistralLite	100%	100%	100%	100%

Question answering with long input texts

Model name	Test set accuracy	Hard subset accuracy
Mistral-7B-Instruct-v0.1	44.3%	39.7%
MistralLite	64.4%	56.2%

Example test on a long context of 13400 tokens

The context is taken from the Amazon Aurora FAQs.
Question: please tell me how does pgvector help with Generative AI and give me some examples.
Answer from MistralLite:

pgvector is an open-source extension for PostgreSQL supported by Amazon Aurora PostgreSQL-Compatible Edition.

You can use pgvector to store, search, index, and query billions of embeddings that are generated from machine learning (ML) and artificial intelligence (AI) models in your database, such as those from Amazon Bedrock (limited preview) or Amazon SageMaker. A vector embedding is a numerical representation that represents the semantic meaning of content such as text, images, and video.

With pgvector, you can query embeddings in your Aurora PostgreSQL database to perform efficient semantic similarity searches of these data types, represented as vectors, combined with other tabular data in Aurora. This enables the use of generative AI and other AI/ML systems for new types of applications such as personalized recommendations based on similar text descriptions or images, candidate match based on interview notes, customer service next best action recommendations based on successful transcripts or chat session dialogs, and more.

📚 Documentation

Model details

Attribute	Details
Developed by	AWS Contributors
Model type	Mistral-7B-v0.1
Language	English
Fine-tuned from weights	Mistral-7B-v0.1
Fine-tuning data	SLidingEncoder and Decoder (SLED), (Long) Natural Questions (NQ), OpenAssistant Conversations Dataset (OASST1)
Supported serving frameworks	Text-Generation-Inference 1.1.0, vLLM, HuggingFace transformers, HuggingFace Text Generation Inference (TGI) container on SageMaker
Model license	Apache 2.0
Contact	GitHub issues
Inference code	Github Repo

MistralLite LM-Eval results

Methodology

See https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard for details.
revision=4ececff
Note: we used --model hf-causal-experimental instead of --model hf-causal.

Results

Average	hellaswag	arc_challenge	truthful_qa (mc2)	MMLU (acc)
0.57221	0.81617	0.58874	0.38275	0.5012
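
As a quick sanity check, the reported average appears to be the unweighted mean of the four task scores in the table. The sketch below recomputes it; the dictionary simply restates the table values.

```python
# Task scores copied from the results table above.
scores = {
    "hellaswag": 0.81617,
    "arc_challenge": 0.58874,
    "truthful_qa (mc2)": 0.38275,
    "MMLU (acc)": 0.5012,
}

# Unweighted mean across the four tasks.
average = sum(scores.values()) / len(scores)
print(f"{average:.4f}")
```

The result agrees with the reported 0.57221 to four decimal places.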

💻 Usage examples

How to use MistralLite from Python code (HuggingFace transformers)

⚠️ Important

For an end-to-end Jupyter notebook example, see this link.

Install the required packages

Requires transformers 4.34.0 or later, flash-attn 2.3.1.post1 or later, and accelerate 0.23.0 or later.

pip install transformers==4.34.0
pip install flash-attn==2.3.1.post1 --no-build-isolation
pip install accelerate==0.23.0

You can then try the following sample code.

from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
import torch

model_id = "amazon/MistralLite"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.bfloat16,
                                             use_flash_attention_2=True,
                                             device_map="auto",)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
prompt = "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"

sequences = pipeline(
    prompt,
    max_new_tokens=400,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"{seq['generated_text']}")

⚠️ Important

Use the following prompt template with MistralLite.

<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>

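How to serve MistralLite with TGI

⚠️ Important

For an end-to-end Jupyter notebook example using the native TGI container, see this link.

If the input context is longer than 12K tokens, we recommend using a custom TGI container. See this link for details.

Start the TGI server

Use TGI version 1.1.0 or later. The official Docker container is ghcr.io/huggingface/text-generation-inference:1.1.0.

Example Docker parameters: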

docker run -d --gpus all --shm-size 1g -p 443:80 -v $(pwd)/models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 \
      --model-id amazon/MistralLite \
      --max-input-length 16000 \
      --max-total-tokens 16384 \
      --max-batch-prefill-tokens 16384 \
      --trust-remote-code
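
The two length limits in the Docker example leave little room for generation when the input is close to the maximum. The sketch below computes how many new tokens a request could ask for, under the assumption that TGI rejects requests whose input tokens plus max_new_tokens exceed --max-total-tokens. `max_new_tokens_allowed` is a hypothetical helper, not part of TGI or its client.

```python
MAX_INPUT_LENGTH = 16000  # --max-input-length from the Docker example
MAX_TOTAL_TOKENS = 16384  # --max-total-tokens from the Docker example


def max_new_tokens_allowed(input_tokens: int) -> int:
    """Largest max_new_tokens that keeps input + output within the server limits.

    Assumption: the server enforces input_tokens + max_new_tokens <= MAX_TOTAL_TOKENS.
    """
    if input_tokens > MAX_INPUT_LENGTH:
        raise ValueError(
            f"input has {input_tokens} tokens; the limit is {MAX_INPUT_LENGTH}"
        )
    return MAX_TOTAL_TOKENS - input_tokens


print(max_new_tokens_allowed(13400))  # budget for the 13400-token example context
print(max_new_tokens_allowed(16000))  # budget at the maximum input length
```

At the maximum input length only 384 tokens remain for generation, so a request with max_new_tokens=400 would need a slightly shorter prompt or higher server limits.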

Run inference

Example Python code for inference with TGI (requires text_generation 0.6.1 or later):

pip install text_generation==0.6.1

from text_generation import Client

SERVER_PORT = 443
SERVER_HOST = "localhost"
SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
tgi_client = Client(f"http://{SERVER_URL}", timeout=60)

def invoke_tgi(prompt,
               random_seed=1,
               max_new_tokens=400,
               print_stream=True,
               assist_role=True):
    if assist_role:
        prompt = f"<|prompter|>{prompt}</s><|assistant|>"
    output = ""
    for response in tgi_client.generate_stream(
        prompt,
        do_sample=False,
        max_new_tokens=max_new_tokens,
        return_full_text=False,
        #temperature=None,
        #truncate=None,
        #seed=random_seed,
        #typical_p=0.2,
    ):
        if hasattr(response, "token"):
            if not response.token.special:
                snippet = response.token.text
                output += snippet
                if (print_stream):
                    print(snippet, end='', flush=True)
    return output

prompt = "What are the main challenges to support a long context for LLM?"
result = invoke_tgi(prompt)

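⚠️ Important

The first inference call to MistralLite may need a warm-up period of around 10 seconds. Subsequent calls are faster and return results more promptly. This warm-up is normal and does not affect overall system performance once initialization has finished.

How to deploy MistralLite on Amazon SageMaker

⚠️ Important

For an end-to-end Jupyter notebook example using the SageMaker built-in container, see this link.

If the input context is longer than 12K tokens, we recommend using a custom Docker container. See this link for details.

Install the required packages

Requires sagemaker 2.192.1 or later.

pip install sagemaker==2.192.1

Deploy the model as a SageMaker endpoint

To deploy MistralLite on a SageMaker endpoint, follow the sample code below.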

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

image_uri = get_huggingface_llm_image_uri(
    backend="huggingface",  # or lmi
    region=region,
    version="1.1.0"
)

model_name = "MistralLite-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    'HF_MODEL_ID':'amazon/MistralLite',
    'HF_TASK':'text-generation',
    'SM_NUM_GPUS':'1',
    "MAX_INPUT_LENGTH": '16000',
    "MAX_TOTAL_TOKENS": '16384',
    "MAX_BATCH_PREFILL_TOKENS": '16384',
    "MAX_BATCH_TOTAL_TOKENS":  '16384',
}

model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,
    image_uri=image_uri
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name=model_name,
)

Run inference

To call the endpoint, follow the sample code below.

input_data = {
  "inputs": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
  "parameters": {
    "do_sample": False,
    "max_new_tokens": 400,
    "return_full_text": False,
    #"typical_p": 0.2,
    #"temperature":None,
    #"truncate":None,
    #"seed": 1,
  }
}
result = predictor.predict(input_data)[0]["generated_text"]
print(result)

Alternatively, you can use boto3. Example code is shown below.

import boto3
import json
def call_endpoint(client, prompt, endpoint_name, parameters):
    client = boto3.client("sagemaker-runtime")
    payload = {"inputs": prompt,
               "parameters": parameters}
    response = client.invo