MistralLite open-source language model: free to deploy, with support for long-context retrieval and question answering

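MistralLite

Developed by Amazon

MistralLite is a language model fine-tuned from Mistral-7B-v0.1 with enhanced handling of long contexts (up to 32K tokens). It is suited to scenarios such as long-context retrieval and question answering.

Large language model

Transformers

Open-source license: Apache-2.0 #long-context processing #32K-token support #enhanced question answering

Downloads: 20.24k

Release date: 10/16/2023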

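Model overview

MistralLite is a language model optimized for long-context processing. An adapted rotary position embedding and a sliding window substantially improve its performance on long-context tasks while preserving the simple structure of the original model.

Model features

Long-context processing

Supports context lengths of up to 32K tokens, with long-context handling optimized through an adapted rotary position embedding and a sliding window.

Performance gains

Shows marked improvements on several long-context retrieval and question-answering tasks, including topic retrieval and line retrieval.

Resource efficiency

Can be deployed on a single AWS g5.2x instance, which suits applications that need high performance in resource-constrained environments.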

Model capabilities

Long-context text generation

Topic retrieval

Line retrieval

Passkey retrieval

Long-text question answering

Summarization

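Use cases

Information retrieval

Topic retrieval

Find information on a specific topic within a long document

In tests with inputs of 2851 to 13780 tokens, accuracy ranged from 98% to 100%

Line retrieval

Find the information on a specific line within a long document

In tests with inputs of 3818 to 12657 tokens, accuracy ranged from 60% to 98%

Question-answering systems

Long-text question answering

Answer complex questions based on the content of a long document

Test-set accuracy of 64.4%; hard-subset accuracy of 56.2%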

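🚀 MistralLite model

MistralLite is a language model fine-tuned from Mistral-7B-v0.1 with enhanced capabilities for processing long contexts (up to 32K tokens). By using an adapted Rotary Embedding and sliding window during fine-tuning, MistralLite performs significantly better on several long-context retrieval and answering tasks while keeping the simple structure of the original model. MistralLite is useful for applications such as long-context line and topic retrieval, summarization, and question answering. It can be deployed on a single AWS g5.2x instance with a SageMaker Hugging Face Text Generation Inference (TGI) endpoint, which makes it suitable for applications that require high performance in resource-constrained environments. You can also serve the model directly using TGI Docker containers. MistralLite supports other serving methods such as vLLM as well, and it can be used in Python with the HuggingFace transformers and FlashAttention-2 libraries.

MistralLite is similar to Mistral-7B-Instruct-v0.1; their similarities and differences are summarized below.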

Model	Fine-tuned on long contexts	Max context length	RotaryEmbedding adaptation	Sliding window size
Mistral-7B-Instruct-v0.1	up to 8K tokens	32K	rope_theta = 10000	4096
MistralLite	up to 16K tokens	32K	rope_theta = 1000000	16384
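
To give a feel for what raising rope_theta from 10000 to 1000000 changes, the sketch below computes the standard rotary-embedding inverse frequencies and the wavelength of the slowest-rotating dimension pair for both values. It is illustrative arithmetic rather than MistralLite's actual implementation; the head dimension of 128 and the helper names are assumptions made for this example.

```python
import math

# Assumption for illustration: per-head dimension of 128 (4096 hidden size / 32 heads).
HEAD_DIM = 128


def rope_inverse_frequencies(rope_theta: float, head_dim: int = HEAD_DIM) -> list[float]:
    """Standard RoPE inverse frequencies: rope_theta^(-2i/d) for i = 0 .. d/2 - 1."""
    return [rope_theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(rope_theta: float, head_dim: int = HEAD_DIM) -> float:
    """Positions needed for the slowest-rotating dimension pair to complete one turn."""
    slowest = rope_inverse_frequencies(rope_theta, head_dim)[-1]
    return 2.0 * math.pi / slowest


for theta in (10_000, 1_000_000):
    wavelength = longest_wavelength(theta)
    print(f"rope_theta={theta:>9,}: longest wavelength ~ {wavelength:,.0f} positions")
```

A larger base stretches the slow rotations over many more positions, which is one common way to make distant positions easier to tell apart in long inputs.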

⚠️ Important

Use the following prompt template for MistralLite:

<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>
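
A small helper can keep the template in one place so that every caller formats prompts the same way. This is a minimal sketch: the function and constant names are hypothetical, and stripping surrounding whitespace from the question is a choice made here, not something the model card prescribes.

```python
# Tags taken from the prompt template above; the names below are hypothetical.
PROMPTER_TAG = "<|prompter|>"
ASSISTANT_TAG = "<|assistant|>"
END_OF_TURN = "</s>"


def build_prompt(question: str) -> str:
    """Wrap a single-turn question in the MistralLite prompt template."""
    # Stripping whitespace is an assumption of this sketch, not a documented requirement.
    return f"{PROMPTER_TAG}{question.strip()}{END_OF_TURN}{ASSISTANT_TAG}"


prompt = build_prompt("What are the main challenges to support a long context for LLM?")
print(prompt)
```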

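✨ Key features

Motivation for developing MistralLite

Since its release, Mistral-7B-Instruct-v0.1 has become increasingly popular because of its strong performance across a wide range of benchmarks. Most of those benchmarks, however, are evaluated on short contexts, and the model's performance on long-context tasks had received little attention. We therefore evaluated Mistral-7B-Instruct-v0.1 against benchmarks designed specifically to assess how well LLMs handle longer contexts. The model was quite competitive on contexts shorter than 4096 tokens, but its performance showed limitations on longer contexts. Motivated to improve long-context performance, we fine-tuned the Mistral 7B model and produced MistralLite, which handles long contexts significantly better than Mistral-7B-Instruct-v0.1. The detailed long-context evaluation results are as follows.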

Topic retrieval (column headings are input lengths in tokens)

| Model name | 2851 | 5568 | 8313 | 11044 | 13780 |
| ------ | ------ | ------ | ------ | ------ | ------ |
| Mistral-7B-Instruct-v0.1 | 100% | 50% | 2% | 0% | 0% |
| MistralLite | 100% | 100% | 100% | 100% | 98% |

Line retrieval (column headings are input lengths in tokens)

| Model name | 3818 | 5661 | 7505 | 9354 | 11188 | 12657 |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| Mistral-7B-Instruct-v0.1 | 98% | 62% | 42% | 42% | 32% | 30% |
| MistralLite | 98% | 92% | 88% | 76% | 70% | 60% |

Passkey retrieval (column headings are input lengths in tokens)

| Model name | 3264 | 5396 | 8329 | 10197 |
| ------ | ------ | ------ | ------ | ------ |
| Mistral-7B-Instruct-v0.1 | 100% | 50% | 20% | 30% |
| MistralLite | 100% | 100% | 100% | 100% |

Question answering with long input texts

| Model name | Test set accuracy | Hard subset accuracy |
| ------ | ------ | ------ |
| Mistral-7B-Instruct-v0.1 | 44.3% | 39.7% |
| MistralLite | 64.4% | 56.2% |
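
For a quick side-by-side reading of the retrieval results, the sketch below averages each model's accuracy across the tested input lengths. The per-length numbers are copied from the results above; the averages themselves are a convenience computed here, not a metric reported by the authors, and the names are hypothetical.

```python
from statistics import mean

# Accuracy (%) at each tested input length, copied from the reported results.
RESULTS = {
    "topic retrieval": {
        "Mistral-7B-Instruct-v0.1": [100, 50, 2, 0, 0],
        "MistralLite": [100, 100, 100, 100, 98],
    },
    "line retrieval": {
        "Mistral-7B-Instruct-v0.1": [98, 62, 42, 42, 32, 30],
        "MistralLite": [98, 92, 88, 76, 70, 60],
    },
    "passkey retrieval": {
        "Mistral-7B-Instruct-v0.1": [100, 50, 20, 30],
        "MistralLite": [100, 100, 100, 100],
    },
}


def mean_accuracy(task: str) -> dict[str, float]:
    """Unweighted mean accuracy per model over all tested input lengths for one task."""
    return {model: mean(scores) for model, scores in RESULTS[task].items()}


for task in RESULTS:
    for model, average in mean_accuracy(task).items():
        print(f"{task:18s} {model:26s} {average:6.1f}%")
```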

Example test on a long context of 13400 tokens

The context is taken from the Amazon Aurora FAQs.
Question: please tell me how does pgvector help with Generative AI and give me some examples.
Answer from MistralLite:

pgvector is an open-source extension for PostgreSQL supported by Amazon Aurora PostgreSQL-Compatible Edition.

You can use pgvector to store, search, index, and query billions of embeddings that are generated from machine learning (ML) and artificial intelligence (AI) models in your database, such as those from Amazon Bedrock (limited preview) or Amazon SageMaker. A vector embedding is a numerical representation that represents the semantic meaning of content such as text, images, and video.

With pgvector, you can query embeddings in your Aurora PostgreSQL database to perform efficient semantic similarity searches of these data types, represented as vectors, combined with other tabular data in Aurora. This enables the use of generative AI and other AI/ML systems for new types of applications such as personalized recommendations based on similar text descriptions or images, candidate match based on interview notes, customer service next best action recommendations based on successful transcripts or chat session dialogs, and more.

📚 Documentation

Model details

Attribute	Details
Developed by	AWS Contributors
Model type	Mistral-7B-v0.1
Language	English
Fine-tuned from weights	Mistral-7B-v0.1
Fine-tuning data	SLidingEncoder and Decoder (SLED), (Long) Natural Questions (NQ), OpenAssistant Conversations Dataset (OASST1)
Supported serving frameworks	Text-Generation-Inference 1.1.0, vLLM, HuggingFace transformers, HuggingFace Text Generation Inference (TGI) container on SageMaker
Model license	Apache 2.0
Contact	GitHub issues
Inference code	GitHub repo

MistralLite LM-Eval results

Methodology

See https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard for details.
revision=4ececff
Note: we used --model hf-causal-experimental instead of --model hf-causal.

Results

Average	hellaswag	arc_challenge	truthful_qa (mc2)	MMLU (acc)
0.57221	0.81617	0.58874	0.38275	0.5012
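
The reported average appears to be the unweighted mean of the four task scores. The short check below recomputes it from the numbers in the results row; treating the average as a plain mean is an assumption of this sketch.

```python
# Scores copied from the results row; the plain-mean interpretation is an assumption.
scores = {
    "hellaswag": 0.81617,
    "arc_challenge": 0.58874,
    "truthful_qa (mc2)": 0.38275,
    "MMLU (acc)": 0.5012,
}
reported_average = 0.57221

computed_average = sum(scores.values()) / len(scores)
print(f"computed={computed_average:.5f} reported={reported_average:.5f}")
```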

💻 Usage examples

How to use MistralLite from Python code (HuggingFace transformers)

⚠️ Important

For an end-to-end example Jupyter notebook, see this link.

Install the necessary packages

Requires transformers 4.34.0 or later, flash-attn 2.3.1.post1 or later, and accelerate 0.23.0 or later.

pip install transformers==4.34.0
pip install flash-attn==2.3.1.post1 --no-build-isolation
pip install accelerate==0.23.0

You can then try the following example code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
import torch

model_id = "amazon/MistralLite"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.bfloat16,
                                             use_flash_attention_2=True,
                                             device_map="auto",)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
prompt = "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"

sequences = pipeline(
    prompt,
    max_new_tokens=400,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"{seq['generated_text']}")

⚠️ Important

Use the following prompt template for MistralLite:

<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>

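How to serve MistralLite on TGI

⚠️ Important

For an end-to-end example Jupyter notebook using the native TGI container, see this link.

If the input context length is greater than 12K tokens, we recommend using a custom TGI container. See this link for details.

Start the TGI server

Use TGI version 1.1.0 or later. The official Docker container is ghcr.io/huggingface/text-generation-inference:1.1.0. Example Docker parameters: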

docker run -d --gpus all --shm-size 1g -p 443:80 -v $(pwd)/models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 \
      --model-id amazon/MistralLite \
      --max-input-length 16000 \
      --max-total-tokens 16384 \
      --max-batch-prefill-tokens 16384 \
      --trust-remote-code
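
With these settings, the room left for generated tokens depends on how long the input is, because TGI validates the input length plus max_new_tokens against --max-total-tokens. The sketch below works through that arithmetic under that understanding of TGI's validation; the function name is hypothetical and the defaults mirror the flags above.

```python
def remaining_generation_budget(input_tokens: int,
                                max_input_length: int = 16000,
                                max_total_tokens: int = 16384) -> int:
    """Largest max_new_tokens that still fits within the server's total-token limit."""
    if input_tokens > max_input_length:
        raise ValueError(
            f"input of {input_tokens} tokens exceeds max-input-length {max_input_length}"
        )
    return max_total_tokens - input_tokens


# A 13400-token context leaves plenty of room for a 400-token answer.
print(remaining_generation_budget(13400))
# An input at the 16000-token limit leaves less than the 400 new tokens used in the examples.
print(remaining_generation_budget(16000))
```

In practice this suggests lowering max_new_tokens, or raising --max-total-tokens if memory allows, when inputs approach the input limit.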

Perform inference

Example Python code for inference with TGI (requires text_generation 0.6.1 or later):

pip install text_generation==0.6.1

from text_generation import Client

SERVER_PORT = 443
SERVER_HOST = "localhost"
SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
tgi_client = Client(f"http://{SERVER_URL}", timeout=60)

def invoke_tgi(prompt,
               random_seed=1,
               max_new_tokens=400,
               print_stream=True,
               assist_role=True):
    # Wrap the raw question in the MistralLite prompt template unless the caller opts out.
    if assist_role:
        prompt = f"<|prompter|>{prompt}</s><|assistant|>"
    output = ""
    for response in tgi_client.generate_stream(
        prompt,
        do_sample=False,
        max_new_tokens=max_new_tokens,
        return_full_text=False,
        #temperature=None,
        #truncate=None,
        #seed=random_seed,
        #typical_p=0.2,
    ):
        if hasattr(response, "token"):
            if not response.token.special:
                snippet = response.token.text
                output += snippet
                if (print_stream):
                    print(snippet, end='', flush=True)
    return output

prompt = "What are the main challenges to support a long context for LLM?"
result = invoke_tgi(prompt)

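⚠️ Important

The first time you run inference with MistralLite, it may need a brief warm-up period of around 10 seconds. Subsequent inferences should be faster and return results more promptly. This warm-up is normal and does not affect overall system performance once initialization is complete.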
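
When measuring latency, it can help to send one throwaway request first and exclude it from the statistics. The sketch below shows one way to do that with any inference callable; `time_calls` and `fake_infer` are hypothetical names, and `fake_infer` is only a stand-in so the example runs without a server.

```python
import time


def time_calls(infer, prompt, repeats=3):
    """Call infer(prompt) several times and return each call's latency in seconds."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        infer(prompt)
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_infer(prompt):
    """Stand-in for a real inference call; replace with your own client function."""
    time.sleep(0.01)
    return "ok"


latencies = time_calls(fake_infer, "hello", repeats=4)
warm_latencies = latencies[1:]  # drop the first call, which may include warm-up cost
print(f"first call: {latencies[0]:.3f}s, "
      f"later calls avg: {sum(warm_latencies) / len(warm_latencies):.3f}s")
```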

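How to deploy MistralLite on Amazon SageMaker

⚠️ Important

For an end-to-end example Jupyter notebook using the SageMaker built-in container, see this link.

If the input context length is greater than 12K tokens, we recommend using a custom Docker container. See this link for details.

Install the necessary packages

Requires sagemaker 2.192.1 or later.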

pip install sagemaker==2.192.1

Deploy the model as a SageMaker endpoint

To deploy MistralLite on a SageMaker endpoint, follow the example code below.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

image_uri = get_huggingface_llm_image_uri(
    backend="huggingface",  # or lmi
    region=region,
    version="1.1.0",
)

model_name = "MistralLite-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    'HF_MODEL_ID':'amazon/MistralLite',
    'HF_TASK':'text-generation',
    'SM_NUM_GPUS':'1',
    "MAX_INPUT_LENGTH": '16000',
    "MAX_TOTAL_TOKENS": '16384',
    "MAX_BATCH_PREFILL_TOKENS": '16384',
    "MAX_BATCH_TOTAL_TOKENS":  '16384',
}

model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,
    image_uri=image_uri
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name=model_name,
)
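
The container environment expects every value as a string, and the input limit has to stay below the total-token limit. A small builder can enforce both before deployment. This is a sketch under those assumptions: `build_tgi_env` is a hypothetical helper, and reusing the total-token limit for both batch settings simply mirrors the values in the example above.

```python
def build_tgi_env(model_id: str,
                  max_input_length: int,
                  max_total_tokens: int,
                  num_gpus: int = 1) -> dict[str, str]:
    """Build the container environment, converting numeric limits to strings."""
    if max_input_length >= max_total_tokens:
        raise ValueError("max_input_length must be smaller than max_total_tokens")
    return {
        "HF_MODEL_ID": model_id,
        "HF_TASK": "text-generation",
        "SM_NUM_GPUS": str(num_gpus),
        "MAX_INPUT_LENGTH": str(max_input_length),
        "MAX_TOTAL_TOKENS": str(max_total_tokens),
        # Mirrors the example above, where both batch limits equal the total-token limit.
        "MAX_BATCH_PREFILL_TOKENS": str(max_total_tokens),
        "MAX_BATCH_TOTAL_TOKENS": str(max_total_tokens),
    }


env = build_tgi_env("amazon/MistralLite", max_input_length=16000, max_total_tokens=16384)
print(env)
```

The resulting dictionary can be passed as the env argument when constructing the model object.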

Perform inference

To call the endpoint, follow the example code below.

input_data = {
  "inputs": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
  "parameters": {
    "do_sample": False,
    "max_new_tokens": 400,
    "return_full_text": False,
    #"typical_p": 0.2,
    #"temperature":None,
    #"truncate":None,
    #"seed": 1,
  }
}
result = predictor.predict(input_data)[0]["generated_text"]
print(result)

Alternatively, you can call the endpoint through boto3 with example code like the following.

import boto3
import json
def call_endpoint(client, prompt, endpoint_name, parameters):
    client = boto3.client("sagemaker-runtime")
    payload = {"inputs": prompt,
               "parameters": parameters}
    response = client.invo