Mixtral-8x7B-Instruct-v0.1 open-source AI model - outperforms Llama 2 70B on most benchmarks and is well suited to practical use

Mixtral 8x7B Instruct V0.1

Developed by mistralai

Mixtral-8x7B is a pretrained generative sparse mixture-of-experts model that outperforms Llama 2 70B on most benchmarks.

Large language model

Transformers

Multilingual | Open-source license: Apache-2.0 | #sparse-mixture-of-experts #multilingual-instruction-fine-tuning #long-context-inference

Downloads: 505.97k

Release date: 12/10/2023

Model overview

A high-performance multilingual large language model that supports instruction following and generation tasks.

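Model features

Sparse mixture-of-experts architecture

Combines eight 7B-parameter experts in a mixture architecture and activates only a subset of them at inference time, which keeps computation efficient.

Multilingual support

Native support for five major European languages: French, Italian, German, Spanish, and English.

High performance

Outperforms the Llama 2 70B model on most benchmarks.

Instruction tuning

Specifically optimized for instruction following, which makes it suitable for building dialogue systems and assistant applications.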
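
The sparse routing described under the architecture feature above can be illustrated with a small sketch. It is not the Mixtral implementation: the choice of two active experts per token (k=2), the scalar toy experts, and the names softmax, route_top_k, and moe_layer are all assumptions made for illustration. The text above only states that a subset of the experts is activated.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts.

    k=2 is an assumption for this sketch, not a value taken from the text.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Normalize only over the selected experts so their weights sum to 1.
    weights = softmax([router_logits[i] for i in top])
    return dict(zip(top, weights))


def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts; the other experts are never called."""
    routing = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in routing.items())


# Eight toy "experts": each one just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in range(8)]
# Hypothetical router scores for a single token.
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.3, 1.0]

print(sorted(route_top_k(logits)))     # indices of the two selected experts
print(moe_layer(2.0, experts, logits))
```

Because only the selected experts are evaluated, the compute per token in this sketch grows with k rather than with the total number of experts.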

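Model capabilities

Multilingual text generation

Dialogue system construction

Instruction understanding and execution

Knowledge question answering

Content creation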

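Use cases

Dialogue systems

Intelligent assistant

Build a multilingual intelligent assistant that understands and responds to user instructions.

Supports natural, fluent multi-turn dialogue.

Content generation

Multilingual content creation

Generate marketing copy, articles, and other content in multiple languages.

Produces high-quality text that follows each language's conventions.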

🚀 Model card for Mixtral-8x7B

Mixtral-8x7B is a pretrained generative large language model (LLM) built as a sparse mixture of experts. It outperforms Llama 2 70B on most benchmarks.

🚀 Quick start

🔍 Model information

Attribute	Details
Model type	Large language model (LLM)
Base model	mistralai/Mixtral-8x7B-v0.1
Supported languages	fr, it, de, es, en
License	apache-2.0

💻 Usage examples

Basic usage

Tokenization with mistral-common:

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
 
mistral_models_path = "MISTRAL_MODELS_PATH"
 
tokenizer = MistralTokenizer.v1()
 
completion_request = ChatCompletionRequest(messages=[UserMessage(content="Explain Machine Learning to me in a nutshell.")])
 
tokens = tokenizer.encode_chat_completion(completion_request).tokens

Advanced usage

Inference with mistral_inference, reusing mistral_models_path, tokenizer, and tokens from the tokenization example:

from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
 
model = Transformer.from_folder(mistral_models_path)
out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)

result = tokenizer.decode(out_tokens[0])

print(result)

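Alternatively, inference with Hugging Face transformers, reusing tokenizer and tokens from the tokenization example:

import torch
from transformers import AutoModelForCausalLM
 
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
model.to("cuda")
 
# generate() expects a batched tensor of token ids, not a plain Python list
input_ids = torch.tensor([tokens], device="cuda")
generated_ids = model.generate(input_ids, max_new_tokens=1000, do_sample=True)

# decode with mistral tokenizer
result = tokenizer.decode(generated_ids[0].tolist())
print(result)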

💡 Usage tip: PRs to correct the transformers tokenizer so that it gives 1-to-1 the same results as the mistral-common reference implementation are very welcome!

📚 Documentation

⚠️ Notes

The model in this repository is based on the original Mixtral torrent release, but the file format and parameter names are different. At present, the model cannot be instantiated with HF.

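📝 Instruction format

The model uses the instruction format below. This format must be followed strictly; otherwise the model generates sub-optimal outputs.

<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]

<s> and </s> are the special beginning-of-string (BOS) and end-of-string (EOS) tokens, while [INST] and [/INST] are regular strings.

The pseudocode used to tokenize instructions during fine-tuning is as follows: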

def tokenize(text):
    return tok.encode(text, add_special_tokens=False)

[BOS_ID] + 
tokenize("[INST]") + tokenize(USER_MESSAGE_1) + tokenize("[/INST]") +
tokenize(BOT_MESSAGE_1) + [EOS_ID] +
…
tokenize("[INST]") + tokenize(USER_MESSAGE_N) + tokenize("[/INST]") +
tokenize(BOT_MESSAGE_N) + [EOS_ID]

In the pseudocode above, the tokenize method must not add BOS or EOS tokens automatically, but it should add a prefix space.
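
As a runnable counterpart to the pseudocode above, the sketch below assembles the id sequence for a multi-turn conversation. ToyTokenizer is a hypothetical whitespace tokenizer and the BOS_ID and EOS_ID values are placeholders, both made up for illustration. A real tokenizer would produce different ids and would handle the prefix space mentioned above, which this toy version does not model.

```python
BOS_ID, EOS_ID = 1, 2  # placeholder ids, assumed for illustration only


class ToyTokenizer:
    """Hypothetical whitespace tokenizer that assigns ids on first sight."""

    def __init__(self):
        self.vocab = {}

    def encode(self, text, add_special_tokens=True):
        # Ids 0-2 are reserved, so regular tokens start at 3.
        ids = [self.vocab.setdefault(w, len(self.vocab) + 3) for w in text.split()]
        return [BOS_ID] + ids if add_special_tokens else ids


def build_training_ids(tok, turns):
    """turns is a list of (user_message, bot_message) pairs."""

    def tokenize(text):
        # Never let the tokenizer add BOS/EOS itself; they are placed by hand.
        return tok.encode(text, add_special_tokens=False)

    ids = [BOS_ID]  # a single BOS at the very start
    for user_message, bot_message in turns:
        ids += tokenize("[INST]") + tokenize(user_message) + tokenize("[/INST]")
        ids += tokenize(bot_message) + [EOS_ID]  # every bot answer ends with EOS
    return ids


tok = ToyTokenizer()
ids = build_training_ids(tok, [("Hi there", "Hello"), ("Bye", "Goodbye")])
print(ids)  # -> [1, 3, 4, 5, 6, 7, 2, 3, 8, 6, 9, 2]
```

The sequence contains exactly one BOS and one EOS per completed assistant turn, which is the structure the pseudocode describes.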

In the Transformers library, chat templates can be used to make sure the correct format is applied.
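
To make the target format concrete, here is a minimal plain-Python sketch of the string such a template is expected to produce. build_prompt is a hypothetical helper, not part of Transformers, and its spacing is copied from the format line shown earlier. The exact whitespace emitted by a real chat template may differ, so tokenizer.apply_chat_template remains the safer choice in practice.

```python
BOS, EOS = "<s>", "</s>"


def build_prompt(messages):
    """Render alternating user/assistant messages into the instruction format.

    messages is a list of {"role": ..., "content": ...} dicts that must start
    with a user turn and strictly alternate between user and assistant.
    """
    parts = [BOS]
    for index, message in enumerate(messages):
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("roles must alternate user/assistant, starting with user")
        if message["role"] == "user":
            parts.append(f" [INST] {message['content']} [/INST]")
        else:
            # Each assistant answer is closed with the EOS token.
            parts.append(f" {message['content']}{EOS}")
    return "".join(parts)


example = [
    {"role": "user", "content": "Instruction"},
    {"role": "assistant", "content": "Model answer"},
    {"role": "user", "content": "Follow-up instruction"},
]
print(build_prompt(example))
# -> <s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]
```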

🚀 Running the model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

outputs = model.generate(inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

💾 Memory optimization

By default, transformers loads the model in full precision. The optimizations available in the HF ecosystem can therefore further reduce the memory required to run the model.
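
As a rough back-of-the-envelope illustration of why lower precision helps, the sketch below estimates the memory needed for the weights alone at several precisions. The parameter count of about 46.7 billion is an assumption not stated in this document, weight_memory_gb is a hypothetical helper, and activations, the KV cache, and quantization overhead are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption for illustration: approximate total parameter count of the model.
ASSUMED_NUM_PARAMS = 46.7e9

for label, bits in [("float32", 32), ("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    estimate = weight_memory_gb(ASSUMED_NUM_PARAMS, bits)
    print(f"{label:>8}: ~{estimate:.1f} GB for weights")
```

Under these assumptions, each halving of the bits per parameter halves the weight footprint, which is why the half-precision and 4-bit options matter on memory-constrained GPUs.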

Loading in half precision

float16 precision works only on GPU devices.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

outputs = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Loading in lower precision (8-bit and 4-bit)

The model can be loaded in lower precision using bitsandbytes.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

outputs = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Loading with Flash Attention 2

The model can be loaded with Flash Attention 2.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id, use_flash_attention_2=True, device_map="auto")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

outputs = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

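📄 Limitations

The Mixtral-8x7B Instruct model is a quick demonstration that the base model can be easily fine-tuned to achieve compelling performance. It does not have any moderation mechanisms. We look forward to engaging with the community on ways to make the model finely respect guardrails, allowing for deployment in environments that require moderated outputs.

📄 License

This model is provided under the Apache 2.0 license.

👥 Team members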

Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Louis Ternon, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.