Mixtral-8x7B-Instruct-v0.1-AWQ open-source model - quantized for fast inference and substantially lower memory requirements

Mixtral 8x7B Instruct V0.1 AWQ

Developed by TheBloke

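Mixtral 8X7B Instruct v0.1 - AWQ is a quantized version of the Mixtral model. It uses the AWQ quantization method to speed up inference and reduce memory requirements while preserving accuracy.

Large Language Model

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Multilingual instruction model #Efficient AWQ quantization #Mixture-of-experts architecture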

Downloads 2,203

Release Time : 12/11/2023

Model Overview

This is an efficiently quantized, multilingual large language model suited to a range of inference scenarios. It supports French, Italian, German, Spanish, and English.

Model Features

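Efficient quantization

Uses the AWQ quantization method with 4-bit quantization. Compared with GPTQ, it offers faster Transformers-based inference with equivalent or better quality.

Multilingual support

Supports French, Italian, German, Spanish, and English.

Broad compatibility

Works with a range of inference tools and frameworks, including text-generation-webui, vLLM, Hugging Face Text Generation Inference (TGI), and Transformers.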

Model Capabilities

Text generation

Multilingual text processing

Instruction following

Use Cases

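Content generation

AI knowledge Q&A

Answers a wide range of questions about artificial intelligence

Generates accurate, detailed AI-related knowledge

Story writing

Generates creative stories from prompts

Produces coherent, creative story content

Education support

Math problem solving

Solves basic math problems

Provides correct calculations and explanations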

🚀 Mixtral 8X7B Instruct v0.1 - AWQ

Mixtral 8X7B Instruct v0.1 - AWQ is the AWQ-quantized version of the Mixtral 8X7B Instruct v0.1 model developed by Mistral AI. AWQ is a fast, accurate low-bit quantization method that lets this model run efficiently in a variety of environments.

🚀 Quick start

This section summarizes the basic information about the Mixtral 8X7B Instruct v0.1 - AWQ model.

Attribute	Details
Model creator	Mistral AI
Model name	Mixtral 8X7B Instruct v0.1
Model type	mixtral
Prompt template	`[INST] {prompt} [/INST]`
Quantized by	TheBloke
Supported languages	fr, it, de, es, en
License	apache-2.0

Chat and support: TheBloke's Discord server

Want to contribute? TheBloke's Patreon page

TheBloke's LLM work is supported by a grant from Andreessen Horowitz (a16z).

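✨ Key features

About AWQ

AWQ is a fast, accurate low-bit weight quantization method that currently supports 4-bit quantization. Compared with GPTQ, it offers faster inference with equivalent or better quality. AWQ models are currently supported on Linux and Windows, with NVIDIA GPUs only. macOS users should use GGUF models instead.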
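To make "low-bit quantization" concrete, the sketch below applies plain group-wise round-to-nearest 4-bit quantization to a short list of weights. It is a simplified illustration and not the AWQ algorithm itself: AWQ additionally uses activation statistics to rescale important weights before quantizing. The function names and the tiny group size of 4 are hypothetical, chosen only to keep the example readable.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    qmax = (1 << bits) - 1                  # 15 distinct steps above zero for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0         # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize(weights, group_size=4, bits=4):
    """Quantize a flat weight list group by group; each group keeps its own scale."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


weights = [0.10, -0.32, 0.75, 0.01, 2.0, 1.5, -1.0, 0.0]
groups = quantize(weights, group_size=4)
restored = [w for group in groups for w in dequantize_group(*group)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {max_error:.4f}")
```

Each group stores one scale and one offset next to its 4-bit codes, so a smaller group size tracks the weights more closely at the cost of extra metadata per weight.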

This model is supported in the following environments:

Text Generation Webui - using Loader: AutoAWQ
vLLM - version 0.2.2 or later, with support for all model types
Hugging Face Text Generation Inference (TGI)
Transformers - version 4.35.0 or later, from any code or client that supports Transformers
AutoAWQ - for use from Python code

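📦 Installation

How to download and use this model in text-generation-webui

1. Make sure you are using the latest version of text-generation-webui. Unless you are confident you can perform a manual install, the text-generation-webui one-click installers are strongly recommended.
2. Click the Model tab.
3. Under Download custom model or LoRA, enter TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ.
4. Click Download.
5. The model will start downloading. Once it has finished, "Done" is displayed.
6. Click the refresh icon next to Model in the top left.
7. In the Model dropdown, choose the model you just downloaded: Mixtral-8x7B-Instruct-v0.1-AWQ.
8. Select Loader: AutoAWQ.
9. Click Load. The model will load and is then ready for use.
10. If you need custom settings, set them, then click Save settings for this model at the top right, followed by Reload the Model.
11. Once you are ready, click the Text Generation tab and enter a prompt to get started.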

Using the model with vLLM

See the vLLM documentation for installation and usage instructions.

Make sure you are using vLLM version 0.2 or later.
When using vLLM as a server, pass the --quantization awq parameter. For example:

python3 -m vllm.entrypoints.api_server --model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ --quantization awq --dtype auto
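Once the server above is running, it can be queried over HTTP. The following is a minimal client sketch using only the Python standard library. It assumes the demo API server's defaults at the time of writing: port 8000, a POST /generate route that takes a JSON body with a "prompt" field plus sampling parameters, and a JSON response with a "text" list. None of these details come from this page, so check them against your installed vLLM version. The helper names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, host="localhost", port=8000, **sampling):
    """Build a POST request for the /generate route (assumed route and port)."""
    payload = {"prompt": f"[INST] {prompt} [/INST]", **sampling}
    return urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **sampling):
    """Send the request and return the generated texts (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **sampling)) as resp:
        return json.loads(resp.read())["text"]  # "text" key is an assumption


# Building the request does not contact the server, so this part runs anywhere.
request = build_generate_request("Tell me about AI", max_tokens=128, temperature=0.8)
print(request.full_url)
print(request.data.decode("utf-8"))

# With the server running:
# print(generate("Tell me about AI", max_tokens=128, temperature=0.8))
```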

When using vLLM from Python code, set quantization="awq" in the same way.

from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]
# Plain (non-f) string: {prompt} is filled in below by str.format()
prompt_template = '''[INST] {prompt} [/INST]
'''

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ", quantization="awq", dtype="auto")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Using the model with Hugging Face Text Generation Inference (TGI)

Use TGI version 1.1.0 or later. The official Docker container is ghcr.io/huggingface/text-generation-inference:1.1.0.

Example Docker parameters:

--model-id TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
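The two length limits above interact: TGI requires that the prompt tokens plus the requested new tokens fit within --max-total-tokens. The sketch below works through that arithmetic for the values shown. The function name is hypothetical, and the token counts are assumed to come from the model's tokenizer.

```python
MAX_INPUT_LENGTH = 3696   # value passed to --max-input-length above
MAX_TOTAL_TOKENS = 4096   # value passed to --max-total-tokens above


def max_new_tokens_allowed(prompt_tokens):
    """Largest max_new_tokens that still fits the server's total-token limit."""
    if prompt_tokens > MAX_INPUT_LENGTH:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, limit is {MAX_INPUT_LENGTH}"
        )
    return MAX_TOTAL_TOKENS - prompt_tokens


print(max_new_tokens_allowed(3696))  # longest allowed prompt leaves 400 tokens
print(max_new_tokens_allowed(100))   # short prompt leaves 3996 tokens
```

With these settings, a prompt at the maximum input length can still generate up to 400 new tokens.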

Example Python code for interacting with TGI (requires huggingface-hub 0.17.0 or later)

pip3 install huggingface-hub

from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
prompt_template=f'''[INST] {prompt} [/INST]
'''

client = InferenceClient(endpoint_url)
response = client.text_generation(
    prompt_template,  # send the templated prompt, not the raw one
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1,
)

print("Model output: ", response)

Inference from Python code using Transformers

Install the necessary packages

Requires Transformers 4.35.0 or later.
Requires AutoAWQ 0.1.6 or later.

pip3 install --upgrade "autoawq>=0.1.6" "transformers>=4.35.0"

Note that if you are using PyTorch 2.0.1, the AutoAWQ command above will automatically upgrade you to PyTorch 2.1.0. If you are using CUDA 11.8 and want to stay on PyTorch 2.0.1, run this command instead:

pip3 install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl

If you have problems installing AutoAWQ from the pre-built wheels, install it from source instead:

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .

Transformers example code (requires Transformers 4.35.0 or later)

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name_or_path = "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    low_cpu_mem_usage=True,
    device_map="cuda:0"
)

# Using the text streamer to stream output one token at a time
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "Tell me about AI"
prompt_template=f'''[INST] {prompt} [/INST]
'''

# Convert prompt to tokens
tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

generation_params = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1
}

# Generate streamed output, visible one token at a time
generation_output = model.generate(
    tokens,
    streamer=streamer,
    **generation_params
)

# Generation without a streamer, which will include the prompt in the output
generation_output = model.generate(
    tokens,
    **generation_params
)

# Get the tokens from the output, decode them, print them
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("model.generate output: ", text_output)

# Inference is also possible via Transformers' pipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_params
)

pipe_output = pipe(prompt_template)[0]['generated_text']
print("pipeline output: ", pipe_output)

💻 Usage examples

Example widget output

Input: [INST] You are a pirate chatbot who always responds with Arr and pirate speak! There's a llama on my lawn, how can I get rid of him? [/INST]
Output: Arr, shiver me timbers! Ye have a llama on yer lawn, ye say? Well, that be a new one for me! Here's what I'd suggest, arr:

1. Firstly, ensure yer safety. Llamas may look gentle, but they can be protective if they feel threatened.
2. Try to make the area less appealing to the llama. Remove any food sources or water that might be attracting it.
3. Contact local animal control or a wildlife rescue organization. They be the experts and can provide humane ways to remove the llama from yer property.
4. If ye have any experience with animals, you could try to gently herd the llama towards a nearby field or open space. But be careful, arr!

Remember, arr, it be important to treat the llama with respect and care. It be a creature just trying to survive, like the rest of us.

Running the model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id)

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Running in half precision

Note: float16 precision only works on GPU devices.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the weights in float16 and move the model to GPU 0
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(0)

text = "Hello my name is"
# The inputs must be on the same device as the model
inputs = tokenizer(text, return_tensors="pt").to(0)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Running in lower precision (8-bit and 4-bit) using bitsandbytes

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize the weights to 4-bit with bitsandbytes while loading
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)

text = "Hello my name is"
# Move the inputs to GPU 0, where the quantized model is placed
inputs = tokenizer(text, return_tensors="pt").to(0)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

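Loading the model with Flash Attention 2

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Flash Attention 2 needs a half-precision model on the GPU, so the weights are
# loaded in float16 and moved to GPU 0 (the flash-attn package must be installed).
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, use_flash_attention_2=True
).to(0)

text = "Hello my name is"
# The inputs must be on the same device as the model
inputs = tokenizer(text, return_tensors="pt").to(0)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))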

📚 Detailed documentation

Prompt template

[INST] {prompt} [/INST]

Provided files and AWQ parameters

Currently only 128g GEMM models are released. Adding group size 32 models and GEMV kernel models is being actively considered. Models are released as sharded safetensors files.

Branch	Bits	GS	AWQ Dataset	Seq Len	Size
main	4	128	VMware Open Instruct	8192	24.65 GB
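As a rough sanity check on the size in the table, the arithmetic below estimates the footprint of 4-bit weights with group size 128. It rests on assumptions that do not come from this page: a total of about 46.7 billion parameters (the figure Mistral AI has published for Mixtral 8x7B), one 16-bit scale and one 4-bit zero point stored per group, and every parameter treated as quantized. The function name is hypothetical.

```python
def estimate_quantized_size_gb(n_params, bits=4, group_size=128,
                               scale_bits=16, zero_bits=4):
    """Estimate weight storage in GB (1e9 bytes) for group-wise quantization."""
    # Each weight costs its own bits plus a share of the per-group metadata.
    bits_per_weight = bits + (scale_bits + zero_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 46.7e9  # assumed total parameter count, not stated on this page

size_gb = estimate_quantized_size_gb(N_PARAMS)
fp16_gb = N_PARAMS * 16 / 8 / 1e9  # the same weights stored in float16

print(f"4-bit, group size 128: {size_gb:.2f} GB")
print(f"float16 baseline:      {fp16_gb:.1f} GB")
```

Under these assumptions the estimate is about 24.26 GB, close to the 24.65 GB listed. The gap is plausibly explained by tensors that stay unquantized, such as embeddings and normalization weights, which this sketch does not model. The float16 baseline of about 93.4 GB shows the scale of the memory saving.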

Compatibility

The provided files have been tested to work with the following:

text-generation-webui using Loader: AutoAWQ.
vLLM version 0.2.0 and later.
Hugging Face Text Generation Inference (TGI) version 1.1.0 and later.
Transformers version 4.35.0 and later.
AutoAWQ version 0.1.1 and later.

🔧 Technical details

Instruction format

This format must be strictly respected; otherwise the model may generate suboptimal output.

The template used to build a prompt for the Instruct model is defined as follows:

<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]

Note that <s> and </s> are special tokens for beginning of string (BOS) and end of string (EOS) while [INST] and [/INST] are regular strings.
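The template can be assembled from a list of conversation turns. The sketch below is one way to do this as a plain string. The names build_prompt and include_bos are hypothetical. Set include_bos=False when the tokenizer or serving stack already adds the BOS token itself, as in the single-turn `[INST] {prompt} [/INST]` examples elsewhere on this page.

```python
def build_prompt(turns, include_bos=True):
    """Build an instruct prompt from (user_message, model_answer) pairs.

    The answer of the last pair may be None, which leaves the prompt open
    for the model to generate the next reply.
    """
    prompt = "<s>" if include_bos else ""
    for user_message, model_answer in turns:
        prompt += f" [INST] {user_message} [/INST]"
        if model_answer is not None:
            # A completed model turn is closed with the EOS marker.
            prompt += f" {model_answer}</s>"
    # Without BOS there is nothing before the first turn, so drop its leading space.
    return prompt if include_bos else prompt.lstrip()


conversation = [
    ("Instruction", "Model answer"),
    ("Follow-up instruction", None),
]
print(build_prompt(conversation))
print(build_prompt([("Tell me about AI", None)], include_bos=False))
```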

The pseudo-code used to tokenize instructions during fine-tuning is as follows:

def tokenize(text):
    return tok.encode(text, add_special_tokens=False)

[BOS_ID] + 
tokenize("[INST]") + tokenize(USER_MESSAGE_1) + tokenize("[/INST]") +
tokenize(BOT_MESSAGE_1) + [EOS_ID] +
…
tokenize("[INST]") + tokenize(USER_MESSAGE_N) + tokenize("[/INST]") +
tokenize(BOT_MESSAGE_N) + [EOS_ID]

In the pseudo-code above, note that the tokenize method should not add a BOS or EOS token automatically, but should add a prefix space.
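The pseudo-code can be turned into a small runnable function. In the sketch below, toy_encode is a stand-in for tok.encode(text, add_special_tokens=False) so that the example runs without downloading a tokenizer. It does not model the prefix space mentioned above. The BOS and EOS IDs of 1 and 2 are assumptions; in practice, read them from the real tokenizer.

```python
BOS_ID, EOS_ID = 1, 2  # assumed special-token IDs, for illustration only


def toy_encode(text):
    """Hypothetical stand-in tokenizer: one ID per character, no special tokens."""
    return [ord(ch) + 3 for ch in text]


def build_token_ids(conversation, encode=toy_encode):
    """Assemble token IDs for (user_message, bot_message) pairs as in the pseudo-code."""
    ids = [BOS_ID]  # a single BOS at the very start of the sequence
    for user_message, bot_message in conversation:
        ids += encode("[INST]") + encode(user_message) + encode("[/INST]")
        ids += encode(bot_message) + [EOS_ID]  # every bot turn ends with EOS
    return ids


conversation = [("Hi", "Hello!"), ("Bye", "Goodbye!")]
ids = build_token_ids(conversation)
print(len(ids), ids[0], ids[-1])
```

Passing a real tokenizer's encode function as the encode argument would replace the toy IDs with actual vocabulary IDs while keeping the same layout.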

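📄 License

This model is provided under the Apache 2.0 license.

Discord

For further support, and for discussions on these models and AI in general, join the TheBloke AI Discord server.

Thanks, and how to contribute

Thanks to the chirper.ai team! Thanks to Clay from gpus.llm-utils.org!

Many people have asked whether they can contribute. I enjoy providing models and helping people, and would love to be able to spend more time doing it, as well as expanding into new AI projects.

If you are able and willing to contribute, it will be most gratefully received and will help me keep providing more models and start work on new AI projects.

Donors get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, and other benefits.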

Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Special thanks to: Aemon Algiz

Patreon special mentions: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, S_X, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim, Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros

Thank you to all my generous patrons and donors! And thank you again to a16z for their generous grant.

The Mistral AI Team

Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard