Open Llama 3b V2 Wizard Evol Instuct V2 196k AWQ

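Quantized by TheBloke

This model is based on the Open Llama 3B V2 architecture and was trained on the WizardLM_evol_instruct_V2_196k dataset, which makes it suited to instruction-following tasks.

Large language model

Transformers

English

Open-source license: Apache-2.0

#InstructionFineTuning #SmallParameterEfficiency #EnglishDialogue

Downloads: 64

Released: 11/29/2023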

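Model overview

This is an instruction-following model built on the Open Llama 3B V2 architecture and fine-tuned on the WizardLM Evol-Instruct dataset.

Model features

Instruction tuning

Trained on the WizardLM Evol-Instruct dataset to improve instruction following.

Efficient inference

The 3B-parameter scale allows fast inference while preserving performance.

Open license

Released under the Apache 2.0 license, which permits commercial and research use.

Model capabilities

Text generation

Instruction understanding and execution

Dialogue systems

Question answering systems

Use cases

Dialogue systems

Intelligent assistant

Building conversational assistants that can understand complex instructions

Education

Teaching support

Generating teaching materials and answering students' questions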

🚀 Open Llama 3B V2 Wizard Evol Instuct V2 196K - AWQ

This model is an AWQ-quantized version of Open Llama 3B V2 Wizard Evol Instuct V2 196K. AWQ quantization allows fast inference while keeping accuracy high.

🚀 Quick start

This section gives an overview of the model and explains how to use it.

Model information

Attribute	Details
Base model	harborwater/open-llama-3b-v2-wizard-evol-instuct-v2-196k
Datasets	WizardLM/WizardLM_evol_instruct_V2_196k
Inference	false
Language	en
Library name	transformers
License	apache-2.0
Model creator	L
Model name	Open Llama 3B V2 Wizard Evol Instuct V2 196K
Model type	llama
Prompt template	### HUMAN: {prompt} ### RESPONSE:
Quantized by	TheBloke

Description

This repository contains AWQ model files for L's Open Llama 3B V2 Wizard Evol Instuct V2 196K. These files were quantized using hardware provided by Massed Compute.

About AWQ

AWQ is an efficient, accurate, and fast low-bit weight quantization method that currently supports 4-bit quantization. Compared with GPTQ, it offers faster Transformers-based inference at a quality equal to or better than the most commonly used GPTQ settings.

AWQ is supported by:

Text Generation Webui - using Loader: AutoAWQ
vLLM - Llama and Mistral models only
Hugging Face Text Generation Inference (TGI)
Transformers version 4.35.0 and later, from any code or client that supports Transformers
AutoAWQ - for use from Python code

Repositories available

Prompt template: Human-Response

### HUMAN:
{prompt}

### RESPONSE:
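
A small helper can keep this template consistent across calls. The sketch below is only an illustration: `build_prompt` and `PROMPT_TEMPLATE` are hypothetical names, not part of any library, and the code assumes the model expects exactly the layout shown above, with a newline after `### RESPONSE:`.

```python
# Hypothetical helper; the layout mirrors the Human-Response template above.
PROMPT_TEMPLATE = "### HUMAN:\n{prompt}\n\n### RESPONSE:\n"


def build_prompt(user_message: str) -> str:
    """Wrap a user message in the Human-Response template."""
    # Strip surrounding whitespace so the template alone controls the layout.
    return PROMPT_TEMPLATE.format(prompt=user_message.strip())


if __name__ == "__main__":
    print(build_prompt("Tell me about AI"))
```

Using a plain string with `.format()` rather than an f-string means the template can be defined once and filled in later for each prompt.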

Provided files and AWQ parameters

Currently, only 128g GEMM models are released. Adding group size 32 models and GEMV kernel models is being actively considered.

Models are released as sharded safetensors files.

Branch	Bits	Group size	AWQ dataset	Sequence length	Size
main	4	64	VMware Open Instruct	2048	2.15 GB
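
To make the Bits and Group size columns concrete, the following sketch shows plain group-wise 4-bit round-to-nearest quantization. It is not the AWQ algorithm itself (AWQ additionally rescales weights using activation statistics before quantizing) and it is not AutoAWQ code. All function names are hypothetical, and the demo uses a group size of 4 instead of 64 purely for readability.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (integer codes, scale, group minimum)


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Min-max round-to-nearest quantization of one group of weights."""
    levels = (1 << bits) - 1  # 15 distinct steps above zero for 4 bits
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0.0:
        scale = 1.0  # constant group: any scale reproduces it exactly
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes: List[int], scale: float, w_min: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + w_min for c in codes]


def quantize(weights: List[float], group_size: int, bits: int = 4) -> List[Group]:
    """Split a flat weight list into groups, each with its own scale and minimum."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


if __name__ == "__main__":
    demo = [0.12, -0.40, 0.33, 0.05, 1.50, -1.20, 0.80, 0.10]
    for codes, scale, w_min in quantize(demo, group_size=4):
        print(codes, [round(v, 3) for v in dequantize_group(codes, scale, w_min)])
```

A smaller group size gives each scale fewer weights to cover, which reduces rounding error at the cost of storing more per-group metadata.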

📦 Installation

How to easily download and use this model in text-generation-webui

Make sure you are using the latest version of text-generation-webui.

Using the text-generation-webui one-click installers is strongly recommended unless you are sure you know how to do a manual install.

1. Click the Model tab.
2. Under Download custom model or LoRA, enter TheBloke/open-llama-3b-v2-wizard-evol-instuct-v2-196k-AWQ.
3. Click Download.
4. The model will start downloading. Once it has finished, it will say "Done".
5. In the top left, click the refresh icon next to Model.
6. In the Model dropdown, choose the model you just downloaded: open-llama-3b-v2-wizard-evol-instuct-v2-196k-AWQ
7. Select Loader: AutoAWQ.
8. Click Load. The model will load and be ready for use.
9. If you want any custom settings, set them, then click Save settings for this model followed by Reload the Model in the top right.
10. Once you're ready, click the Text Generation tab and enter a prompt to get started!

Installing the packages needed for inference from Python code

Requires Transformers 4.35.0 or later.
Requires AutoAWQ 0.1.6 or later.

pip3 install --upgrade "autoawq>=0.1.6" "transformers>=4.35.0"

Note that if you are using PyTorch 2.0.1, the above AutoAWQ command will automatically upgrade you to PyTorch 2.1.0.

If you are using CUDA 11.8 and want to keep using PyTorch 2.0.1, run this command instead:

pip3 install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl

If you have problems installing AutoAWQ from the pre-built wheels, install it from source instead:

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
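
After installing, it can be useful to confirm that the environment meets the minimum versions stated above. The following is a minimal sketch using only the standard library; `version_tuple` and `check_requirements` are hypothetical helpers, and the version comparison is deliberately simplified (it ignores pre-release suffixes, which a dedicated version parser would handle more carefully).

```python
from importlib import metadata
from typing import Dict, Tuple

# Minimum versions taken from the install notes above.
REQUIRED = {"transformers": "4.35.0", "autoawq": "0.1.6"}


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '0.1.6+cu118' or '4.35.0.dev0' into a tuple of leading integers."""
    public = version.split("+", 1)[0]  # drop local build tags such as +cu118
    parts = []
    for piece in public.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component (e.g. 'dev0')
        parts.append(int(digits))
    return tuple(parts)


def check_requirements(required: Dict[str, str]) -> Dict[str, str]:
    """Return 'ok', 'too old (<installed>)', or 'missing' for each package."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = "ok" if ok else f"too old ({installed})"
    return report


if __name__ == "__main__":
    for package, status in check_requirements(REQUIRED).items():
        print(f"{package}: {status}")
```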

💻 Usage examples

Multi-user inference server: vLLM

See the vLLM documentation for installation and usage instructions.

Make sure you are using vLLM version 0.2 or later.
When using vLLM as a server, pass the --quantization awq parameter.

For example:

python3 -m vllm.entrypoints.api_server --model TheBloke/open-llama-3b-v2-wizard-evol-instuct-v2-196k-AWQ --quantization awq --dtype auto
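
Once the server is running, a client can send it prompts over HTTP. The sketch below uses only the standard library and rests on two assumptions that are not stated in this card: that the server listens on `localhost:8000`, and that this entrypoint accepts a JSON POST on a `/generate` route with a `prompt` field plus sampling parameters. Check both against the vLLM version in use. `build_payload` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: default host/port and the /generate route of the demo API server.
API_URL = "http://localhost:8000/generate"


def build_payload(user_message: str, max_tokens: int = 128) -> dict:
    """Build the JSON body, wrapping the message in the Human-Response template."""
    prompt = f"### HUMAN:\n{user_message}\n\n### RESPONSE:\n"
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.8,
        "top_p": 0.95,
    }


def generate(user_message: str, url: str = API_URL) -> dict:
    """POST the prompt to the server and return the decoded JSON response."""
    data = json.dumps(build_payload(user_message)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server up, calling `generate("Tell me about AI")` should return the server's JSON body; the exact response fields depend on the vLLM version.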

When using vLLM from Python code, again set quantization="awq".

For example:

from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]
# Plain (non-f) string: {prompt} is filled in below by .format()
prompt_template = '''### HUMAN:
{prompt}

### RESPONSE:
'''

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/open-llama-3b-v2-wizard-evol-instuct-v2-196k-AWQ", quantization="awq", dtype="auto")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Multi-user inference server: Hugging Face Text Generation Inference (TGI)

Use TGI version 1.1.0 or later. The official Docker container is ghcr.io/huggingface/text-generation-inference:1.1.0

Example Docker parameters:

--model-id TheBloke/open-llama-3b-v2-wizard-evol-instuct-v2-196k-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
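
The two length limits in these parameters fit together arithmetically: --max-total-tokens bounds the prompt plus the generated tokens, so a prompt at the 3696-token input limit leaves 4096 - 3696 = 400 tokens for the response. A minimal sketch of that bookkeeping follows; the function name is hypothetical, and token counts are assumed to be measured with the model's own tokenizer.

```python
def max_new_tokens_budget(
    prompt_tokens: int,
    max_input_length: int = 3696,
    max_total_tokens: int = 4096,
) -> int:
    """Return how many new tokens a request may ask for under the given limits."""
    if prompt_tokens > max_input_length:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, limit is {max_input_length}"
        )
    # The total budget covers prompt and generated tokens together.
    return max_total_tokens - prompt_tokens


if __name__ == "__main__":
    print(max_new_tokens_budget(3696))
    print(max_new_tokens_budget(100))
```

A client can use this value as an upper bound for `max_new_tokens` so that requests are not rejected for exceeding the total budget.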

Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later):

pip3 install huggingface-hub

from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
prompt_template=f'''### HUMAN:
{prompt}

### RESPONSE:
'''

client = InferenceClient(endpoint_url)
# Send the templated prompt, not the raw user message
response = client.text_generation(prompt_template,
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print("Model output:", response)

Inference from Python code using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name_or_path = "TheBloke/open-llama-3b-v2-wizard-evol-instuct-v2-196k-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    low_cpu_mem_usage=True,
    device_map="cuda:0"
)

# Using the text streamer to stream output one token at a time
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "Tell me about AI"
prompt_template=f'''### HUMAN:
{prompt}

### RESPONSE:
'''

# Convert prompt to tokens
tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

generation_params = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1
}

# Generate streamed output, visible one token at a time
generation_output = model.generate(
    tokens,
    streamer=streamer,
    **generation_params
)

# Generation without a streamer, which will include the prompt in the output
generation_output = model.generate(
    tokens,
    **generation_params
)

# Get the tokens from the output, decode them, print them
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("model.generate output: ", text_output)

# Inference is also possible via Transformers' pipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_params
)

pipe_output = pipe(prompt_template)[0]['generated_text']
print("pipeline output: ", pipe_output)
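
Because generation without a streamer returns the prompt together with the completion, the decoded text usually needs trimming before it is shown to a user. The helper below is a hypothetical sketch, not part of Transformers: it assumes the decoded text follows the Human-Response template and keeps only what comes after the first `### RESPONSE:` marker, cutting off any further `### HUMAN:` turn the model may invent.

```python
RESPONSE_MARKER = "### RESPONSE:"
HUMAN_MARKER = "### HUMAN:"


def extract_response(decoded_text: str) -> str:
    """Return only the model's answer from a decoded prompt-plus-completion string."""
    _, found, answer = decoded_text.partition(RESPONSE_MARKER)
    if not found:
        # No marker present: return the text unchanged apart from whitespace.
        return decoded_text.strip()
    # Drop any follow-up turn the model may have generated on its own.
    answer = answer.split(HUMAN_MARKER, 1)[0]
    return answer.strip()


if __name__ == "__main__":
    sample = "### HUMAN:\nTell me about AI\n\n### RESPONSE:\nAI is a broad field.\n"
    print(extract_response(sample))
```

Passing `skip_special_tokens=True` to `tokenizer.decode` before calling this helper keeps tokenizer control tokens out of the result.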

🔧 Technical details

Compatibility

The files provided are tested to work with:

text-generation-webui using Loader: AutoAWQ
vLLM version 0.2.0 and later
Hugging Face Text Generation Inference (TGI) version 1.1.0 and later
Transformers version 4.35.0 and later
AutoAWQ version 0.1.1 and later

📄 License

This model is provided under the apache-2.0 license.

Original model card: L's Open Llama 3B V2 Wizard Evol Instuct V2 196K

Trained for 1 epoch on the WizardLM_evol_instruct_v2_196k dataset.

Link to GGUF formats.

Prompt template:

### HUMAN:
{prompt}

### RESPONSE:
<leave a newline for the model to answer>
