
Phi 4 Mini Instruct Float8dq

Developed by pytorch

Phi-4-mini-instruct quantized with torchao using float8 dynamic activation and float8 weight quantization. On H100 it cuts VRAM use by 36% and runs 15-20% faster, with almost no impact on accuracy.

Large Language Model

Transformers

Open-source license: MIT #float8 quantization #efficient inference #multilingual dialogue

Downloads: 1,006

Release date: 4/8/2025

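Model Overview

A quantized version of Microsoft Phi-4-mini-instruct, suited to text generation tasks, with support for multilingual interaction and mathematical reasoning.

Model Features

Efficient quantization

Uses float8 dynamic activation and float8 weight quantization, cutting VRAM usage by 36%

Performance optimization

15-20% faster inference on H100

Multi-task support

Supports code generation, mathematical reasoning, and dialogue tasks

Accuracy retention

Quantization costs very little accuracy: the overall benchmark score drops by only 0.24 points (55.35 to 55.11)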

Model Capabilities

Text generation

Math problem solving

Code generation

Multilingual dialogue

Logical reasoning

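Use Cases

Education support

Solving math problems

Helps students understand how to solve algebraic equations

Correctly solves equations such as 2x + 3 = 7

Creative generation

Recipe suggestions

Generates creative recipes for fruit combinations

Offers concrete suggestions such as a banana and dragonfruit smoothie

Technical Q&A

Programming assistance

Explains code logic and generates code snippets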

library_name: transformers
tags:
- torchao
- phi
- phi4
- nlp
- code
- math
- chat
- conversational
license: mit
language:
- multilingual
base_model:
- microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation

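Phi4-mini model quantized by the PyTorch team with torchao float8 dynamic activation and float8 weight quantization (per-row granularity). Use it directly or serve it with vLLM for a 36% VRAM reduction and a 15-20% speedup on H100, with little to no impact on accuracy.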
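To make "per-row granularity" concrete, here is a minimal pure-Python sketch of the idea: each row of a matrix gets its own scale, chosen so that the row's largest magnitude maps to the largest finite float8 value. It assumes the e4m3 float8 format (maximum 448) and approximates float8 rounding by keeping 4 significant bits, ignoring subnormals. The helper names (`round_to_e4m3`, `quantize_per_row`, `dequantize_per_row`) are illustrative and are not torchao APIs; the real kernels run on GPU tensors.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def round_to_e4m3(v: float) -> float:
    """Approximate rounding to e4m3 precision (3 mantissa bits); ignores subnormals."""
    if v == 0.0:
        return 0.0
    v = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))  # saturate to the float8 range
    m, e = math.frexp(v)  # v = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 16) / 16, e)  # keep 4 significant bits


def quantize_per_row(matrix):
    """Return (quantized_rows, scales) with one scale per row."""
    q_rows, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row) or 1.0  # avoid a zero scale for all-zero rows
        scale = amax / FP8_E4M3_MAX
        q_rows.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_per_row(q_rows, scales):
    """Multiply each quantized row by its scale to recover approximate values."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Two rows with very different magnitudes: each keeps its own dynamic range.
weights = [[0.02, -0.5, 0.25], [100.0, -3.0, 40.0]]
q, s = quantize_per_row(weights)
restored = dequantize_per_row(q, s)
for original_row, restored_row in zip(weights, restored):
    print(original_row, "->", [round(v, 4) for v in restored_row])
```

"Dynamic" activation quantization applies the same scaling to activations at inference time, computing the scales from the tensor actually being processed rather than from calibration data.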

Inference with vLLM

Install vllm nightly to get some recent changes:

pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install torchao

Code example

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)


if __name__ == '__main__':
    # Create an LLM.
    llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq")
    # Generate texts from the prompts.
    # The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt:    {prompt!r}")
        print(f"Output:    {generated_text!r}")
        print("-" * 60)

Note: run this code with VLLM_DISABLE_COMPILE_CACHE=1 to disable the compile cache, for example VLLM_DISABLE_COMPILE_CACHE=1 python example.py. This works around a compile compatibility issue between vLLM and torchao that is expected to be resolved in pytorch 2.8.
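If the script is launched from another Python process rather than from a shell, the same variable can be passed through the child environment. The following is only a sketch: `env_without_compile_cache` and `run_script` are hypothetical helper names, and `example.py` is the script name used in the note above.

```python
import os
import subprocess
import sys


def env_without_compile_cache() -> dict:
    """Copy the current environment and disable vLLM's compile cache in the copy."""
    env = dict(os.environ)
    env["VLLM_DISABLE_COMPILE_CACHE"] = "1"
    return env


def run_script(path: str) -> int:
    """Run a Python script with the compile cache disabled; return its exit code."""
    result = subprocess.run([sys.executable, path], env=env_without_compile_cache())
    return result.returncode


env = env_without_compile_cache()
print("VLLM_DISABLE_COMPILE_CACHE =", env["VLLM_DISABLE_COMPILE_CACHE"])
# Usage (requires example.py and a working vLLM install):
# exit_code = run_script("example.py")
```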

Serving

The model can then be served with the following command:

vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
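Once the server is running, it can be queried over HTTP. The sketch below uses only the standard library and assumes vLLM's OpenAI-compatible chat endpoint on its default address (`http://localhost:8000/v1`); adjust `BASE_URL` if the server was started with a different host or port. `build_chat_request` and `chat` are illustrative helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port
MODEL = "pytorch/Phi-4-mini-instruct-float8dq"


def build_chat_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is the capital of France?")
print(request.get_method(), request.full_url)
# With the server running: print(chat("What is the capital of France?"))
```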

Inference with Transformers

Install the required packages:

pip install git+https://github.com/huggingface/transformers@main
pip install torchao
pip install torch
pip install accelerate

Example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
 
torch.random.manual_seed(0)

model_path = "pytorch/Phi-4-mini-instruct-float8dq"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
 
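messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]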
 
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
 
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}
 
output = pipe(messages, **generation_args)
print(output[0]['generated_text'])

Quantization Recipe

Install the required packages:

pip install git+https://github.com/huggingface/transformers@main
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
pip install torch
pip install accelerate

Use the following code to produce the quantized model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "microsoft/Phi-4-mini-instruct"

from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Push to hub
USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-float8dq"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(prompt):])

Note: to use push_to_hub, first run

pip install -U "huggingface_hub[cli]"
huggingface-cli login

and log in with a token that has write access, created at https://huggingface.co/settings/tokens

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model. lm-eval must be installed from source: https://github.com/EleutherAI/lm-evaluation-harness#install

Baseline

lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8

float8 dynamic activation and float8 weight quantization (float8dq)

lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8

Benchmark	Phi-4-mini-instruct	Phi-4-mini-instruct-float8dq
Popular aggregated benchmarks
mmlu (0-shot)	66.73	66.61
mmlu_pro (5-shot)	46.43	44.58
Reasoning
arc_challenge (0-shot)	56.91	56.66
gpqa_main_zeroshot	30.13	29.46
HellaSwag	54.57	54.55
openbookqa	33.00	33.60
piqa (0-shot)	77.64	77.48
social_iqa	49.59	49.28
truthfulqa_mc2 (0-shot)	48.39	48.09
winogrande (0-shot)	71.11	72.77
Multilingual
mgsm_en_cot_en	60.8	60.0
Math
gsm8k (5-shot)	81.88	80.89
mathqa (0-shot)	42.31	42.51
Overall	55.35	55.11
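The overall row appears to be the unweighted mean of the 13 benchmark scores. The short check below recomputes it from the table values; the dictionary layout and variable names are illustrative only.

```python
# (baseline, float8dq) scores copied from the table above.
scores = {
    "mmlu": (66.73, 66.61),
    "mmlu_pro": (46.43, 44.58),
    "arc_challenge": (56.91, 56.66),
    "gpqa_main_zeroshot": (30.13, 29.46),
    "hellaswag": (54.57, 54.55),
    "openbookqa": (33.00, 33.60),
    "piqa": (77.64, 77.48),
    "social_iqa": (49.59, 49.28),
    "truthfulqa_mc2": (48.39, 48.09),
    "winogrande": (71.11, 72.77),
    "mgsm_en_cot_en": (60.8, 60.0),
    "gsm8k": (81.88, 80.89),
    "mathqa": (42.31, 42.51),
}

# Unweighted mean over all benchmarks for each model.
baseline_avg = sum(b for b, _ in scores.values()) / len(scores)
quant_avg = sum(q for _, q in scores.values()) / len(scores)
drop_points = baseline_avg - quant_avg

print(f"baseline overall: {baseline_avg:.2f}")
print(f"float8dq overall: {quant_avg:.2f}")
print(f"drop: {drop_points:.2f} points ({drop_points / baseline_avg * 100:.2f}% relative)")
```

The drop is a difference in score points, not a relative percentage. Three benchmarks (openbookqa, winogrande, mathqa) score slightly higher after quantization, so per-task differences of this size are probably within evaluation noise.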

Peak Memory Usage

Results

Benchmark	Phi-4-mini-instruct	Phi-4-mini-instruct-float8dq
Peak memory (GB)	8.91	5.70 (36% reduction)

Benchmarking peak memory

The following code measures peak memory usage during inference:
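The 36% figure follows directly from the two measurements. As a quick check (the function name is illustrative):

```python
def percent_reduction(baseline: float, quantized: float) -> float:
    """Relative reduction of `quantized` versus `baseline`, in percent."""
    return (baseline - quantized) / baseline * 100


reduction = percent_reduction(8.91, 5.70)  # peak memory in GB, from the table
print(f"peak memory reduction: {reduction:.0f}%")
```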

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

# use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-float8dq"
model_id = "pytorch/Phi-4-mini-instruct-float8dq"
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

torch.cuda.reset_peak_memory_stats()

prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(prompt):])

# Peak memory reserved by the CUDA caching allocator, in GB (1e9 bytes)
mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")

Model Performance

Results (H100 machine)

Benchmark	Phi-4-mini-instruct	Phi-4-mini-instruct-float8dq
latency (batch_size=1)	1.64s	1.41s (16% speedup)
latency (batch_size=128)	3.1s	2.72s (14% speedup)
serving (num_prompts=1)	1.35 req/s	1.57 req/s (16% speedup)
serving (num_prompts=1000)	66.68 req/s	80.53 req/s (21% speedup)

Latency (benchmark_latency) results are in seconds; serving (benchmark_serving) results are in requests per second.
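The speedup percentages in the table are consistent with a ratio-based definition: baseline time divided by quantized time for latency, and quantized throughput divided by baseline throughput for serving. The sketch below recomputes them under that assumption; the function names are illustrative.

```python
def latency_speedup(baseline_s: float, quantized_s: float) -> float:
    """Speedup in percent for a lower-is-better metric (seconds)."""
    return (baseline_s / quantized_s - 1) * 100


def throughput_speedup(baseline_rps: float, quantized_rps: float) -> float:
    """Speedup in percent for a higher-is-better metric (requests per second)."""
    return (quantized_rps / baseline_rps - 1) * 100


speedups = {
    "latency (batch_size=1)": latency_speedup(1.64, 1.41),
    "latency (batch_size=128)": latency_speedup(3.1, 2.72),
    "serving (num_prompts=1)": throughput_speedup(1.35, 1.57),
    "serving (num_prompts=1000)": throughput_speedup(66.68, 80.53),
}
for name, pct in speedups.items():
    print(f"{name}: {pct:.0f}% speedup")
```

Under this definition a 16% speedup at batch size 1 corresponds to about 14% less wall-clock time per request (1.64s to 1.41s), so the two ways of quoting the gain should not be mixed.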

Setup

Get the vllm source code:

git clone git@github.com:vllm-project/vllm.git

Install vllm (run from the root of the cloned repository):

VLLM_USE_PRECOMPILED=1 pip install --editable .

Run the benchmarks from the vllm root folder:

benchmark_latency

Baseline

python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1

float8dq

VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1

benchmark_serving

We benchmarked throughput in a serving environment.

Download the sharegpt dataset:

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

Other datasets can be found at https://github.com/vllm-project/vllm/tree/main/benchmarks

Note: the --num-prompts argument of the benchmark_serving script sets the number of prompts to benchmark.

Baseline

Server:

vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3

Client:

python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1

float8dq

Server:

VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3

Client:

python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1