Phi-4-mini-instruct-8da4w: An Open-Source Language Model Suited to Free Mobile Deployment


Phi 4 Mini Instruct 8da4w

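Developed by PyTorch

Phi-4-mini-instruct-8da4w is a quantized language model produced by the PyTorch team from microsoft/Phi-4-mini-instruct. It uses 8-bit embeddings, 8-bit dynamic activations, and 4-bit weights in the linear layers (8da4w), which makes it suitable for mobile deployment.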

Large Language Model

Transformers

Open-source license: MIT #MobileQuantization #8da4wQuantization #ConversationalAI

Downloads: 780

Released: 4/7/2025

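Model Overview

Phi-4-mini is a lightweight natural language processing model suited to tasks such as code generation, mathematical reasoning, and chat.

Model Features

Efficient Quantization

The model uses 8-bit embeddings, 8-bit dynamic activations, and 4-bit weights in the linear layers (8da4w), which significantly reduces model size and memory usage.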
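To make the "4-bit weights" part of 8da4w more concrete, here is a minimal pure-Python sketch of symmetric per-group 4-bit quantization. It is a simplified illustration only, not torchao's implementation; the function names are hypothetical, and the group size of 32 is taken from the quantization config shown later in this document.

```python
def quantize_int4_per_group(weights, group_size=32):
    """Symmetric 4-bit quantization with one scale per group of weights."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude to 7; fall back to 1.0 for an all-zero group.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        # Clamp to the signed 4-bit range [-8, 7].
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_int4_per_group(codes, scales, group_size=32):
    """Recover approximate float weights from 4-bit codes and group scales."""
    return [code * scales[i // group_size] for i, code in enumerate(codes)]


weights = [0.01 * (i - 32) for i in range(64)]  # toy data, two groups of 32
codes, scales = quantize_int4_per_group(weights)
restored = dequantize_int4_per_group(codes, scales)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(scales)} groups, max round-trip error {max_error:.4f}")
```

Each weight is stored as a small integer plus a shared scale, so the round-trip error per weight is bounded by half of that group's scale.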

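Mobile Deployment

Runs on mobile devices through ExecuTorch, which suits resource-constrained environments.

High-Performance Inference

On an iPhone 15 Pro, the model runs at 17.3 tokens per second and uses 3206 MB of memory.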
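As a rough worked example, the quoted decoding rate can be turned into an estimated response time. This sketch assumes a steady 17.3 tokens per second and ignores prompt processing and model load time, so real latency will differ; the function name is hypothetical.

```python
TOKENS_PER_SECOND = 17.3  # iPhone 15 Pro figure quoted above


def estimated_generation_seconds(num_tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Approximate time to decode num_tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


for n in (32, 128, 512):
    print(f"{n} tokens: about {estimated_generation_seconds(n):.1f} s")
```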

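Model Capabilities

Text generation

Code generation

Mathematical reasoning

Chat

Use Cases

Natural Language Processing

Chatbots

Build efficient chatbots that support multi-turn conversations.

Fast response times make the model a good fit for mobile applications.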
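A multi-turn conversation is typically represented as a growing list of role/content messages, the same shape that the quantization example in this document passes to `tokenizer.apply_chat_template`. The sketch below shows one way to maintain that history; the helper name `add_turn` and the set of allowed roles are assumptions made for illustration.

```python
def add_turn(messages, role, content):
    """Return a new history with one more message appended."""
    if role not in ("system", "user", "assistant"):
        raise ValueError(f"unexpected role: {role}")
    return messages + [{"role": role, "content": content}]


history = [{"role": "system", "content": ""}]
history = add_turn(history, "user", "What is 12 * 7?")
history = add_turn(history, "assistant", "12 * 7 = 84.")
history = add_turn(history, "user", "Now divide that by 4.")
print(len(history), "messages; last role:", history[-1]["role"])
```

The full history is re-templated and sent to the model on every turn, so long conversations consume more of the context window.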

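Code Assistance

Helps developers generate code snippets or solve programming problems.

Supports multiple programming languages with relatively high output quality.

Education

Math Tutoring

Answers math problems or suggests solution approaches.

Scores 61.71 on GSM8K (5-shot), compared with 81.88 for the unquantized baseline.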

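🚀 Phi-4-mini-instruct Quantized Model

This quantized model is based on microsoft/Phi-4-mini-instruct and was quantized by the PyTorch team using torchao. It uses 8-bit embeddings, 8-bit dynamic activations, and 4-bit weights in the linear layers (8da4w), and is intended for mobile deployment with ExecuTorch. We provide a quantized pte file that can be used directly in ExecuTorch.

✨ Key Features

Quantization: torchao applies 8-bit embedding quantization plus 8-bit dynamic activation and 4-bit weight quantization for linear layers (8da4w), reducing the model's memory footprint.
Mobile deployment: runs with ExecuTorch on mobile devices such as the iPhone 15 Pro.
Multilingual support: supports text generation in multiple languages.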

📦 Installation

First, install the required packages:

pip install git+https://github.com/huggingface/transformers@main
pip install torchao
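After installation, a quick check can confirm that both packages are importable in the current environment. This is a small sketch using only the standard library; the helper name `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["transformers", "torchao"])
print("missing:", missing or "none")
```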

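💻 Usage Examples

Basic Usage

The following shows how to run the model in a mobile app:

# Download the pte file (use resolve/, not blob/, to get the raw file rather than the HTML page)
wget https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/resolve/main/phi4-mini-8da4w.pte

# Instructions for running on iOS
https://pytorch.org/executorch/main/llm/llama-demo-ios.html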
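Hugging Face serves a file's HTML viewer page under `/blob/` and the raw file under `/resolve/`, so a scripted download needs the latter. The sketch below builds such a URL; the helper name `hf_file_url` is hypothetical, and the `huggingface_hub` library offers its own utilities for the same purpose.

```python
def hf_file_url(repo_id, filename, revision="main"):
    """Build a direct-download URL for a file in a Hugging Face model repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


print(hf_file_url("pytorch/Phi-4-mini-instruct-8da4w", "phi4-mini-8da4w.pte"))
```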

Advanced Usage

The detailed steps for quantizing the model are as follows:

Untie the Embedding Weights

from transformers import (
  AutoModelForCausalLM,
  AutoProcessor,
  AutoTokenizer,
)
import torch

model_id = "microsoft/Phi-4-mini-instruct"
untied_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(untied_model)
from transformers.modeling_utils import find_tied_parameters
print("tied weights:", find_tied_parameters(untied_model))
if getattr(untied_model.config.get_text_config(decoder=True), "tie_word_embeddings"):
    setattr(untied_model.config.get_text_config(decoder=True), "tie_word_embeddings", False)

untied_model._tied_weights_keys = []
untied_model.lm_head.weight = torch.nn.Parameter(untied_model.lm_head.weight.clone())

print("tied weights:", find_tied_parameters(untied_model))

USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-untied-weights"

untied_model.push_to_hub(save_to)
tokenizer.push_to_hub(save_to)

# or save locally
save_to_local_path = f"{MODEL_NAME}-untied-weights"
untied_model.save_pretrained(save_to_local_path)
tokenizer.save_pretrained(save_to_local_path)

Quantize the Model

from transformers import (
  AutoModelForCausalLM,
  AutoProcessor,
  AutoTokenizer,
  TorchAoConfig,
)
from torchao.quantization.quant_api import (
    IntxWeightOnlyConfig,
    Int8DynamicActivationIntxWeightConfig,
    AOPerModuleConfig,
    quantize_,
)
from torchao.quantization.granularity import PerGroup, PerAxis
import torch

# we start from the model with untied weights
model_id = "microsoft/Phi-4-mini-instruct"
USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
untied_model_id = f"{USER_ID}/{MODEL_NAME}-untied-weights"
untied_model_local_path = f"{MODEL_NAME}-untied-weights"

embedding_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int8,
    granularity=PerAxis(0),
)
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    weight_scale_dtype=torch.bfloat16,
)
quant_config = AOPerModuleConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True, modules_to_not_convert=[])

# either use `untied_model_id` or `untied_model_local_path`
quantized_model = AutoModelForCausalLM.from_pretrained(untied_model_id, torch_dtype=torch.float32, device_map="auto", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Push to hub
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-untied-8da4w"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(prompt):])
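The `linear_config` above pairs 4-bit weights with one bfloat16 scale per group of 32 weights. As a rough worked calculation, the storage cost per linear weight can be estimated as below. This assumes only a scale is stored per group (no zero point) and ignores packing or alignment overhead; the function name is hypothetical.

```python
def effective_bits_per_weight(weight_bits, scale_bits, group_size):
    """Storage per weight when each group of weights shares one scale."""
    return weight_bits + scale_bits / group_size


linear_bits = effective_bits_per_weight(weight_bits=4, scale_bits=16, group_size=32)
print(linear_bits)  # 4.5
```

Smaller groups track the weight distribution more closely but raise this overhead, which is the trade-off behind choosing a group size.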

📚 Detailed Documentation

Model Quality Evaluation

We use lm-evaluation-harness to evaluate the quality of the quantized model.

Install lm-eval

git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

Benchmarks

# Baseline model
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8

# 8da4w quantized model
lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-8da4w --tasks hellaswag --device cuda:0 --batch_size 8
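When comparing both models across several tasks, it can help to generate the two command lines from one place so they differ only in the model ID. The sketch below only builds and prints the commands using the flags shown above; the helper name is hypothetical, and passing several tasks as a comma-separated list is the usual lm_eval convention.

```python
import shlex


def lm_eval_command(model_id, tasks, device="cuda:0", batch_size=8):
    """Assemble an lm_eval invocation as an argument list."""
    return [
        "lm_eval", "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(tasks),
        "--device", device,
        "--batch_size", str(batch_size),
    ]


for model_id in ("microsoft/Phi-4-mini-instruct", "pytorch/Phi-4-mini-instruct-8da4w"):
    print(shlex.join(lm_eval_command(model_id, ["hellaswag"])))
```

The returned list can be passed directly to `subprocess.run` if the evaluation should be launched from Python.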

Evaluation Results

Benchmark	Phi-4-mini-ins	Phi-4-mini-instruct-8da4w
Popular aggregated benchmarks
mmlu (0-shot)	66.73	60.75
mmlu_pro (5-shot)	46.43	11.75
Reasoning
arc_challenge	56.91	48.46
gpqa_main_zeroshot	30.13	30.80
hellaswag	54.57	50.35
openbookqa	33.00	30.40
piqa (0-shot)	77.64	74.43
siqa	49.59	44.98
truthfulqa_mc2 (0-shot)	48.39	51.35
winogrande (0-shot)	71.11	70.32
Multilingual
mgsm_en_cot_en	60.80	57.60
Math
gsm8k (5-shot)	81.88	61.71
mathqa (0-shot)	42.31	36.95
Overall	55.35	48.45
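The overall row matches the unweighted mean of the 13 benchmark scores. The short check below reproduces it from the table values and also reports which benchmark loses the most after quantization.

```python
# (baseline, 8da4w) scores copied from the table above
scores = {
    "mmlu": (66.73, 60.75),
    "mmlu_pro": (46.43, 11.75),
    "arc_challenge": (56.91, 48.46),
    "gpqa_main_zeroshot": (30.13, 30.80),
    "hellaswag": (54.57, 50.35),
    "openbookqa": (33.00, 30.40),
    "piqa": (77.64, 74.43),
    "siqa": (49.59, 44.98),
    "truthfulqa_mc2": (48.39, 51.35),
    "winogrande": (71.11, 70.32),
    "mgsm_en_cot_en": (60.80, 57.60),
    "gsm8k": (81.88, 61.71),
    "mathqa": (42.31, 36.95),
}

baseline_avg = sum(b for b, _ in scores.values()) / len(scores)
quantized_avg = sum(q for _, q in scores.values()) / len(scores)
largest_drop = max(scores, key=lambda name: scores[name][0] - scores[name][1])

print(f"baseline {baseline_avg:.2f}, 8da4w {quantized_avg:.2f}")
print("largest drop:", largest_drop)
```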

Export to ExecuTorch

We can run the quantized model on mobile devices using ExecuTorch.

Convert the Checkpoint

python -m executorch.examples.models.phi_4_mini.convert_weights pytorch_model.bin pytorch_model_converted.bin

Export to pte Format

PARAMS="executorch/examples/models/phi_4_mini/config.json"
python -m executorch.examples.models.llama.export_llama \
  --model "phi_4_mini" \
  --checkpoint "pytorch_model_converted.bin" \
  --params "$PARAMS" \
  -kv \
  --use_sdpa_with_kv_cache \
  -X \
  --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}' \
  --max_seq_length 128 \
  --max_context_length 128 \
  --output_name="phi4-mini-8da4w.pte"
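The `--metadata` argument is a JSON string, which is easy to break with shell quoting when edited by hand. One option, sketched below, is to generate it from a Python dictionary and let `shlex.quote` handle the quoting. The token IDs are the ones from the command above; nothing else about the export tool is assumed.

```python
import json
import shlex

# BOS/EOS token IDs copied from the export command above
metadata = {"get_bos_id": 199999, "get_eos_ids": [200020, 199999]}

# Compact separators keep the string free of spaces
metadata_arg = json.dumps(metadata, separators=(",", ":"))
print("--metadata", shlex.quote(metadata_arg))
```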