Phi-4-multimodal-instruct: Open-Source Multimodal Model - Generates Text from Multiple Input Types


Phi 4 Multimodal Instruct

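Developed by Robeeeeeeeeeee

Phi-4-multimodal-instruct is a lightweight open-source multimodal foundation model that builds on the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. It accepts text, image, and audio inputs, generates text output, and has a context length of 128K tokens.

Multimodal fusion

Transformers

Multilingual support
License: MIT
#MultimodalInstructionUnderstanding #128KLongContext #SpeechVisionTextTrimodal

Downloads: 21

Release date: 2/28/2025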

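Model Overview

The model went through an enhancement process that combines supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback (RLHF), aimed at precise instruction following and safety. It is intended for broad commercial and research use and supports multilingual and multimodal tasks.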

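Model Features

Multimodal support

Accepts text, image, and audio inputs together and generates text output, enabling cross-modal understanding and interaction.

Long-context processing

A context length of 128K tokens lets the model handle long documents and complex conversations.

Multilingual capability

Supports text processing in 23 languages and audio processing in 8 languages.

Lightweight design

The optimized architecture suits memory- or compute-constrained environments and low-latency scenarios.

Reinforcement learning optimization

Model behavior is refined through supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback (RLHF).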

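Model Capabilities

Text generation

Image understanding

Speech recognition

Speech translation

Speech summarization

Visual question answering

Optical character recognition

Chart and table understanding

Multi-image comparison

Video clip summarization

Audio understanding

Function and tool calling

Math and logical reasoning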

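Use Cases

Speech processing

Speech recognition

Converts speech to text in multiple languages.

Reported word error rate as low as 6.14%, ranking first on the Hugging Face OpenASR leaderboard.

Speech translation

Translates speech in one language into text in another language in real time.

Reported to outperform WhisperV3 and SeamlessM4T-v2-Large.

Speech summarization

Extracts the key information from spoken content and produces a summary.

Reported performance approaches GPT-4o.

Visual understanding

Visual question answering

Answers questions about the content of an image.

Scores 68.9 on the AI2D benchmark, close to Gemini-2.0-Flash.

Math problem solving

Solves complex math problems from visual input.

Demonstrates image processing combined with equation solving.

Intelligent assistant

Travel planning

Helps plan travel routes by analyzing spoken input.

Demonstrates audio processing combined with recommendation.

Content creation

Generates stories or other content from multimodal input.

Creative generation is demonstrated in the "Stories Come Alive" demo.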

🚀 Phi-4-multimodal-instruct

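Phi-4-multimodal-instruct is a lightweight open-source multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. It processes text, image, and audio inputs, generates text output, and has a 128K-token context length. The model also went through an enhancement process that combines supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback (RLHF) to support precise instruction following and safety measures.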

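🚀 Quick Start

Phi-4-multimodal-instruct is intended for broad multilingual and multimodal commercial and research use. The basic steps for using the model are:

Environment setup: install the required Python packages, such as transformers and torch.
Model loading: download the model weights from Hugging Face and load them with the transformers library.
Input processing: prepare the text, image, or audio input that the task requires.
Inference: run the loaded model on the input to generate text output.

The following example loads the model locally and runs inference on an image and on an audio clip: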

import io
from urllib.request import urlopen

import requests
import soundfile as sf
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",  # places the weights on the GPU, so no extra .cuda() call is needed
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation='flash_attention_2',  # requires the flash_attn package
)

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

# Part 2: Audio Processing
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open audio file
audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))

# Process with the model
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
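The audio prompt above asks the model to separate the transcript and the French translation with `<sep>`. A minimal sketch of splitting such a response follows. It assumes the model follows the instruction and that the literal `<sep>` text survives decoding; the helper name `split_transcript_and_translation` is hypothetical and not part of any library.

```python
def split_transcript_and_translation(response: str, sep: str = "<sep>"):
    """Split a model response into (transcript, translation).

    If the separator is missing, the whole response is treated as the
    transcript and the translation is returned as an empty string.
    """
    transcript, found, translation = response.partition(sep)
    if not found:
        return response.strip(), ""
    return transcript.strip(), translation.strip()


# Hypothetical response text, used only to illustrate the split.
example = "Hello, world. <sep> Bonjour, le monde."
print(split_transcript_and_translation(example))
# ('Hello, world.', 'Bonjour, le monde.')
```

Returning an empty translation instead of raising keeps the caller in control when the model ignores the separator instruction.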

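✨ Key Features

Multimodal support: processes text, image, and audio inputs and generates text output.
Multilingual capability: supports multiple languages, including Arabic, Chinese, Czech, and Danish.
Reasoning: performs well on math and logical reasoning and supports function and tool calling.
Visual understanding: general image understanding, optical character recognition, and chart and table understanding.
Speech processing: speech recognition, translation, question answering, and summarization.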

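📦 Installation Guide

Installing dependencies

The Phi-4 family is integrated into transformers version 4.48.2. Use the following command to check the installed transformers version:

pip list | grep transformers

Example of the required Python packages: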

flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.3.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.13.2
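The pinned versions above can also be checked from Python instead of `pip list`. The sketch below uses the standard-library `importlib.metadata` module; the helper names `parse_pins` and `check_pins` are hypothetical. It assumes that a local build tag such as `+cu124` should be ignored when comparing versions.

```python
from importlib import metadata

# Pins copied from the package list above.
PINNED = """
flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.3.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.13.2
"""


def parse_pins(text):
    """Parse 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {name: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            # Drop a local build tag such as '+cu124' before comparing.
            installed = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in check_pins(parse_pins(PINNED)).items():
    print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty output means every listed package is installed at the pinned version.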

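Model download

The Phi-4-multimodal-instruct weights can be downloaded from Hugging Face:

from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "microsoft/Phi-4-multimodal-instruct"
# Pass trust_remote_code=True, as in the quick-start example.
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)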

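💻 Usage Examples

Basic usage

A simple text generation example:

from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_path = "microsoft/Phi-4-multimodal-instruct"
# Pass trust_remote_code=True, as in the quick-start example.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    trust_remote_code=True,
)
generation_config = GenerationConfig.from_pretrained(model_path)

prompt = "<|user|>How to explain Internet for a medieval knight?<|end|><|assistant|>"
# Keep the inputs on the same device as the model.
inputs = processor(text=prompt, return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
# Drop the prompt tokens so only the newly generated text is decoded.
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)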

Advanced usage

An example that processes image and audio inputs:

import io
from urllib.request import urlopen

import requests
import soundfile as sf
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_path = "microsoft/Phi-4-multimodal-instruct"
# Pass trust_remote_code=True, as in the quick-start example.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    trust_remote_code=True,
)
generation_config = GenerationConfig.from_pretrained(model_path)

# Image processing
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
image = Image.open(requests.get(image_url, stream=True).raw)
image_prompt = '<|user|><|image_1|>What is shown in this image?<|end|><|assistant|>'
# Keep the inputs on the same device as the model.
image_inputs = processor(text=image_prompt, images=image, return_tensors='pt').to(model.device)
image_generate_ids = model.generate(
    **image_inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
# Drop the prompt tokens so only the generated answer is decoded.
image_generate_ids = image_generate_ids[:, image_inputs['input_ids'].shape[1]:]
image_response = processor.batch_decode(
    image_generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print("Image response:", image_response)

# Audio processing
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))
audio_prompt = '<|user|><|audio_1|>Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation.<|end|><|assistant|>'
audio_inputs = processor(text=audio_prompt, audios=[(audio, samplerate)], return_tensors='pt').to(model.device)
audio_generate_ids = model.generate(
    **audio_inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
# Drop the prompt tokens so only the generated answer is decoded.
audio_generate_ids = audio_generate_ids[:, audio_inputs['input_ids'].shape[1]:]
audio_response = processor.batch_decode(
    audio_generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print("Audio response:", audio_response)

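📚 Detailed Documentation

Input formats

Phi-4-multimodal-instruct accepts text, image, and audio inputs. Each input type has its own prompt format, shown below.

Text chat format

Used for general conversation and instructions: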

<|system|>You are a helpful assistant.<|end|><|user|>How to explain Internet for a medieval knight?<|end|><|assistant|>
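The chat format above can be assembled from a list of turns. The following is a minimal sketch; the helper name `build_chat_prompt` is hypothetical, and closing earlier assistant turns with `<|end|>` in multi-turn conversations is an assumption extrapolated from the single-turn example.

```python
def build_chat_prompt(messages):
    """Build a prompt string from (role, content) pairs.

    Each turn is wrapped as <|role|>content<|end|>, and a trailing
    <|assistant|> tag asks the model to produce the next turn.
    """
    allowed = {"system", "user", "assistant"}
    parts = []
    for role, content in messages:
        if role not in allowed:
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<|{role}|>{content}<|end|>")
    return "".join(parts) + "<|assistant|>"


prompt = build_chat_prompt([
    ("system", "You are a helpful assistant."),
    ("user", "How to explain Internet for a medieval knight?"),
])
print(prompt)
```

With these two turns the function reproduces the example prompt shown above.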

Tool-enabled function calling format

Used when the user wants the model to produce function calls based on the tools provided:

<|system|>You are a helpful assistant with some tools.<|tool|>[{"name": "get_weather_updates", "description": "Fetches weather updates for a given city using the RapidAPI Weather API.", "parameters": {"city": {"description": "The name of the city for which to retrieve weather information.", "type": "str", "default": "London"}}}]<|/tool|><|end|><|user|>What is the weather like in Paris today?<|end|><|assistant|>
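Writing the tool list by hand is error-prone, so it may help to serialize it with `json.dumps`. The sketch below rebuilds the example prompt above from a Python data structure; the helper name `build_tool_prompt` is hypothetical, and the tool schema is copied from the example rather than taken from any formal specification.

```python
import json


def build_tool_prompt(system_text, tools, user_text):
    """Embed a JSON tool list in the system turn, then add the user turn."""
    tool_block = f"<|tool|>{json.dumps(tools)}<|/tool|>"
    return (
        f"<|system|>{system_text}{tool_block}<|end|>"
        f"<|user|>{user_text}<|end|><|assistant|>"
    )


# Tool definition copied from the example prompt above.
tools = [{
    "name": "get_weather_updates",
    "description": "Fetches weather updates for a given city using the RapidAPI Weather API.",
    "parameters": {
        "city": {
            "description": "The name of the city for which to retrieve weather information.",
            "type": "str",
            "default": "London",
        }
    },
}]

prompt = build_tool_prompt(
    "You are a helpful assistant with some tools.",
    tools,
    "What is the weather like in Paris today?",
)
print(prompt)
```

The model's reply still has to be parsed and the named function executed by the calling application; that step is outside this sketch.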

Vision-language format

Used for conversations that include images:

<|user|><|image_1|>Describe the image in detail.<|end|><|assistant|>

Speech-language format

Used for speech and audio tasks:

<|user|><|audio_1|>{task prompt}<|end|><|assistant|>

Vision-speech format

Used for conversations that include both images and audio:

<|user|><|image_1|><|audio_1|><|end|><|assistant|>
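The three multimodal formats above differ only in which placeholder tags precede the text. A minimal sketch of one helper covering all of them follows. The name `build_multimodal_prompt` is hypothetical, and numbering additional inputs as `<|image_2|>`, `<|audio_2|>`, and so on is an assumption extrapolated from the `_1` suffix in the examples.

```python
def build_multimodal_prompt(text="", num_images=0, num_audios=0):
    """Build a single-turn user prompt with image and audio placeholders.

    Placeholders are numbered from 1 and placed before the text, with
    image tags ahead of audio tags, matching the examples above.
    """
    if num_images < 0 or num_audios < 0:
        raise ValueError("counts must be non-negative")
    image_tags = "".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    audio_tags = "".join(f"<|audio_{i}|>" for i in range(1, num_audios + 1))
    return f"<|user|>{image_tags}{audio_tags}{text}<|end|><|assistant|>"


print(build_multimodal_prompt("Describe the image in detail.", num_images=1))
print(build_multimodal_prompt("Transcribe the audio to text.", num_audios=1))
print(build_multimodal_prompt(num_images=1, num_audios=1))
```

The number of placeholders must match the number of images and audio clips passed to the processor.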

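Input Requirements

Images

Format: common RGB or grayscale image formats such as .jpg and .png.
Resolution: depends on GPU memory. Training supported up to 64 crops; for a square image this corresponds to roughly (8×448) by (8×448), about 3584 × 3584 pixels. With multi-image input, up to 64 frames are supported, but the resolution of each frame must be lowered as the frame count grows so that the input fits in memory.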
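Because the usable resolution depends on available GPU memory, one option is to downscale large images before passing them to the processor. The sketch below only computes the target size while preserving aspect ratio. The helper name `fit_within` and the 1024-pixel limit in the example are hypothetical choices, not values recommended by the model authors.

```python
def fit_within(width, height, max_side):
    """Return (new_width, new_height) scaled down so the longest side
    is at most max_side, preserving the aspect ratio.

    Images that already fit are returned unchanged.
    """
    if max_side <= 0:
        raise ValueError("max_side must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 2000, 1024))  # (1024, 512)
print(fit_within(800, 600, 1024))    # (800, 600)
```

With Pillow, the result can be applied as `image.resize(fit_within(*image.size, max_side))` before calling the processor.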

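Audio

Format: any audio format that the soundfile package can load.
Length: for best performance, keep audio to at most 40 seconds. For summarization tasks, the recommended maximum is 30 minutes.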
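For recordings longer than the recommended 40 seconds, one simple approach is to split the sample array into consecutive chunks and process each one separately. The sketch below computes the chunk boundaries; the helper name `chunk_bounds` is hypothetical. This is a naive fixed-length split, so a boundary may fall in the middle of a word and affect transcription quality at that point.

```python
def chunk_bounds(num_samples, samplerate, max_seconds=40.0):
    """Return (start, end) sample index pairs covering the whole clip,
    each spanning at most max_seconds of audio."""
    if samplerate <= 0:
        raise ValueError("samplerate must be positive")
    chunk = int(max_seconds * samplerate)
    if chunk <= 0:
        raise ValueError("max_seconds is too small for this samplerate")
    return [
        (start, min(start + chunk, num_samples))
        for start in range(0, num_samples, chunk)
    ]


# A 95-second clip at 16 kHz splits into 40 s + 40 s + 15 s.
print(chunk_bounds(95 * 16000, 16000))
# [(0, 640000), (640000, 1280000), (1280000, 1520000)]
```

Given `audio, samplerate = sf.read(...)`, each piece is `audio[start:end]` and can be passed to the processor as `(audio[start:end], samplerate)`.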

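Model loading

The model can be loaded locally with the transformers library:

from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "microsoft/Phi-4-multimodal-instruct"
# Pass trust_remote_code=True, as in the quick-start example.
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)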