Phi-4-multimodal-instruct open-source model: text, image, and audio input with text output


Phi 4 Multimodal Instruct

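Developed by mjtechguy

Phi-4-multimodal-instruct is a lightweight open-source multimodal foundation model. It accepts text, image, and audio input, generates text output, and has a context length of 128K tokens.

Multimodal fusion

Transformers

Multilingual support. Open-source license: MIT. Tags: #multimodal-instruct #lightweight-128K-context #speech-vision-text-fusion

Downloads: 18

Release date: 2/28/2025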

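Model Overview

The model draws on the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. An enhancement process that combines supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback (RLHF) supports precise instruction following and safety measures.

Model Features

Multimodal support

Accepts text, image, and audio input and generates text output, with a context length of 128K tokens.

Multilingual support

Handles text, vision, and audio in multiple languages, covering the major world languages.

High performance

Reported to outperform WhisperV3 and SeamlessM4T-v2-Large on automatic speech recognition and speech translation, and to rank first on the Hugging Face OpenASR leaderboard.

Lightweight

Suited to memory- or compute-constrained environments and latency-sensitive scenarios.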

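Model Capabilities

Text generation

Image understanding

Speech recognition

Speech translation

Speech summarization

Visual question answering

Optical character recognition

Chart and table understanding

Multi-image comparison

Summarization of multiple images or video clips

Audio understanding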

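Use Cases

Business applications

Intelligent customer service

Answers customer questions accurately from multimodal input.

Speech translation

Translates speech into multiple languages in real time to support cross-language communication.

Education

Visual math problem solving

Solves complex math problems supplied as images.

Multilingual learning

Assists learning with multilingual text and speech.

Research

Multimodal research

Supports research and development of multimodal models.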

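🚀 Phi-4-multimodal-instruct

Phi-4-multimodal-instruct is a lightweight open-source multimodal foundation model that builds on the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. It processes text, image, and audio input and generates text output, with a context length of 128K tokens. The model went through an enhancement process that combines supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback (RLHF) to support precise instruction following and safety measures.

🚀 Quick Start

The Phi-4 family is integrated into version 4.48.2 of transformers. To check the installed transformers version, run: pip list | grep transformers.

Example of required packages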

flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.3.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.13.2
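
To compare an existing environment against these pins, a short script can help. The following is a minimal sketch and not part of the original model card: the `PINNED` dictionary copies the list above, and `check_pins` is a hypothetical helper name. It relies only on `importlib.metadata` from the Python standard library, and it assumes the distribution names match the names in the list.

```python
from importlib import metadata

# Pins copied from the package list above.
PINNED = {
    "flash_attn": "2.7.4.post1",
    "torch": "2.6.0",
    "transformers": "4.48.2",
    "accelerate": "1.3.0",
    "soundfile": "0.13.1",
    "pillow": "11.1.0",
    "scipy": "1.15.2",
    "torchvision": "0.21.0",
    "backoff": "2.2.1",
    "peft": "0.13.2",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every listed package is installed at the pinned version. Local build suffixes (for example a CUDA tag appended to the torch version) would be reported as mismatches by this exact-match comparison.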

Phi-4-multimodal-instruct is also available in Azure AI Studio.

Loading the model locally

After obtaining the Phi-4-multimodal-instruct model checkpoint, you can run inference with the following sample code.

import io
from urllib.request import urlopen

import requests
import soundfile as sf
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig


# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"

# Load model and processor
# trust_remote_code=True runs custom code shipped in the model repository
# device_map="cuda" places the weights on the GPU at load time
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation='flash_attention_2',
)

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

# Part 2: Audio Processing
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open audio file
audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))

# Process with the model
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

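✨ Key Features

Multimodal processing: handles text, image, and audio input and generates text output.
Long context: a 128K-token context length supports longer conversations and more complex tasks.
Multilingual support: supports many languages, including Arabic, Chinese, Czech, Danish, and others.
Post-training optimization: supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback (RLHF) support precise instruction following and safety measures.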

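📚 Detailed Documentation

Input formats

Given the nature of its training data, Phi-4-multimodal-instruct works best with prompts in the following chat formats:

Text chat format

Use this format for general conversation and instructions: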

<|system|>You are a helpful assistant.<|end|><|user|>How to explain Internet for a medieval knight?<|end|><|assistant|>
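
The text chat format can be assembled with plain string formatting. The sketch below is illustrative only: `build_chat_prompt` is a hypothetical helper, not something provided by the model or by transformers, and it covers a single user turn with an optional system message.

```python
def build_chat_prompt(user_message, system_message=None):
    """Assemble a single-turn prompt from the chat tokens shown above."""
    parts = []
    if system_message is not None:
        parts.append(f"<|system|>{system_message}<|end|>")
    parts.append(f"<|user|>{user_message}<|end|>")
    parts.append("<|assistant|>")  # the model continues after this tag
    return "".join(parts)


prompt = build_chat_prompt(
    "How to explain Internet for a medieval knight?",
    system_message="You are a helpful assistant.",
)
print(prompt)
```

The resulting string is identical to the example prompt above and can be passed as the `text` argument of the processor.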

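Function-calling format with tools

Use this format when the user wants the model to produce function calls based on a given set of tools. Provide the available tools in the system prompt, wrapped in <|tool|> and <|/tool|> tokens. Specify the tools in JSON format, as a JSON dump of the tool definitions. Example: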

<|system|>You are a helpful assistant with some tools.<|tool|>[{"name": "get_weather_updates", "description": "Fetches weather updates for a given city using the RapidAPI Weather API.", "parameters": {"city": {"description": "The name of the city for which to retrieve weather information.", "type": "str", "default": "London"}}}]<|/tool|><|end|><|user|>What is the weather like in Paris today?<|end|><|assistant|>
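
One way to produce the JSON dump is Python's standard `json.dumps`. The sketch below is an assumption about how such a prompt could be built, not code from the model card; `build_tool_prompt` is a hypothetical helper, and the tool definition is copied from the example above.

```python
import json


def build_tool_prompt(system_message, tools, user_message):
    """Wrap a JSON dump of the tool list in <|tool|> ... <|/tool|> tokens."""
    tool_json = json.dumps(tools)
    return (
        f"<|system|>{system_message}<|tool|>{tool_json}<|/tool|><|end|>"
        f"<|user|>{user_message}<|end|><|assistant|>"
    )


# Tool definition copied from the example above.
tools = [
    {
        "name": "get_weather_updates",
        "description": "Fetches weather updates for a given city using the RapidAPI Weather API.",
        "parameters": {
            "city": {
                "description": "The name of the city for which to retrieve weather information.",
                "type": "str",
                "default": "London",
            }
        },
    }
]

prompt = build_tool_prompt(
    "You are a helpful assistant with some tools.",
    tools,
    "What is the weather like in Paris today?",
)
print(prompt)
```

Keeping the tools as Python data and serializing them at the last step avoids hand-written JSON with mismatched quotes or brackets.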

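Vision-language format

Use this format for conversations about an image:

<|user|><|image_1|>Describe the image in detail.<|end|><|assistant|>

For multiple images, insert one image placeholder per image in the prompt, as follows: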

<|user|><|image_1|><|image_2|><|image_3|>Summarize the content of the images.<|end|><|assistant|>
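
The numbered placeholders can be generated rather than typed by hand. This is a minimal sketch under the assumption that placeholders are numbered consecutively from 1, as in the examples above; `build_image_prompt` is a hypothetical helper name.

```python
def build_image_prompt(question, num_images=1):
    """Build a vision-language prompt with one placeholder per image."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    # Placeholders are numbered from 1: <|image_1|>, <|image_2|>, ...
    placeholders = "".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    return f"<|user|>{placeholders}{question}<|end|><|assistant|>"


print(build_image_prompt("Summarize the content of the images.", num_images=3))
```

The number of placeholders should match the number of images passed to the processor.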

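Speech-language format

Use this format for speech and audio tasks:

<|user|><|audio_1|>{task prompt}<|end|><|assistant|>

The task prompt varies by task. Automatic speech recognition:

<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>

Automatic speech translation:

<|user|><|audio_1|>Translate the audio to {lang}.<|end|><|assistant|>

Automatic speech translation with chain of thought: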

<|user|><|audio_1|>Transcribe the audio to text, and then translate the audio to {lang}. Use <sep> as a separator between the original transcript and the translation.<|end|><|assistant|>
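
With the chain-of-thought prompt, the response is expected to contain the transcript and the translation separated by `<sep>`. The sketch below shows one way to fill in `{lang}` and split such a response. Both helper names are hypothetical, and the code assumes the literal string `<sep>` is still present in the decoded text; if it is missing, the whole response is treated as the transcript.

```python
SEPARATOR = "<sep>"


def build_cot_translation_prompt(lang):
    """Fill the target language into the chain-of-thought translation prompt."""
    task = (
        f"Transcribe the audio to text, and then translate the audio to {lang}. "
        f"Use {SEPARATOR} as a separator between the original transcript and the translation."
    )
    return f"<|user|><|audio_1|>{task}<|end|><|assistant|>"


def split_response(response):
    """Return (transcript, translation); translation is None without a separator."""
    transcript, found, translation = response.partition(SEPARATOR)
    if not found:
        return response.strip(), None
    return transcript.strip(), translation.strip()


print(build_cot_translation_prompt("French"))
print(split_response("Hello world. <sep> Bonjour le monde."))
```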

Spoken-query question answering:

<|user|><|audio_1|><|end|><|assistant|>

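Vision-speech format

Use this format for conversations that combine images and audio. The audio may contain a query about the image:

<|user|><|image_1|><|audio_1|><|end|><|assistant|>

For multiple images, insert one image placeholder per image in the prompt, as follows: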

<|user|><|image_1|><|image_2|><|image_3|><|audio_1|><|end|><|assistant|>
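
Because each placeholder stands for one supplied input, a quick check before calling the processor can catch mismatches early. The sketch below is an illustrative validator, not part of the model's API: `count_placeholders` and `check_inputs` are hypothetical helpers that only inspect the prompt string.

```python
import re

# Matches <|image_N|> and <|audio_N|> placeholders.
PLACEHOLDER = re.compile(r"<\|(image|audio)_(\d+)\|>")


def count_placeholders(prompt):
    """Return (image_placeholder_count, audio_placeholder_count)."""
    kinds = [match.group(1) for match in PLACEHOLDER.finditer(prompt)]
    return kinds.count("image"), kinds.count("audio")


def check_inputs(prompt, num_images, num_audios):
    """Raise ValueError if placeholders and supplied inputs disagree."""
    expected = count_placeholders(prompt)
    if expected != (num_images, num_audios):
        raise ValueError(
            f"prompt has {expected[0]} image and {expected[1]} audio placeholders, "
            f"but {num_images} images and {num_audios} audio clips were supplied"
        )


prompt = "<|user|><|image_1|><|image_2|><|image_3|><|audio_1|><|end|><|assistant|>"
check_inputs(prompt, num_images=3, num_audios=1)  # counts agree, nothing is raised
```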

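Vision

Any common RGB or grayscale image format is supported (for example, ".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp").
Resolution depends on GPU memory size. Higher resolutions and more images produce more tokens and therefore use more GPU memory. During training, 64 crops are supported; for a square image, this corresponds to a resolution of approximately 8448 x 8448. For multiple images, up to 64 frames are supported, but as the number of input frames grows, the resolution of each frame must be reduced to fit in memory.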
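
The example extensions and the 64-frame limit above can be turned into a simple pre-flight check. This is a conservative sketch: the extension set copies the examples given here, which the text presents as examples rather than an exhaustive list, and `find_unlisted_extensions` is a hypothetical helper that does not open the files.

```python
from pathlib import Path

# Example extensions copied from the text above (not necessarily exhaustive).
LISTED_EXTENSIONS = {
    ".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp",
}
MAX_FRAMES = 64  # multi-image limit stated above


def find_unlisted_extensions(paths):
    """Return paths whose extension is not among the listed examples."""
    paths = [Path(p) for p in paths]
    if len(paths) > MAX_FRAMES:
        raise ValueError(f"{len(paths)} images supplied, at most {MAX_FRAMES} are supported")
    return [str(p) for p in paths if p.suffix.lower() not in LISTED_EXTENSIONS]


print(find_unlisted_extensions(["a.JPG", "b.gif", "c.webp"]))
```

A file flagged here may still load correctly; the check only reports extensions that the text above does not mention.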

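Audio

Any audio format that the soundfile package can load is supported.
To maintain satisfactory performance, a maximum audio length of 40 seconds is recommended. For summarization tasks, the recommended maximum is 30 minutes.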
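
Clip length can be checked from the sample count and sample rate before running the model. The sketch below is a minimal illustration with hypothetical helper names; the two limits are the recommendations stated above, and the `task` argument values are an assumption of this example.

```python
MAX_SECONDS_DEFAULT = 40           # recommended maximum for most tasks
MAX_SECONDS_SUMMARY = 30 * 60      # recommended maximum for summarization


def audio_duration_seconds(num_samples, samplerate):
    """Duration of a clip given its number of frames and its sample rate."""
    return num_samples / samplerate


def within_recommended_length(num_samples, samplerate, task="default"):
    """True if the clip is no longer than the recommended maximum for the task."""
    limit = MAX_SECONDS_SUMMARY if task == "summarization" else MAX_SECONDS_DEFAULT
    return audio_duration_seconds(num_samples, samplerate) <= limit


# A 60-second clip sampled at 16 kHz.
print(within_recommended_length(16000 * 60, 16000))
print(within_recommended_length(16000 * 60, 16000, task="summarization"))
```

When the audio is loaded with `audio, samplerate = sf.read(path)`, `len(audio)` gives the number of frames to pass as `num_samples`.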

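🔧 Technical Details

Model architecture

Phi-4-multimodal-instruct is a multimodal Transformer model with 5.6 billion parameters. It uses the pretrained Phi-4-Mini-Instruct as its backbone language model and adds advanced vision and speech encoders and adapters.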

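Training data

The training data for Phi-4-multimodal-instruct comes from a variety of sources, totals 5 trillion text tokens, and combines the following:

Publicly available documents filtered for quality, selected high-quality educational data, and code.
Newly created synthetic "textbook-like" data for teaching math, coding, common-sense reasoning, and general knowledge of the world (science, daily activities, theory of mind, and so on).
High-quality human-labeled data in chat format.
Selected high-quality interleaved image-text data.
Synthetic and publicly available image, multi-image, and video data.
Anonymized in-house speech-text pair data with strong or weak transcriptions.
Selected high-quality publicly available and anonymized in-house speech data with task-specific supervision.
Selected synthetic speech data.
Synthetic vision-speech data.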

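Fine-tuning

Basic supervised fine-tuning (SFT) examples are provided separately for speech and for vision.

Safety measures

The Phi-4 family of models uses a robust safety post-training approach that draws on a variety of open-source and in-house generated datasets. Safety alignment combines supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning from human feedback (RLHF). It uses human-labeled and synthetic English-language datasets, including publicly available datasets focused on helpfulness and harmlessness, as well as questions and answers targeting multiple safety categories. For non-English languages, existing datasets were extended through machine translation. Speech safety datasets were generated by running the text safety datasets through the Azure TTS (text-to-speech) service, for both English and non-English languages. Vision (text and image) safety datasets were created to cover the harm categories identified in public and internal multimodal responsible AI (RAI) datasets.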