Phi-4-multimodal-instruct open-source model - accepts text, image, and audio input and generates text output

Phi 4 Multimodal Instruct

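Developed by mjtechguy

Phi-4-multimodal-instruct is a lightweight open-source multimodal foundation model that accepts text, image, and audio input, generates text output, and has a 128K-token context length.

Multimodal fusion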

Transformers

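Multilingual · License: MIT · #multimodal-instruct #lightweight-128K-context #speech-vision-text-fusion

Downloads: 18

Released: 2/28/2025

Model overview

The model builds on the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. An enhancement process combining supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback (RLHF) targets precise instruction following and safety.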

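Model features

Multimodal support

Accepts text, image, and audio input and generates text output, with a 128K-token context length.

Multilingual support

Handles text, vision, and audio processing in multiple languages, covering major world languages.

High performance

Outperforms WhisperV3 and SeamlessM4T-v2-Large on automatic speech recognition and speech translation tasks, and is listed first on the Hugging Face OpenASR leaderboard.

Lightweight

Suited to memory- or compute-constrained environments and latency-sensitive scenarios.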

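Model capabilities

Text generation

Image understanding

Speech recognition

Speech translation

Speech summarization

Visual question answering

Optical character recognition

Chart and table understanding

Multi-image comparison

Multi-image or video clip summarization

Audio understanding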

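Use cases

Business applications

Intelligent customer service

Delivers accurate customer-service responses from multimodal input.

Speech translation

Translates speech into multiple languages in real time to support cross-language communication.

Education

Visual math problem solving

Solves complex math problems from image input.

Multilingual learning

Provides learning assistance across multilingual text and speech.

Research

Multimodal research

Supports research and development of multimodal models.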

🚀 Phi-4-multimodal-instruct

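Phi-4-multimodal-instruct is a lightweight open-source multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs, generates text outputs, and has a 128K-token context length. It underwent an enhancement process that combines supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback (RLHF) to support precise instruction following and safety measures.

🚀 Quick Start

The Phi-4 family has been integrated into version 4.48.2 of transformers. The installed transformers version can be verified with: pip list | grep transformers.

Examples of required packages: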

flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.3.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.13.2
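
As a complement to pip list, the pinned versions above can be checked from Python. The following is a minimal sketch using the standard-library importlib.metadata module; the helper names installed_version and find_mismatches are hypothetical and not part of any Phi-4 tooling, and treating local build tags such as "+cu124" as matching is an assumption of this sketch.

```python
from importlib import metadata

# Pinned versions copied from the dependency list above.
PINNED = {
    "flash_attn": "2.7.4.post1",
    "torch": "2.6.0",
    "transformers": "4.48.2",
    "accelerate": "1.3.0",
    "soundfile": "0.13.1",
    "pillow": "11.1.0",
    "scipy": "1.15.2",
    "torchvision": "0.21.0",
    "backoff": "2.2.1",
    "peft": "0.13.2",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # Assumption: local build tags such as "2.6.0+cu124" count as matching "2.6.0".
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The lookup parameter exists only so the comparison logic can be exercised without the packages installed.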

Phi-4-multimodal-instruct is also available in Azure AI Studio.

Loading the model locally

After obtaining the Phi-4-multimodal-instruct model checkpoint, the following sample code can be used for inference.

import io
from urllib.request import urlopen

import requests
import soundfile as sf
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig


# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",  # places the weights on the GPU at load time
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation='flash_attention_2',
)

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

# Part 2: Audio Processing
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open audio file
audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))

# Process with the model
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

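✨ Key Features

Multimodal processing: handles text, image, and audio inputs and generates text outputs.
Long context: a 128K-token context length supports longer conversations and more complex tasks.
Multilingual support: supports many languages, including Arabic, Chinese, Czech, and Danish, among others.
Post-training optimization: supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback (RLHF) support precise instruction following and safety measures.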

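📚 Documentation

Input formats

Given the nature of its training data, Phi-4-multimodal-instruct works best with prompts in the following chat formats.

Text chat format

This format is used for general conversation and instructions: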

<|system|>You are a helpful assistant.<|end|><|user|>How to explain Internet for a medieval knight?<|end|><|assistant|>
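
A prompt in this format can be assembled from a list of turns. The sketch below is illustrative only: build_chat_prompt is a hypothetical helper, and rendering earlier assistant turns as <|assistant|>...<|end|> is an assumption extrapolated from the single-turn example above.

```python
def build_chat_prompt(messages):
    """Render (role, content) pairs into the <|role|>content<|end|> chat format."""
    allowed_roles = {"system", "user", "assistant"}
    parts = []
    for role, content in messages:
        if role not in allowed_roles:
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<|{role}|>{content}<|end|>")
    # The trailing assistant tag asks the model to produce the next turn.
    return "".join(parts) + "<|assistant|>"


prompt = build_chat_prompt([
    ("system", "You are a helpful assistant."),
    ("user", "How to explain Internet for a medieval knight?"),
])
print(prompt)
```

With these two turns the helper reproduces the example prompt shown above character for character.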

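Tool-enabled function-calling format

Use this format when the model should produce function calls based on a set of given tools. Provide the available tools in the system prompt, wrapped in <|tool|> and <|/tool|> tokens. Specify the tools as a JSON dump of the tool structure. Example: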

<|system|>You are a helpful assistant with some tools.<|tool|>[{"name": "get_weather_updates", "description": "Fetches weather updates for a given city using the RapidAPI Weather API.", "parameters": {"city": {"description": "The name of the city for which to retrieve weather information.", "type": "str", "default": "London"}}}]<|/tool|><|end|><|user|>What is the weather like in Paris today?<|end|><|assistant|>
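
The tool block can be produced with json.dumps rather than written by hand. This is a minimal sketch: build_tool_prompt is a hypothetical helper, and the tool definition simply mirrors the example above.

```python
import json


def build_tool_prompt(system_text, tools, user_text):
    """Embed a JSON dump of the tool list in the system turn, wrapped in tool tags."""
    tool_block = f"<|tool|>{json.dumps(tools)}<|/tool|>"
    return (
        f"<|system|>{system_text}{tool_block}<|end|>"
        f"<|user|>{user_text}<|end|><|assistant|>"
    )


weather_tool = {
    "name": "get_weather_updates",
    "description": "Fetches weather updates for a given city using the RapidAPI Weather API.",
    "parameters": {
        "city": {
            "description": "The name of the city for which to retrieve weather information.",
            "type": "str",
            "default": "London",
        }
    },
}

prompt = build_tool_prompt(
    "You are a helpful assistant with some tools.",
    [weather_tool],
    "What is the weather like in Paris today?",
)
print(prompt)
```

Serializing with json.dumps keeps the tool list valid JSON even when descriptions contain quotes or other characters that need escaping.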

Vision-language format

This format is used for conversation about an image:

<|user|><|image_1|>Describe the image in detail.<|end|><|assistant|>

For multiple images, insert one image placeholder per image in the prompt, as follows:

<|user|><|image_1|><|image_2|><|image_3|>Summarize the content of the images.<|end|><|assistant|>
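
The numbered placeholders can be generated from the image count. The sketch below uses a hypothetical helper, build_image_prompt; it assumes placeholders are numbered from 1 in the order the images are supplied, which matches the example above but is not otherwise confirmed here.

```python
def build_image_prompt(question, num_images):
    """Prefix the question with one numbered <|image_i|> placeholder per image."""
    if num_images < 1:
        raise ValueError("at least one image is required")
    # Assumption: placeholders are numbered from 1, in the order the images are passed.
    placeholders = "".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    return f"<|user|>{placeholders}{question}<|end|><|assistant|>"


print(build_image_prompt("Summarize the content of the images.", 3))
```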

Speech-language format

This format is used for various speech and audio tasks:

<|user|><|audio_1|>{task prompt}<|end|><|assistant|>

The task prompt varies by task. Automatic speech recognition:

<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>

Automatic speech translation:

<|user|><|audio_1|>Translate the audio to {lang}.<|end|><|assistant|>

Automatic speech translation with chain-of-thought:

<|user|><|audio_1|>Transcribe the audio to text, and then translate the audio to {lang}. Use <sep> as a separator between the original transcript and the translation.<|end|><|assistant|>
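
Because this prompt asks for two outputs joined by <sep>, the decoded response needs to be split afterwards. The following sketch shows one way to do that; both helpers are hypothetical, and it assumes the separator survives decoding as literal text in the response.

```python
COT_TEMPLATE = (
    "Transcribe the audio to text, and then translate the audio to {lang}. "
    "Use <sep> as a separator between the original transcript and the translation."
)


def build_cot_translation_prompt(lang):
    """Fill the target language into the chain-of-thought translation prompt."""
    return f"<|user|><|audio_1|>{COT_TEMPLATE.format(lang=lang)}<|end|><|assistant|>"


def split_cot_response(response):
    """Split a decoded response into (transcript, translation) at the first <sep>."""
    transcript, found, translation = response.partition("<sep>")
    if not found:
        # No separator was emitted; return the raw text as the transcript.
        return response.strip(), None
    return transcript.strip(), translation.strip()


print(build_cot_translation_prompt("French"))
print(split_cot_response("Good morning. <sep> Bonjour."))
```

Returning None for the translation when the separator is missing lets the caller decide whether to retry or fall back to the plain translation prompt.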

Spoken-query question answering:

<|user|><|audio_1|><|end|><|assistant|>

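Vision-speech format

This format is used for conversation about an image together with audio. The audio may contain a spoken query related to the image: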

<|user|><|image_1|><|audio_1|><|end|><|assistant|>

For multiple images, insert one image placeholder per image in the prompt, as follows:

<|user|><|image_1|><|image_2|><|image_3|><|audio_1|><|end|><|assistant|>
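
When prompts mix image and audio placeholders, a mismatch between the placeholders and the media actually supplied is easy to introduce. The sketch below is a hypothetical pre-flight check, not part of the model's processor; requiring each modality to be numbered 1..N in ascending order is an assumption based on the examples above.

```python
import re

PLACEHOLDER = re.compile(r"<\|(image|audio)_(\d+)\|>")


def check_placeholders(prompt, num_images=0, num_audios=0):
    """Raise ValueError unless the prompt numbers each modality 1..N for the N inputs."""
    seen = {"image": [], "audio": []}
    for kind, index in PLACEHOLDER.findall(prompt):
        seen[kind].append(int(index))
    for kind, count in (("image", num_images), ("audio", num_audios)):
        expected = list(range(1, count + 1))
        if seen[kind] != expected:
            raise ValueError(f"{kind} placeholders {seen[kind]} do not match {expected}")


prompt = "<|user|><|image_1|><|image_2|><|image_3|><|audio_1|><|end|><|assistant|>"
check_placeholders(prompt, num_images=3, num_audios=1)  # passes silently
```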

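Vision

Any common RGB or grayscale image format is supported (e.g., ".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp").
The usable resolution depends on GPU memory: higher resolutions and more images produce more tokens and therefore use more GPU memory. Up to 64 crops were supported during training; for a square image this corresponds to a resolution of roughly 8×448 by 8×448 pixels (3584 x 3584). For multi-image input, at most 64 frames are supported, but as the number of input frames grows, the resolution of each frame must be reduced to fit in memory.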
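
Inputs can be screened against these constraints before they reach the processor. The sketch below is a hypothetical pre-filter: the extension set is copied from the examples above (which are not necessarily exhaustive), and select_frames is an illustrative name rather than a library function.

```python
from pathlib import Path

# Extensions copied from the examples above; other formats may also work.
SUPPORTED_EXTENSIONS = {
    ".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp",
}
MAX_FRAMES = 64  # upper bound on frames for multi-image input, as stated above


def select_frames(paths):
    """Keep paths with a listed image extension and enforce the frame limit."""
    frames = [p for p in paths if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS]
    if len(frames) > MAX_FRAMES:
        raise ValueError(f"{len(frames)} frames exceed the limit of {MAX_FRAMES}")
    return frames


print(select_frames(["a.JPG", "notes.txt", "chart.png"]))
```

The check is on file names only; it does not open the files or reduce resolution, which still has to be handled separately when many frames are used.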

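Audio

Any audio format that the soundfile package can load is supported.
For satisfactory performance, keep audio clips to at most 40 seconds. For summarization tasks, the suggested maximum audio length is 30 minutes.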
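
Clip length follows directly from the sample count and sample rate that soundfile returns, so longer recordings can be cut down before inference. The sketch below uses hypothetical helpers and naive fixed-length slicing, which may split words at chunk boundaries; it is an illustration of the length limits, not a recommended segmentation method.

```python
MAX_SPEECH_SECONDS = 40          # suggested limit for most speech tasks
MAX_SUMMARY_SECONDS = 30 * 60    # suggested limit for summarization


def duration_seconds(num_samples, samplerate):
    """Length of a clip in seconds, given its sample count and sample rate."""
    return num_samples / samplerate


def split_audio(audio, samplerate, max_seconds=MAX_SPEECH_SECONDS):
    """Slice a sample sequence into consecutive chunks no longer than max_seconds."""
    chunk = int(max_seconds * samplerate)
    return [audio[start:start + chunk] for start in range(0, len(audio), chunk)]


# 100 seconds of silence at 16 kHz stands in for an array returned by soundfile.read.
samples = [0.0] * (100 * 16000)
chunks = split_audio(samples, 16000)
print([duration_seconds(len(c), 16000) for c in chunks])
```

The same slicing works on the NumPy arrays that soundfile returns, since it relies only on len() and slice indexing.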

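🔧 Technical Details

Model architecture

Phi-4-multimodal-instruct is a multimodal transformer model with 5.6 billion parameters. It uses the pretrained Phi-4-Mini-Instruct as its backbone language model, together with advanced vision and speech encoders and adapters.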
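
The parameter count gives a rough lower bound on the memory needed just to hold the weights. The arithmetic below is a back-of-the-envelope sketch: weight_memory_gb is a hypothetical helper, and the estimate ignores activations, the KV cache, and framework overhead, all of which add to real usage.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 5.6e9  # parameter count stated above

for dtype, size in (("float32", 4), ("bfloat16/float16", 2)):
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, size):.1f} GB")
```

At two bytes per parameter the weights alone come to about 11.2 GB, and about 22.4 GB at four bytes per parameter.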

Training data

The training data for Phi-4-multimodal-instruct draws on a wide variety of sources, totals 5 trillion text tokens, and combines the following:

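Publicly available documents filtered for quality, selected high-quality educational data, and code.
Newly created synthetic, "textbook-like" data for teaching math, coding, common-sense reasoning, and general knowledge of the world (science, daily activities, theory of mind, etc.).
High-quality human-labeled data in chat format.
Selected high-quality interleaved image-text data.
Synthetic and publicly available image, multi-image, and video data.
Anonymized in-house speech-text pair data with strong/weak transcriptions.
Selected high-quality publicly available and anonymized in-house speech data with task-specific supervision.
Selected synthetic speech data.
Synthetic vision-speech data.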

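Fine-tuning

Basic examples of supervised fine-tuning (SFT) are provided separately for speech and for vision.

Safety measures

The Phi-4 family of models adopts a robust safety post-training approach that draws on a variety of open-source and in-house generated datasets. Safety alignment combines supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning from human feedback (RLHF), using human-labeled and synthetic English-language datasets, including publicly available datasets focused on helpfulness and harmlessness, as well as various questions and answers targeting multiple safety categories. For non-English languages, the existing datasets were extended through machine translation. Speech safety datasets were generated by running the text safety datasets through the Azure TTS (text-to-speech) service, for both English and non-English languages. Vision (text and image) safety datasets were created to cover harm categories identified in both public and internal multimodal responsible AI (RAI) datasets.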