Phi-4-multimodal-instruct is a lightweight open-source multimodal foundation model that integrates language, vision, and speech research and datasets from Phi-3.5 and 4.0 models. It supports text, image, and audio inputs to generate text outputs, with a context length of 128K tokens.
It delivers precise instruction following and strong safety behavior through an enhanced post-training process of supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback (RLHF). Suitable for a wide range of commercial and research applications, it supports multilingual and multimodal tasks.
Model Features
Multimodal support
Supports simultaneous text, image, and audio inputs to generate text outputs, enabling cross-modal understanding and interaction.
Long-context processing
Features a 128K token context length, capable of handling long documents and complex conversations.
Multilingual capabilities
Supports text processing in 23 languages and audio processing in 8 languages, with strong cross-language abilities.
Lightweight design
Optimized architecture suitable for memory/computation-constrained environments and low-latency scenarios.
Reinforcement learning optimization
Enhanced model performance through supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback (RLHF); the standard DPO objective is sketched below for reference.
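As background (not taken from the original card), direct preference optimization tunes the policy directly on preference pairs (x, y_w, y_l) without training a separate reward model. A minimal sketch of the standard DPO objective, assuming the generic formulation rather than the exact recipe used for this model, is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here y_w and y_l are the preferred and rejected responses, π_ref is the frozen reference policy, σ is the logistic function, and β controls how far the tuned policy may drift from the reference. How this step was combined with supervised fine-tuning and RLHF for Phi-4-multimodal-instruct is not detailed here.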
Model Capabilities
Text generation
Image understanding
Speech recognition
Speech translation
Speech summarization
Visual question answering
Optical character recognition
Chart and table understanding
Multi-image comparison
Video clip summarization
Audio understanding
Function and tool calling (a generic calling loop is sketched after this list)
Mathematical and logical reasoning
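To make the function- and tool-calling capability concrete, here is a minimal, hypothetical sketch of the usual pattern: the application advertises a tool schema in the prompt, the model replies with a JSON tool call, and the application parses and executes it. The schema shape, the get_weather tool, and the JSON reply format are illustrative assumptions, not the model's documented calling convention.

```python
# Hypothetical tool-calling loop; the JSON schema and reply format below are
# illustrative assumptions, not Phi-4's documented convention.
import json

# Tool description shown to the model as part of the prompt (assumed shape).
TOOLS = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {"city": {"type": "string"}},
}]

def get_weather(city: str) -> str:
    # Stand-in implementation for the example.
    return f"Sunny, 21 C in {city}"

def run_tool_call(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and execute it."""
    call = json.loads(model_output)
    if call.get("name") == "get_weather":
        return get_weather(**call.get("arguments", {}))
    raise ValueError(f"Unknown tool: {call.get('name')}")

# Suppose the model, shown TOOLS and asked about the weather in Seattle, replied:
model_reply = '{"name": "get_weather", "arguments": {"city": "Seattle"}}'
print(run_tool_call(model_reply))  # -> Sunny, 21 C in Seattle
```

In a real integration, the tool result would be appended to the conversation and the model queried again to produce the final natural-language answer.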
Use Cases
Speech processing
Speech recognition
Converts speech to text, supporting multiple languages.
Word error rate as low as 6.14%, ranking first on the Huggingface OpenASR leaderboard.
Speech translation
Real-time translation of speech from one language to text in another language.
Performance surpasses WhisperV3 and SeamlessM4T-v2-Large.
Speech summarization
Extracts key information from speech content to generate summaries.
Performance approaches that of GPT-4o.
Visual understanding
Visual question answering
Answers questions based on image content.
Scores 68.9 on the speech-queried AI2D benchmark (s_AI2D), approaching Gemini-2.0-Flash (69.4).
Math problem solving
Solves complex math problems through visual input.
Demonstrates strong image processing and equation-solving capabilities.
Intelligent assistant
Travel planning
Helps plan travel routes through speech analysis.
Demonstrates advanced audio processing and recommendation capabilities.
Content creation
Generates stories or content based on multimodal input.
Demonstrates creative generation capabilities in story-generation demos.
Phi-4-multimodal-instruct
Phi-4-multimodal-instruct is a lightweight open multimodal foundation model. It processes text, image, and audio inputs, generating text outputs, and has a 128K token context length. It supports multiple languages and various multimodal tasks, enhancing research and application development in the field of AI.
Quick Start
The original README does not include specific quick-start content; an illustrative usage sketch is provided under Usage Examples below.
Features
Multimodal Processing: Capable of handling text, image, and audio inputs, and generating text outputs.
Multilingual Support: Supports a wide range of languages including Arabic, Chinese, English, etc.
Strong Reasoning: Demonstrates strong reasoning abilities, especially in math and logic.
Function and Tool Calling: Supports function and tool calling.
Rich Use Cases: Applicable to various scenarios such as speech recognition, translation, summarization, and image understanding.
Installation
The provided README does not contain installation steps.
Usage Examples
The provided README does not contain code examples.
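Since the source README omits examples, below is a minimal, hypothetical sketch of text-plus-image inference through the Hugging Face transformers library. The model ID, the chat-marker prompt format (<|user|>, <|image_1|>, <|end|>, <|assistant|>), the processor argument names, and the need for trust_remote_code=True are assumptions based on common practice for Phi-family multimodal checkpoints; verify them against the official model card. The sketch assumes transformers, torch, accelerate, pillow, and requests are installed.

```python
# Hypothetical usage sketch; verify the prompt format, processor arguments,
# and model ID against the official model card before relying on it.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed Hugging Face model ID

# The checkpoint ships custom modeling code, so trust_remote_code is assumed to be required.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",  # requires accelerate
)

# Assumed chat-marker format for a single-image question.
prompt = "<|user|><|image_1|>Describe the chart in this image.<|end|><|assistant|>"
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)  # placeholder URL

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)

# Drop the prompt tokens before decoding the generated answer.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

Audio input would be passed analogously through the processor (the speech use cases above, such as transcription and translation, follow the same generate-and-decode pattern), but the exact audio argument name is not given in this excerpt, so it is omitted here.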
Documentation
Model Summary
Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs and generates text outputs, with a 128K-token context length. The model underwent an enhancement process incorporating supervised fine-tuning, direct preference optimization, and RLHF (reinforcement learning from human feedback) to support precise instruction adherence and safety measures.
The languages supported by each modality are as follows: text input covers 23 languages (including Arabic, Chinese, and English), and audio input covers 8 languages (English, Chinese, German, French, Italian, Japanese, Spanish, and Portuguese).
Watch as Phi-4 Multimodal analyzes spoken language to help plan a trip to Seattle, demonstrating its advanced audio processing and recommendation capabilities.
See how Phi-4 Multimodal tackles complex mathematical problems through visual inputs, demonstrating its ability to process and solve equations presented in images.
Explore how Phi-4 Mini functions as an intelligent agent, showcasing its reasoning and task execution abilities in complex scenarios.
Intended Uses
Primary Use Cases
The model is intended for broad multilingual and multimodal commercial and research use. It is suited to general-purpose AI systems and applications that require:
Memory/compute constrained environments
Latency bound scenarios
Strong reasoning (especially math and logic)
Function and tool calling
General image understanding
Optical character recognition
Chart and table understanding
Multiple image comparison
Multi-image or video clip summarization
Speech recognition
Speech translation
Speech QA
Speech summarization
Audio understanding
The model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI powered features.
Use Case Considerations
The model is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language and multimodal models, as well as performance differences across languages, when selecting use cases, and should evaluate and mitigate for accuracy, safety, and fairness before using the model in a specific downstream use case, particularly in high-risk scenarios.
Developers should be aware of and adhere to applicable laws or regulations (including but not limited to privacy, trade compliance laws, etc.) that are relevant to their use case.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
Release Notes
This release of Phi-4-multimodal-instruct is based on valuable user feedback from the Phi-3 series. Previously, users could use a speech recognition model to talk to the Mini and Vision models. To achieve this, users needed a pipeline of two models: one to transcribe the audio to text, and another for the language or vision task. With this pipeline, the core model was not given the full breadth of the input information; for example, it could not directly observe multiple speakers or background noises, nor jointly align speech, vision, and language information in the same representation space.
With Phi-4-multimodal-instruct, a single new open model has been trained across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. The model employs a new architecture, a larger vocabulary for efficiency, multilingual and multimodal support, and improved post-training techniques for instruction following and function calling, together with additional data, leading to substantial gains on key multimodal capabilities.
It is anticipated that Phi-4-multimodal-instruct will greatly benefit app developers and various use cases. The enthusiastic support for the Phi-4 series is greatly appreciated. Feedback on Phi-4 is welcomed and crucial to the model's evolution and improvement. Thank you for being part of this journey!
Model Quality
To understand its capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform (see Appendix A for the benchmark methodology). Refer to the Phi-4-Mini-Instruct model card for details of the language benchmarks. A high-level overview of model quality on representative speech and vision benchmarks follows:
Speech
Phi-4-multimodal-instruct was observed as:
Having strong automatic speech recognition (ASR) and speech translation (ST) performance, surpassing the expert ASR model WhisperV3 and the ST model SeamlessM4T-v2-Large.
Ranking number 1 on the Hugging Face OpenASR leaderboard with a word error rate (WER; defined below) of 6.14%, compared with 6.5% for the previous best model, as of January 17, 2025.
Being the first open-source model that can perform speech summarization, with performance close to GPT-4o.
Having a gap with closed models, e.g. Gemini-1.5-Flash and GPT-4o-realtime-preview, on the speech QA task. Work is underway to improve this capability in future iterations.
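For reference (not part of the original card), word error rate is the standard ASR metric, computed from the word-level edit distance between the model's transcript and the reference transcript:

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

where S, D, and I are the numbers of substituted, deleted, and inserted words relative to the reference, and N is the number of words in the reference. A WER of 6.14% therefore corresponds to roughly six word-level errors per 100 reference words; lower is better.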
Speech Recognition (lower is better)
The performance of Phi-4-multimodal-instruct on the aggregated benchmark datasets:
The performance of Phi-4-multimodal-instruct on different languages, averaging the WERs of CommonVoice and FLEURS:
Speech Translation (higher is better)
Translating from German, Spanish, French, Italian, Japanese, Portuguese, and Chinese to English:
Translating from English to German, Spanish, French, Italian, Japanese, Portuguese, and Chinese. Note that WhisperV3 does not support this capability:
Speech Summarization (higher is better)
Speech QA
MT bench scores are scaled by 10x to match the score range of MMMLU:
Audio Understanding
AIR bench scores are scaled by 10x to match the score range of MMAU:
Vision
Vision-Speech tasks
Phi-4-multimodal-instruct can process image and audio inputs together. The following table shows the model quality when the input query for the visual content is synthetic speech, on chart/table understanding and document reasoning tasks. Compared with other existing state-of-the-art omni models that accept both audio and visual signals as input, Phi-4-multimodal-instruct achieves much stronger performance on multiple benchmarks.
| Benchmarks | Phi-4-multimodal-instruct | InternOmni-7B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Gemini-1.5-Pro |
|---|---|---|---|---|---|
| s_AI2D | 68.9 | 53.9 | 62.0 | 69.4 | 67.7 |
| s_ChartQA | 69.0 | 56.1 | 35.5 | 51.3 | 46.9 |
| s_DocVQA | 87.3 | 79.9 | 76.0 | 80.3 | 78.2 |
| s_InfoVQA | 63.7 | 60.3 | 59.4 | 63.6 | 66.1 |
| **Average** | 72.2 | 62.6 | 58.2 | 66.2 | 64.7 |
Vision tasks
To understand the vision capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of zero-shot benchmarks using an internal benchmark platform. A high-level overview of model quality on representative benchmarks follows:
| Category | Dataset | Phi-4-multimodal-ins | Phi-3.5-vision-ins | Qwen 2.5-VL-3B-ins | InternVL 2.5-4B | Qwen 2.5-VL-7B-ins | InternVL 2.5-8B | Gemini-2.0-Flash-Lite-preview-0205 | Gemini-2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | GPT-4o-2024-11-20 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Popular aggregated benchmark | MMMU | 55.1 | 43.0 | 47.0 | 48.3 | 51.8 | 50.6 | 54.1 | 64.7 | 55.8 | 61.7 |
| | MMBench (dev-en) | 86.7 | 81.9 | 84.3 | 86.8 | 87.8 | 88.2 | 85.0 | 90.0 | 86.7 | 89.0 |
| | MMMU-Pro (std/vision) | 38.5 | 21.8 | 29.9 | 32.4 | 36.9 | 34.4 | 45.1 | 54.4 | 54.3 | 53.0 |
| Visual science reasoning | ScienceQA Visual (img-test) | 97.5 | 91.3 | 79.4 | 96.2 | 87.7 | 97.3 | 85.0 | 88.3 | 81.2 | 88.2 |
| Visual math reasoning | MathVista (testmini) | 62.4 | 43.9 | 60.8 | 51.2 | 67.8 | 56.7 | 57.6 | 47.2 | 56.9 | 56.1 |
| | InterGPS | 48.6 | 36.3 | 48.3 | 53.7 | 52.7 | 54.1 | 57.9 | 65.4 | 47.1 | 49.1 |
| Chart & table reasoning | AI2D | 82.3 | 78.1 | 78.4 | 80.0 | 82.6 | 83.0 | 77.6 | 82.1 | 70.6 | 83.8 |
| | ChartQA | 81.4 | 81.8 | 80.0 | 79.1 | 85.0 | 81.0 | 73.0 | 79.0 | 78.4 | 75.1 |
| | DocVQA | 93.2 | 69.3 | 93.9 | 91.6 | 95.7 | 93.0 | 91.2 | 92.1 | 95.2 | 90.9 |
| | InfoVQA | 72.7 | 36.6 | 77.1 | 72.1 | 82.6 | 77.6 | 73.0 | 77.8 | 74.3 | 71.9 |
| Document Intelligence | TextVQA (val) | 75.6 | 72.0 | 76.8 | 70.9 | 77.7 | 74.8 | 72.9 | 74.4 | 58.6 | 73.1 |
| | OCR Bench | 84.4 | 63.8 | 82.2 | 71.6 | 87.7 | 74.8 | 75.7 | 81.0 | 77.0 | 77.7 |
| Object visual presence verification | POPE | 85.6 | 86.1 | 87.9 | 89.4 | 87.5 | 89.1 | 87.5 | 88.0 | 82.6 | 86.5 |
| Multi-image perception | BLINK | 61.3 | 57.0 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |
| | Video MME 16 frames | 55.0 | 50.8 | 56.5 | 57.3 | 58.2 | 58.7 | 58.8 | 65.5 | 60.2 | 68.2 |
| | **Average** | 72.0 | 60.9 | 68.7 | 68.8 | 73.1 | 71.1 | 70.2 | 74.3 | 69.1 | 72.4 |
Visual Perception
Below are the comparison results on existing multi-image tasks. On average, Phi-4-multimodal-instruct outperforms competitor models of the same size and is competitive with much larger models on multi-frame capabilities.
BLINK is an aggregated benchmark with 14 visual tasks that humans can solve very quickly but are still hard for current multimodal LLMs.
| Dataset | Phi-4-multimodal-instruct | Qwen2.5-VL-3B-Instruct | InternVL 2.5-4B | Qwen2.5-VL-7B-Instruct | InternVL 2.5-8B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | GPT-4o-2024-11-20 |
|---|---|---|---|---|---|---|---|---|---|
| Art Style | 86.3 | 58.1 | 59.8 | 65.0 | 65.0 | 76.9 | 76.9 | 68.4 | 73.5 |
| Counting | 60.0 | 67.5 | 60.0 | 66.7 | 71.7 | 45.8 | 69.2 | 60.8 | 65.0 |
| Forensic Detection | 90.2 | 34.8 | 22.0 | 43.9 | 37.9 | 31.8 | 74.2 | 63.6 | 71.2 |
| Functional Correspondence | 30.0 | 20.0 | 26.9 | 22.3 | 27.7 | 48.5 | 53.1 | 34.6 | 42.3 |
| IQ Test | 22.7 | 25.3 | 28.7 | 28.7 | 28.7 | 28.0 | 30.7 | 20.7 | 25.3 |
| Jigsaw | 68.7 | 52.0 | 71.3 | 69.3 | 53.3 | 62.7 | 69.3 | 61.3 | 68.7 |
| Multi-View Reasoning | 76.7 | 44.4 | 44.4 | 54.1 | 45.1 | 55.6 | 41.4 | 54.9 | 54.1 |
| Object Localization | 52.5 | 55.7 | 53.3 | 53.3 | 53.3 | 55.6 | 51.4 | 54.9 | 54.1 |
License
This model is released under the MIT License.