Phi-4-multimodal-instruct
Phi-4-multimodal-instruct is a lightweight open multimodal foundation model. It can process text, image, and audio inputs, and generate text outputs. With a 128K token context length, it supports multiple languages and has undergone enhancement processes to ensure precise instruction adherence and safety.
Quick Start
This README does not include specific quick-start steps; refer to the official model page and documentation for full instructions on getting started with the model.
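Until official steps are available, below is a minimal, hedged sketch of how one might load the model with Hugging Face `transformers`. The repository id `microsoft/Phi-4-multimodal-instruct`, the `<|user|>`/`<|assistant|>` chat markup, and the generation settings are assumptions based on common Phi model-card conventions, not details taken from this README.

```python
# Hedged sketch only: the repo id, chat markup, and processor behavior below
# are assumptions; consult the official model page for the supported API.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed Hugging Face repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumed dtype; adjust for your hardware
    device_map="auto",
    trust_remote_code=True,
)

# Plain-text chat turn; the <|user|>/<|assistant|> markup is assumed.
prompt = "<|user|>Summarize the benefits of a 128K context window.<|end|><|assistant|>"
inputs = processor(text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

# Drop the prompt tokens and decode only the newly generated text.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```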
Features
- Multimodal Capability: Processes text, image, and audio inputs, generating text outputs.
- Multilingual Support: Supports a wide range of languages including Arabic, Chinese, Czech, etc.
- Enhanced Architecture: Incorporates supervised fine-tuning, direct preference optimization, and RLHF for better performance.
- Large Token Context: Comes with a 128K token context length.
Documentation
Model Summary
Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and Phi-4.0 models. The model processes text, image, and audio inputs, generates text outputs, and comes with a 128K token context length. The model underwent an enhancement process incorporating supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures.
The languages that each modality supports are the following:
- Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
- Vision: English
- Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
Intended Uses
Primary Use Cases
The model is intended for broad multilingual and multimodal commercial and research use. It is suited to general-purpose AI systems and applications that require:
- Memory/compute constrained environments
- Latency bound scenarios
- Strong reasoning (especially math and logic)
- Function and tool calling
- General image understanding
- Optical character recognition
- Chart and table understanding
- Multiple image comparison
- Multi-image or video clip summarization
- Speech recognition
- Speech translation
- Speech QA
- Speech summarization
- Audio understanding
The model is designed to accelerate research on language and multimodal models and to serve as a building block for generative AI-powered features.
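As an illustration of the image-understanding, OCR, and chart-understanding use cases listed above, the following is a hedged sketch of single-image question answering. It reuses the `processor` and `model` objects from the Quick Start sketch; the `<|image_1|>` placeholder and the processor's `images` argument are assumptions rather than documented behavior, and the URL is a placeholder.

```python
# Hedged sketch: the <|image_1|> placeholder and the processor's `images`
# argument are assumptions; check the official examples before relying on them.
import requests
from PIL import Image

image_url = "https://example.com/sample-chart.png"  # placeholder URL
image = Image.open(requests.get(image_url, stream=True).raw)

prompt = (
    "<|user|><|image_1|>"
    "What trend does this chart show? Answer in one sentence.<|end|><|assistant|>"
)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```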
Use Case Considerations
The model is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language and multimodal models, as well as performance differences across languages, when selecting use cases, and should evaluate and mitigate for accuracy, safety, and fairness before using the model in a specific downstream use case, particularly for high-risk scenarios.
Developers should be aware of and adhere to applicable laws or regulations (including but not limited to privacy, trade compliance laws, etc.) that are relevant to their use case.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
Release Notes
This release of Phi-4-multimodal-instruct is based on valuable user feedback from the Phi-3 series. Previously, users could use a speech recognition model to talk to the Mini and Vision models. To achieve this, users needed a pipeline of two models: one model to transcribe the audio to text, and another model for the language or vision tasks. This pipeline meant that the core model was not provided the full breadth of the input information; for example, it could not directly observe multiple speakers or background noise, nor jointly align speech, vision, and language information in the same representation space.
With Phi-4-multimodal-instruct, a single new open model has been trained across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. The model employs a new architecture, a larger vocabulary for efficiency, multilingual and multimodal support, and improved post-training techniques for instruction following and function calling, along with additional data, leading to substantial gains on key multimodal capabilities.
It is anticipated that Phi-4-multimodal-instruct will greatly benefit app developers and various use cases. The enthusiastic support for the Phi-4 series is greatly appreciated. Feedback on Phi-4 is welcomed and crucial to the model's evolution and improvement. Thank you for being part of this journey!
Model Quality
To understand the capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform (see Appendix A for benchmark methodology). Users can refer to the Phi-4-Mini-Instruct model card for details of the language benchmarks. A high-level overview of the model quality on representative speech and vision benchmarks follows:
Speech
Phi-4-multimodal-instruct was observed as:
- Having strong automatic speech recognition (ASR) and speech translation (ST) performance, surpassing the expert ASR model WhisperV3 and the expert ST model SeamlessM4T-v2-Large.
- Ranking number 1 on the Hugging Face OpenASR leaderboard with a word error rate of 6.14%, compared with 6.5% for the previous best model, as of January 17, 2025.
- Being the first open-sourced model that can perform speech summarization, with performance close to GPT-4o.
- Having a gap with closed models, e.g., Gemini-1.5-Flash and GPT-4o-realtime-preview, on the speech QA task. Work is being undertaken to improve this capability in future iterations.
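To make the speech capabilities above concrete, the following hedged sketch shows how automatic speech recognition might be invoked, again reusing the `processor` and `model` objects from the Quick Start sketch. The `<|audio_1|>` placeholder and the `audios` argument (a list of `(waveform, sampling_rate)` tuples) are assumptions; the officially supported audio input format may differ.

```python
# Hedged sketch: the audio input handling below is assumed, not documented here.
import soundfile as sf

# Load a local mono recording; soundfile returns the waveform as a NumPy
# array plus its sampling rate.
waveform, sampling_rate = sf.read("speech_sample.wav")

prompt = "<|user|><|audio_1|>Transcribe the audio to text.<|end|><|assistant|>"
inputs = processor(
    text=prompt,
    audios=[(waveform, sampling_rate)],  # assumed argument name and format
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
transcript = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(transcript)
```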
Speech Recognition (lower is better)
The performance of Phi-4-multimodal-instruct on the aggregated benchmark datasets:

The performance of Phi-4-multimodal-instruct on different languages, averaging the WERs of CommonVoice and FLEURS:

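For reference, the word error rate (WER) reported above is the standard ASR metric, and the per-language figures are read here as the unweighted mean of the CommonVoice and FLEURS scores:

$$
\mathrm{WER} = \frac{S + D + I}{N},
\qquad
\mathrm{WER}_{\text{lang}} = \frac{1}{2}\left(\mathrm{WER}_{\text{CommonVoice}} + \mathrm{WER}_{\text{FLEURS}}\right)
$$

where $S$, $D$, and $I$ count the substituted, deleted, and inserted words relative to the reference transcript and $N$ is the number of reference words, so lower values indicate better recognition.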
Speech Translation (higher is better)
Translating from German, Spanish, French, Italian, Japanese, Portuguese, and Chinese to English:

Translating from English to German, Spanish, French, Italian, Japanese, Portuguese, and Chinese. Note that WhisperV3 does not support this capability:

Speech Summarization (higher is better)

Speech QA
MT bench scores are scaled by 10x to match the score range of MMMLU:

Audio Understanding
AIR bench scores are scaled by 10x to match the score range of MMAU:

Vision
Vision-Speech tasks
Phi-4-multimodal-instruct is capable of processing image and audio together. The following table shows the model quality when the input query for vision content is synthetic speech, on chart/table understanding and document reasoning tasks. Compared to other existing state-of-the-art omni models that accept both audio and visual signals as input, Phi-4-multimodal-instruct achieves much stronger performance on multiple benchmarks.
| Benchmarks | Phi-4-multimodal-instruct | InternOmni-7B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Gemini-1.5-Pro |
| --- | --- | --- | --- | --- | --- |
| s_AI2D | 68.9 | 53.9 | 62.0 | 69.4 | 67.7 |
| s_ChartQA | 69.0 | 56.1 | 35.5 | 51.3 | 46.9 |
| s_DocVQA | 87.3 | 79.9 | 76.0 | 80.3 | 78.2 |
| s_InfoVQA | 63.7 | 60.3 | 59.4 | 63.6 | 66.1 |
| Average | 72.2 | 62.6 | 58.2 | 66.2 | 64.7 |
Vision tasks
To understand the vision capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of zero-shot benchmarks using an internal benchmark platform. A high-level overview of the model quality on representative benchmarks follows:
| Dataset | Phi-4-multimodal-ins | Phi-3.5-vision-ins | Qwen 2.5-VL-3B-ins | Intern VL 2.5-4B | Qwen 2.5-VL-7B-ins | Intern VL 2.5-8B | Gemini 2.0-Flash Lite-preview-0205 | Gemini-2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | GPT-4o-2024-11-20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Popular aggregated benchmark | | | | | | | | | | |
| MMMU | 55.1 | 43.0 | 47.0 | 48.3 | 51.8 | 50.6 | 54.1 | 64.7 | 55.8 | 61.7 |
| MMBench (dev-en) | 86.7 | 81.9 | 84.3 | 86.8 | 87.8 | 88.2 | 85.0 | 90.0 | 86.7 | 89.0 |
| MMMU-Pro (std/vision) | 38.5 | 21.8 | 29.9 | 32.4 | 36.9 | 34.4 | 45.1 | 54.4 | 54.3 | 53.0 |
| Visual science reasoning | | | | | | | | | | |
| ScienceQA Visual (img-test) | 97.5 | 91.3 | 79.4 | 96.2 | 87.7 | 97.3 | 85.0 | 88.3 | 81.2 | 88.2 |
| Visual math reasoning | | | | | | | | | | |
| MathVista (testmini) | 62.4 | 43.9 | 60.8 | 51.2 | 67.8 | 56.7 | 57.6 | 47.2 | 56.9 | 56.1 |
| InterGPS | 48.6 | 36.3 | 48.3 | 53.7 | 52.7 | 54.1 | 57.9 | 65.4 | 47.1 | 49.1 |
| Chart & table reasoning | | | | | | | | | | |
| AI2D | 82.3 | 78.1 | 78.4 | 80.0 | 82.6 | 83.0 | 77.6 | 82.1 | 70.6 | 83.8 |
| ChartQA | 81.4 | 81.8 | 80.0 | 79.1 | 85.0 | 81.0 | 73.0 | 79.0 | 78.4 | 75.1 |
| DocVQA | 93.2 | 69.3 | 93.9 | 91.6 | 95.7 | 93.0 | 91.2 | 92.1 | 95.2 | 90.9 |
| InfoVQA | 72.7 | 36.6 | 77.1 | 72.1 | 82.6 | 77.6 | 73.0 | 77.8 | 74.3 | 71.9 |
| Document Intelligence | | | | | | | | | | |
| TextVQA (val) | 75.6 | 72.0 | 76.8 | 70.9 | 77.7 | 74.8 | 72.9 | 74.4 | 58.6 | 73.1 |
| OCR Bench | 84.4 | 63.8 | 82.2 | 71.6 | 87.7 | 74.8 | 75.7 | 81.0 | 77.0 | 77.7 |
| Object visual presence verification | | | | | | | | | | |
| POPE | 85.6 | 86.1 | 87.9 | 89.4 | 87.5 | 89.1 | 87.5 | 88.0 | 82.6 | 86.5 |
| Multi-image perception | | | | | | | | | | |
| BLINK | 61.3 | 57.0 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |
| Video MME 16 frames | 55.0 | 50.8 | 56.5 | 57.3 | 58.2 | 58.7 | 58.8 | 65.5 | 60.2 | 68.2 |
| Average | 72.0 | 60.9 | 68.7 | 68.8 | 73.1 | 71.1 | 70.2 | 74.3 | 69.1 | 72.4 |

Visual Perception
Below are the comparison results on existing multi-image tasks. On average, Phi-4-multimodal-instruct outperforms competitor models of the same size and is competitive with much larger models on multi-frame capabilities.
BLINK is an aggregated benchmark with 14 visual tasks that humans can solve very quickly but that are still hard for current multimodal LLMs.
| Dataset | Phi-4-multimodal-instruct | Qwen2.5-VL-3B-Instruct | InternVL 2.5-4B | Qwen2.5-VL-7B-Instruct | InternVL 2.5-8B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | GPT-4o-2024-11-20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Art Style | 86.3 | 58.1 | 59.8 | 65.0 | 65.0 | 76.9 | 76.9 | 68.4 | 73.5 |
| Counting | 60.0 | 67.5 | 60.0 | 66.7 | 71.7 | 45.8 | 69.2 | 60.8 | 65.0 |
| Forensic Detection | 90.2 | 34.8 | 22.0 | 43.9 | 37.9 | 31.8 | 74.2 | 63.6 | 71.2 |
| Functional Correspondence | 30.0 | 20.0 | 26.9 | 22.3 | 27.7 | 48.5 | 53.1 | 34.6 | 42.3 |
| IQ Test | 22.7 | 25.3 | 28.7 | 28.7 | 28.7 | 28.0 | 30.7 | 20.7 | 25.3 |
| Jigsaw | 68.7 | 52.0 | 71.3 | 69.3 | 53.3 | 62.7 | 69.3 | 61.3 | 68.7 |
| Multi-View Reasoning | 76.7 | 44.4 | 44.4 | 54.1 | 45.1 | 55.6 | 41.4 | 54.9 | 54.1 |
| Object Localization | 52.5 | 55.7 | 53.3 | 53.3 | 55.7 | 55.7 | 55.7 | 53.3 | 55.7 |
License
This model is released under the MIT license.