Phi-4 Multimodal Instruct ONNX
An ONNX version of the Phi-4 multimodal model, quantized to int4 precision for accelerated inference with ONNX Runtime. It accepts text, image, and audio inputs and generates text output.
Release Time: 2/25/2025
Model Overview
Phi-4 Multimodal Instruct is a lightweight open-source multimodal foundation model that builds on the language, vision, and speech research behind the Phi-3.5 and 4.0 models, and supports a context length of 128K tokens.
Model Features
Multimodal support
Accepts text, image, and audio inputs and produces text output.
Efficient inference
Quantized to int4 precision and accelerated with ONNX Runtime; a loading-and-generation sketch follows this list.
Long context support
Supports a context length of 128K tokens.
Lightweight
A compact open-source multimodal foundation model; the int4 ONNX build keeps the memory footprint low enough for resource-constrained deployments.
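
As a concrete illustration of the int4 + ONNX Runtime path, here is a minimal text-only generation sketch using the onnxruntime-genai Python package. The model directory path, prompt wording, and search options are placeholders, and API details vary between onnxruntime-genai releases, so treat this as a starting point rather than the official sample.

```python
# Text-only generation sketch with onnxruntime-genai.
# The model path is a placeholder for the downloaded int4 ONNX directory.
import onnxruntime_genai as og

model = og.Model("./phi-4-multimodal-instruct-onnx")  # hypothetical local path
processor = model.create_multimodal_processor()
stream = processor.create_stream()

# Phi-4 chat template: a user turn, then an open assistant turn to complete.
prompt = "<|user|>Summarize int4 quantization in one sentence.<|end|><|assistant|>"
inputs = processor(prompt)  # no images or audio bound for plain text

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

# Stream decoded tokens as they are produced.
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()
```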
Model Capabilities
Text generation
Image analysis
Speech recognition
Speech summarization
Speech translation
Visual question answering
Use Cases
Speech processing
Automatic speech recognition
Convert speech to text.
Speech summarization
Generate summaries of speech content.
Speech translation
Translate speech content into other languages. (A prompt sketch covering these speech tasks follows this list.)
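
The speech tasks above differ mainly in the instruction placed next to the audio clip. Below is a hedged sketch of automatic speech recognition via the onnxruntime-genai multimodal processor; the file name is a placeholder, and the `<|audio_1|>` prompt convention and `Audios` API follow recent onnxruntime-genai releases, which may differ from the version you have installed.

```python
# Speech-to-text sketch: same generation loop, but the processor binds audio.
import onnxruntime_genai as og

model = og.Model("./phi-4-multimodal-instruct-onnx")  # hypothetical local path
processor = model.create_multimodal_processor()
stream = processor.create_stream()

# <|audio_1|> marks where the audio clip is injected into the prompt.
# Swap the instruction for the other speech tasks, e.g.
#   "Summarize the audio." or "Translate the audio to French."
prompt = "<|user|><|audio_1|>Transcribe the audio to text.<|end|><|assistant|>"
audios = og.Audios.open("speech_sample.wav")  # placeholder file name
inputs = processor(prompt, audios=audios)

params = og.GeneratorParams(model)
params.set_search_options(max_length=1024)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()
```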
Visual processing
Visual question answering
Answer questions based on image content; see the sketch after this list.
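
Visual question answering follows the same pattern, with an image placeholder in the prompt instead of an audio one. A minimal sketch under the same assumptions (placeholder paths, recent onnxruntime-genai API):

```python
# Visual question answering sketch: bind an image, then ask about it.
import onnxruntime_genai as og

model = og.Model("./phi-4-multimodal-instruct-onnx")  # hypothetical local path
processor = model.create_multimodal_processor()
stream = processor.create_stream()

# <|image_1|> marks where the image is injected into the prompt.
prompt = "<|user|><|image_1|>What objects are visible in this image?<|end|><|assistant|>"
images = og.Images.open("photo.jpg")  # placeholder file name
inputs = processor(prompt, images=images)

params = og.GeneratorParams(model)
params.set_search_options(max_length=512)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()
```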