Mini-Omni2 Open-Source Multimodal Model - Supports Image, Audio, and Text Inputs as well as Voice Dialogue Interaction

Mini-Omni2

Developed by gpt-omni

Mini-Omni2 is a fully interactive multimodal model capable of understanding image, audio, and text inputs, and engaging in end-to-end voice conversations with users.

Multimodal Fusion | Open Source License: MIT | #Real-time voice conversation #Multimodal interaction #End-to-end voice output

Downloads 192

Release Time: 10/15/2024

Model Overview

Mini-Omni2 offers real-time voice output, broad multimodal understanding, and flexible, interruptible speech interaction. It accepts image, voice, and text inputs and responds with text and speech.

Model Features

Multimodal interaction

Capable of understanding image, voice, and text inputs to perform comprehensive tasks.

Real-time voice conversation

Supports end-to-end voice conversation without additional ASR or TTS models.

Interruptible speech

Supports a flexible interruption mechanism that keeps the conversation fluent.

Model Capabilities

Image understanding

Speech recognition

Text generation

Real-time voice output

Multimodal task processing

Use Cases

Smart assistant

Multimodal conversation assistant

Engages in natural interaction with users through voice, images, and text.

Provides a more natural user experience, supporting multiple input methods.

Education

Language learning assistant

Helps users learn English through voice interaction.

Provides real-time voice feedback to enhance learning effectiveness.

🚀 Mini-Omni2

Mini-Omni2 is an omni-interactive model that can understand image, audio, and text inputs and engage in end-to-end voice conversations with users.

🤗 Hugging Face | 📖 GitHub | 📑 Technical report

🚀 Quick Start

Interactive demo

Start server

⚠️ Important Note

Start the server before running the Streamlit or Gradio demo, and set API_URL to the server address when you launch the demo.

sudo apt-get install ffmpeg
conda activate omni
cd mini-omni2
python3 server.py --ip '0.0.0.0' --port 60808

Run streamlit demo

⚠️ Important Note

Run Streamlit on your local machine, with PyAudio installed.

pip install PyAudio==0.2.14
API_URL=http://0.0.0.0:60808/chat streamlit run webui/omni_streamlit.py
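
Because the demos depend on the server already running, it can help to confirm that something is listening at the address in API_URL before launching them. The following is a minimal sketch, not part of the mini-omni2 repository: the helper name `server_reachable` is hypothetical, and the check only opens a TCP connection. It does not send a request to the `/chat` endpoint, whose request format is not described here.

```python
import os
import socket
from urllib.parse import urlparse


def server_reachable(api_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the host and port in api_url succeeds."""
    parsed = urlparse(api_url)
    host = parsed.hostname
    if host is None:
        # The string did not contain a usable host (for example, no scheme).
        return False
    # Fall back to the scheme's default port when none is given.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Default mirrors the address used in the commands in this document.
    api_url = os.environ.get("API_URL", "http://0.0.0.0:60808/chat")
    status = "reachable" if server_reachable(api_url) else "not reachable"
    print(f"{api_url} is {status}")
```

If the check reports that the server is not reachable, start server.py first, or point API_URL at the machine where it is running.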

Local test

conda activate omni
cd mini-omni2
# test run the preset audio samples and questions
python inference_vision.py

✨ Features

✅ Multimodal interaction: understands images, speech, and text, similar to GPT-4o.
✅ Real-time speech-to-speech conversation: no extra ASR or TTS models are required, as in Mini-Omni.

📦 Installation

Create a new conda environment and install the required packages:

conda create -n omni python=3.10
conda activate omni

git clone https://github.com/gpt-omni/mini-omni2.git
cd mini-omni2
pip install -r requirements.txt
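
After installation, a quick check can confirm that the prerequisites named in this document (Python 3.10, ffmpeg, and PyAudio for the Streamlit demo) are present. This is a small sketch under assumptions: the function name `check_environment` is hypothetical, and it only tests whether each tool can be found, not that it works.

```python
import importlib.util
import shutil
import sys


def check_environment() -> dict:
    """Report whether each prerequisite mentioned in this document can be found."""
    return {
        # The conda environment above is created with python=3.10.
        "python_3_10": sys.version_info[:2] == (3, 10),
        # ffmpeg must be on PATH for the server.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # PyAudio is imported as "pyaudio"; only the Streamlit demo needs it.
        "pyaudio": importlib.util.find_spec("pyaudio") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```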

💻 Usage Examples

Demo

⚠️ Important Note

Unmute the demo video before playing it.

https://github.com/user-attachments/assets/ad97ca7f-f8b4-40c3-a7e8-fa54b4edf155

📚 Documentation

Mini-Omni2 Overview

1. Multimodal Modeling

We use multiple sequences as the model's input and output. On the input side, we concatenate image, audio, and text features to perform a range of multimodal tasks, as shown in the following figures. On the output side, we use text-guided delayed parallel output to generate real-time speech responses.
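
To make the idea of delayed parallel output more concrete, here is a toy sketch of a delay pattern over several token streams. It is an illustration under assumptions, not the Mini-Omni2 implementation: the one-step delay per stream, the number of audio streams, the `PAD` value, and the names `apply_delay` and `steps` are all hypothetical. The text stream leads, and each audio stream starts one step later than the stream before it, so every decoding step emits one token per stream in parallel.

```python
from typing import List, Sequence, Tuple

PAD = -1  # hypothetical placeholder id for "nothing emitted at this step"


def apply_delay(streams: Sequence[Sequence[int]], pad: int = PAD) -> List[List[int]]:
    """Shift stream k right by k steps and pad every row to a common length."""
    total_steps = max(len(s) for s in streams) + len(streams) - 1
    delayed = []
    for k, stream in enumerate(streams):
        row = [pad] * k + list(stream)
        row += [pad] * (total_steps - len(row))
        delayed.append(row)
    return delayed


def steps(delayed: Sequence[Sequence[int]]) -> List[Tuple[int, ...]]:
    """Read the delayed rows column by column: one tuple per decoding step."""
    return list(zip(*delayed))


if __name__ == "__main__":
    text = [10, 11, 12]     # toy text token ids (this stream leads)
    audio_0 = [20, 21, 22]  # toy ids for a first audio token stream
    audio_1 = [30, 31, 32]  # toy ids for a second audio token stream
    for t, column in enumerate(steps(apply_delay([text, audio_0, audio_1]))):
        print(t, column)
    # 0 (10, -1, -1)
    # 1 (11, 20, -1)
    # 2 (12, 21, 30)
    # 3 (-1, 22, 31)
    # 4 (-1, -1, 32)
```

In this toy layout the text token for a position is always produced before the audio tokens that depend on it, which is one way to read "text-guided" here.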

2. Multi-stage Training

We propose an efficient alignment training method with three stages: encoder adaptation, modality alignment, and multimodal fine-tuning.

FAQ

1. Does the model support other languages?

No, the model is trained only on English. However, because we use Whisper as the audio encoder, the model can understand other languages that Whisper supports (such as Chinese). The output is in English only.

2. Error: cannot run Streamlit in a local browser with a remote Streamlit server

You need to start Streamlit locally, with PyAudio installed.

📄 License

This project is licensed under the MIT license.

Acknowledgements

Qwen2 as the LLM backbone.
litGPT for training and inference.
whisper for audio encoding.
clip for image encoding.
snac for audio decoding.
CosyVoice for generating synthetic speech.
OpenOrca and MOSS for alignment.

ToDo

[ ] update interruption mechanism
