
Qwen2.5-Omni-7B-GPTQ-Int4

Developed by Qwen
Qwen2.5-Omni is an end-to-end multimodal model capable of perceiving various modalities such as text, images, audio, and video, and generating text and natural speech responses in a streaming manner.
Downloads 389
Release Time: 5/14/2025

Model Overview

Qwen2.5-Omni is an end-to-end multimodal model designed for real-time interaction, supporting the perception and generation of text, images, audio, and video.

Model Features

Omni-modal and Novel Architecture
Supports perception and generation of text, images, audio, and video, built on the Thinker-Talker architecture with TMRoPE (Time-aligned Multimodal RoPE) positional embeddings.
Real-time Voice and Video Chat
Designed for fully real-time interaction, with chunked streaming input and immediate output.
Natural and Robust Speech Generation
Demonstrates exceptional robustness and naturalness in speech generation, surpassing many existing streaming and non-streaming alternatives.
Strong Cross-modal Performance
Performs strongly across all modalities, remaining competitive with similarly sized single-modal models.
End-to-end Voice Instruction Following
Excels at end-to-end voice instruction following, achieving results comparable to those with text input (see the loading sketch after this list).
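As a concrete starting point, here is a minimal loading sketch. The class names and the from_pretrained path are assumptions based on the transformers Qwen2.5-Omni integration, not official usage from this page; the quantized checkpoint may additionally require a GPTQ backend (e.g., optimum with gptqmodel), so verify the exact requirements against the model card.

```python
# Minimal loading sketch -- class names and the from_pretrained path are
# assumptions based on the transformers Qwen2.5-Omni integration; check the
# model card for the exact transformers version and GPTQ prerequisites.
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

MODEL_ID = "Qwen/Qwen2.5-Omni-7B-GPTQ-Int4"

# device_map="auto" places the quantized weights on the available GPU(s);
# torch_dtype="auto" keeps the dtypes stored in the checkpoint.
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
```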

Model Capabilities

Text Generation
Image Analysis
Speech Recognition
Speech Synthesis
Video Analysis

Use Cases

Real-time Interaction
Real-time Voice Chat
Supports real-time voice input and output, suitable for applications such as voice assistants; generated speech is natural and robust.
Video Analysis
Supports real-time analysis of and response to video content.
Achieves 72.4% accuracy on the Video-MME benchmark.
Speech Processing
Speech Recognition
Supports high-accuracy speech-to-text transcription.
Achieves a word error rate (WER) of 3.4% on the LibriSpeech test-other set.
Speech Synthesis
Supports generation of natural-sounding speech.
Achieves a WER of 8.7% on the Seed-TTS test-hard set (see the inference sketch below).
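To make the use cases above concrete, here is a hedged end-to-end inference sketch following the usage pattern published on the Qwen2.5-Omni model card: a multimodal conversation goes in, and both decoded text and a speech waveform come out. The process_mm_info helper comes from the qwen_omni_utils package distributed alongside the model, and the speaker argument, system prompt, and 24 kHz output rate are taken from that card; treat all of these as assumptions to verify rather than a fixed API.

```python
# End-to-end sketch, continuing from the loading snippet above. Assumes the
# qwen_omni_utils helper package and soundfile are installed; all names here
# follow the usage published on the Qwen2.5-Omni model card and should be
# verified against the current card.
import soundfile as sf
from qwen_omni_utils import process_mm_info

# The model card specifies a fixed system prompt to enable speech output.
SYSTEM_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating "
    "text and speech."
)

conversation = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": "question.wav"},  # hypothetical local file
        {"type": "text", "text": "Please answer the question in the recording."},
    ]},
]

# Render the chat template, then collect the audio/image/video inputs.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True,
).to(model.device)

# generate() returns token ids for the text reply plus a waveform from the
# Talker; speaker picks one of the built-in voices ("Chelsie" or "Ethan").
text_ids, audio = model.generate(**inputs, speaker="Chelsie")

print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
# The card documents 24 kHz speech output.
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```

Video input and chunked real-time interaction follow the same conversation pattern; the model card documents those variants.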