🚀 Qwen2.5-Omni-7B GGUF Models
Qwen2.5-Omni-7B GGUF models offer a range of quantization methods and formats to suit different hardware capabilities and memory constraints, enabling efficient and high-precision inference across various devices.
🚀 Quick Start
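A minimal sketch of local, text-only inference with one of these quants, assuming the llama-cpp-python bindings are installed and a hypothetical quant filename (check the repository's file list for the actual names). Audio, image, and video input require additional multimodal tooling and are not covered here:

```python
from llama_cpp import Llama

# Hypothetical local path to a downloaded quant; pick the file that fits your hardware.
llm = Llama(
    model_path="./Qwen2.5-Omni-7B-q4_k_m.gguf",
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers to GPU if available; use 0 for CPU-only
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give a one-sentence summary of Qwen2.5-Omni."}]
)
print(out["choices"][0]["message"]["content"])
```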
Model Generation Details
This model was generated using llama.cpp at commit `1f63e75f`.
Quantization beyond the IMatrix
Testing a new quantization method that uses rules to bump important layers above what the standard imatrix would use. The standard IMatrix does not perform very well at low-bit quantization or for MoE models, so llama.cpp's `--tensor-type` option is used to bump selected layers to higher precision. See [Layer bumping with llama.cpp](https://github.com/Mungert69/GGUFModelBuilder/blob/main/model-converter/tensor_list_builder.py). This produces larger model files but increases precision for a given model size.
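As a rough illustration of the idea (not the exact script linked above), the per-tensor overrides could be driven from Python roughly like this, assuming a local llama-quantize build that accepts repeated `--tensor-type NAME=TYPE` overrides and an existing imatrix file:

```python
import subprocess

# Hypothetical per-tensor overrides: bump attention and output tensors to a
# higher-precision quant while the rest of the model uses the base type.
tensor_overrides = {
    "attn_v": "q8_0",
    "attn_k": "q8_0",
    "output": "q8_0",
}

cmd = ["./llama-quantize", "--imatrix", "imatrix.dat"]
for name, qtype in tensor_overrides.items():
    cmd += ["--tensor-type", f"{name}={qtype}"]  # override quant type for matching tensors

# Positional arguments: input model, output model, base quant type.
cmd += ["Qwen2.5-Omni-7B-bf16.gguf", "Qwen2.5-Omni-7B-Q4_K_M-bumped.gguf", "Q4_K_M"]

subprocess.run(cmd, check=True)
```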
💡 Usage Tip
Please provide feedback on how this method performs for you.
Choosing the Right Model Format
Selecting the correct model format depends on your hardware capabilities and memory constraints.
BF16 (Brain Float 16) – Use if BF16 acceleration is available
- A 16-bit floating-point format designed for faster computation while retaining good precision.
- Provides a dynamic range similar to FP32 but with lower memory usage.
- Recommended if your hardware supports BF16 acceleration (check your device's specs, or see the quick check below).
- Ideal for high-performance inference with a reduced memory footprint compared to FP32.
⚠️ Important Note
- Use BF16 if:
  - Your hardware has native BF16 support (e.g., newer GPUs, TPUs).
  - You want higher precision while saving memory.
  - You plan to requantize the model into another format.
- Avoid BF16 if:
  - Your hardware does not support BF16 (it may fall back to FP32 and run slower).
  - You need compatibility with older devices that lack BF16 optimization.
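If you are unsure whether your GPU has native BF16 support, a quick, indicative-only check with PyTorch (assuming torch with CUDA support is installed) looks like this:

```python
import torch

if torch.cuda.is_available():
    # Reports whether the current CUDA device can run BF16 kernels.
    print("BF16 supported:", torch.cuda.is_bf16_supported())
    print("Device:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device detected; BF16 acceleration is unlikely on this setup.")
```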
F16 (Float 16) – More widely supported than BF16
- A 16-bit floating-point format with high precision but a smaller range of values than BF16.
- Works on most devices with FP16 acceleration support (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16 but generally sufficient for inference.
⚠️ Important Note
- Use F16 if:
  - Your hardware supports FP16 but not BF16.
  - You need a balance between speed, memory usage, and accuracy.
  - You are running on a GPU or another device optimized for FP16 computation.
- Avoid F16 if:
  - Your device lacks native FP16 support (it may run slower than expected).
  - You have memory limitations.
Hybrid Precision Models (e.g., `bf16_q8_0`, `f16_q4_K`) – Best of Both Worlds
These formats selectively quantize non-essential layers while keeping key layers in full precision (e.g., attention and output layers).
- Named like `bf16_q8_0` (meaning full-precision BF16 core layers + quantized Q8_0 for the other layers).
- Strike a balance between memory efficiency and accuracy, improving on fully quantized models without requiring the full memory of BF16/F16.
⚠️ Important Note
- Use Hybrid Models if:
  - You need better accuracy than quant-only models but can't afford full BF16/F16 everywhere.
  - Your device supports mixed-precision inference.
  - You want to optimize trade-offs for production-grade models on constrained hardware.
- Avoid Hybrid Models if:
  - Your target device doesn't support mixed- or full-precision acceleration.
  - You are operating under ultra-strict memory limits (in which case use fully quantized formats).
Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- Lower-bit models (Q4_K) – Best for minimal memory usage, but may have lower precision.
- Higher-bit models (Q6_K, Q8_0) – Better accuracy, but require more memory.
⚠️ Important Note
- Use Quantized Models if:
  - You are running inference on a CPU and need an optimized model.
  - Your device has low VRAM and cannot load full-precision models.
  - You want to reduce memory footprint while keeping reasonable accuracy.
- Avoid Quantized Models if:
  - You need maximum accuracy (full-precision models are better for this).
  - Your hardware has enough VRAM for higher-precision formats (BF16/F16).
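As a sketch of fetching just the quant that fits your hardware, assuming the huggingface_hub package and a hypothetical repository and filename (check the actual file list for the exact names):

```python
from huggingface_hub import hf_hub_download

# Hypothetical repo_id and filename; substitute the quant you actually want.
path = hf_hub_download(
    repo_id="Mungert/Qwen2.5-Omni-7B-GGUF",
    filename="Qwen2.5-Omni-7B-q4_k_m.gguf",
)
print("Downloaded to:", path)
```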
Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)
These models are optimized for very high memory efficiency, making them ideal for low-power devices or large-scale deployments where memory is a critical constraint.
- IQ3_XS: Ultra-low-bit quantization (3-bit) with very high memory efficiency.
  - Use case: Best for ultra-low-memory devices where even Q4_K is too large.
  - Trade-off: Lower accuracy compared to higher-bit quantizations.
- IQ3_S: Small block size for maximum memory efficiency.
  - Use case: Best for low-memory devices where IQ3_XS is too aggressive.
- IQ3_M: Medium block size for better accuracy than IQ3_S.
  - Use case: Suitable for low-memory devices where IQ3_S is too limiting.
- Q4_K: 4-bit quantization with block-wise optimization for better accuracy.
  - Use case: Best for low-memory devices where Q6_K is too large.
- Q4_0: Pure 4-bit quantization, optimized for ARM devices.
  - Use case: Best for ARM-based devices or low-memory environments.
Ultra Low-Bit Quantization (IQ1_S, IQ1_M, IQ2_S, IQ2_M, IQ2_XS, IQ2_XXS)
- Ultra-low-bit quantization (1–2 bit) with extreme memory efficiency.
- Use case: Best for cases where you have to fit the model into very constrained memory.
- Trade-off: Very low accuracy. May not function as expected; please test thoroughly before use.
Summary Table: Model Format Selection
| Model Format | Precision |
|---|---|
| BF16 | Very High |
| F16 | High |
| Q4_K | Medium–Low |
| Q6_K | Medium |
| Q8_0 | High |
| IQ3_XS | Low |
| IQ3_S | Low |
| IQ3_M | Low–Medium |
| Q4_0 | Low |
| Ultra Low-Bit (IQ1/2_*) | Very Low |
| Hybrid (e.g., `bf16_q8_0`) | Medium–High |
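As a rough rule of thumb, file size ≈ parameter count × bits-per-weight / 8, ignoring metadata and per-block overhead. A small sketch with assumed, approximate bits-per-weight values illustrates how the formats above map to memory requirements for a ~7B-parameter model:

```python
# Rough size estimates for a ~7B-parameter model; the bits-per-weight values are
# approximations, and real GGUF files also include metadata and block overhead.
PARAMS = 7e9
APPROX_BITS_PER_WEIGHT = {
    "BF16/F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q4_K": 4.5,
    "IQ3_XS": 3.3,
    "IQ2_XS": 2.3,
}

for fmt, bpw in APPROX_BITS_PER_WEIGHT.items():
    size_gb = PARAMS * bpw / 8 / 1e9
    print(f"{fmt:>9}: ~{size_gb:.1f} GB")
```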
✨ Features
Overview
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
Key Features
- Omni and Novel Architecture: We propose the Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We also propose a novel position embedding, named TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio.
- Real-Time Voice and Video Chat: The architecture is designed for fully real-time interactions, supporting chunked input and immediate output.
- Natural and Robust Speech Generation: Surpasses many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness in speech generation.
- Strong Performance Across Modalities: Exhibits exceptional performance across all modalities when benchmarked against similarly sized single-modality models. Qwen2.5-Omni outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves performance comparable to Qwen2.5-VL-7B.
- Excellent End-to-End Speech Instruction Following: Qwen2.5-Omni shows performance in end-to-end speech instruction following that rivals its effectiveness with text inputs, as evidenced by benchmarks such as MMLU and GSM8K.
Model Architecture
Performance
We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates strong performance across all modalities when compared to similarly sized single-modality models such as Qwen2.5-VL-7B and Qwen2-Audio, as well as closed-source models like Gemini-1.5-Pro. In tasks requiring the integration of multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art performance. Furthermore, in single-modality tasks, it excels in areas including speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness).
Multimodality -> Text
OmniBench

| Model | Speech | Sound Event | Music | Avg |
|---|---|---|---|---|
| Gemini-1.5-Pro | 42.67% | 42.26% | 46.23% | 42.91% |
| MIO-Instruct | 36.96% | 33.58% | 11.32% | 33.80% |
| AnyGPT (7B) | 17.77% | 20.75% | 13.21% | 18.04% |
| video-SALMONN | 34.11% | 31.70% | 56.60% | 35.64% |
| UnifiedIO2-xlarge | 39.56% | 36.98% | 29.25% | 38.00% |
| UnifiedIO2-xxlarge | 34.24% | 36.98% | 24.53% | 33.98% |
| MiniCPM-o | - | - | - | 40.50% |
| Baichuan-Omni-1.5 | - | - | - | 42.90% |
| Qwen2.5-Omni-3B | 52.14% | 52.08% | 52.83% | 52.19% |
| Qwen2.5-Omni-7B | 55.25% | 60.00% | 52.83% | 56.13% |
Audio -> Text
ASR

Librispeech

| Model | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|
| SALMONN | - | - | 2.1 | 4.9 |
| SpeechVerse | - | - | 2.1 | 4.4 |
| Whisper-large-v3 | - | - | 1.8 | 3.6 |
| Llama-3-8B | - | - | - | 3.4 |
| Llama-3-70B | - | - | - | 3.1 |
| Seed-ASR-Multilingual | - | - | 1.6 | 2.8 |
| MiniCPM-o | - | - | 1.7 | - |
| MinMo | - | - | 1.7 | 3.9 |
| Qwen-Audio | 1.8 | 4.0 | 2.0 | 4.2 |
| Qwen2-Audio | 1.3 | 3.4 | 1.6 | 3.6 |
| Qwen2.5-Omni-3B | 2.0 | 4.1 | 2.2 | 4.5 |
| Qwen2.5-Omni-7B | 1.6 | 3.5 | 1.8 | 3.4 |

Common Voice 15

| Model | en | zh | yue | fr |
|---|---|---|---|---|
| Whisper-large-v3 | 9.3 | 12.8 | 10.9 | 10.8 |
| MinMo | 7.9 | 6.3 | 6.4 | 8.5 |
| Qwen2-Audio | 8.6 | 6.9 | 5.9 | 9.6 |
| Qwen2.5-Omni-3B | 9.1 | 6.0 | 11.6 | 9.6 |
| Qwen2.5-Omni-7B | 7.6 | 5.2 | 7.3 | 7.5 |

Fleurs

| Model | zh | en |
|---|---|---|
| Whisper-large-v3 | 7.7 | 4.1 |
| Seed-ASR-Multilingual | - | 3.4 |
| Megrez-3B-Omni | 10.8 | - |
| MiniCPM-o | 4.4 | - |
| MinMo | 3.0 | 3.8 |
| Qwen2-Audio | 7.5 | - |
| Qwen2.5-Omni-3B | 3.2 | 5.4 |
| Qwen2.5-Omni-7B | 3.0 | 4.1 |

Wenetspeech

| Model | test-net | test-meeting |
|---|---|---|
| Seed-ASR-Chinese | 4.7 | 5.7 |
| Megrez-3B-Omni | - | 16.4 |
| MiniCPM-o | 6.9 | - |
| MinMo | 6.8 | 7.4 |
| Qwen2.5-Omni-3B | 6.3 | 8.1 |
| Qwen2.5-Omni-7B | 5.9 | 7.7 |

Voxpopuli-V1.0-en

| Model | Performance |
|---|---|
| Llama-3-8B | 6.2 |
| Llama-3-70B | 5.7 |
| Qwen2.5-Omni-3B | 6.6 |
| Qwen2.5-Omni-7B | 5.8 |

S2TT

CoVoST2

| Model | en-de | de-en | en-zh | zh-en |
|---|---|---|---|---|
| SALMONN | 18.6 | - | 33.1 | - |
| SpeechLLaMA | - | 27.1 | - | 12.3 |
| BLSP | 14.1 | - | - | - |
| MiniCPM-o | - | - | 48.2 | 27.2 |
| MinMo | - | 39.9 | 46.7 | 26.0 |
| Qwen-Audio | 25.1 | 33.9 | 41.5 | 15.7 |
| Qwen2-Audio | 29.9 | 35.2 | 45.2 | 24.4 |
| Qwen2.5-Omni-3B | 28.3 | 38.1 | 41.4 | 26.6 |
| Qwen2.5-Omni-7B | 30.2 | 37.7 | 41.4 | 29.4 |

SER

Meld

| Model | Performance |
|---|---|
| WavLM-large | 0.542 |
| MiniCPM-o | 0.524 |
| Qwen-Audio | 0.557 |
| Qwen2-Audio | 0.553 |
| Qwen2.5-Omni-3B | 0.558 |
| Qwen2.5-Omni-7B | 0.570 |

VSC

VocalSound

| Model | Performance |
|---|---|
| CLAP | 0.495 |
| Pengi | 0.604 |
| Qwen-Audio | 0.929 |
| Qwen2-Audio | 0.939 |
| Qwen2.5-Omni-3B | 0.936 |
| Qwen2.5-Omni-7B | 0.939 |

Music

GiantSteps Tempo

| Model | Performance |
|---|---|
| Llark-7B | 0.86 |
| Qwen2.5-Omni-3B | 0.88 |
📄 License
This project is licensed under the Apache 2.0 License.







