
Qwen2.5-Omni-7B-GPTQ-Int4

Developed by Qwen
Qwen2.5-Omni is an end-to-end multimodal model capable of perceiving various modalities such as text, images, audio, and video, and generating text and natural speech responses in a streaming manner.
Downloads 389
Release Time: 5/14/2025

Model Overview

Qwen2.5-Omni is an end-to-end multimodal model designed for real-time interaction, supporting the perception and generation of text, images, audio, and video.

Model Features

Omni-modal and Novel Architecture
Supports perception and generation of text, images, audio, and video, built on the Thinker-Talker architecture with TMRoPE (Time-aligned Multimodal RoPE) positional embeddings.
Real-time Voice and Video Chat
Designed for fully real-time interaction, with chunked streaming input and immediate output.
Natural and Robust Speech Generation
Demonstrates exceptional robustness and naturalness in speech generation, surpassing many existing streaming and non-streaming alternatives.
Strong Cross-modal Performance
Performs strongly across all modalities, remaining competitive with similarly sized single-modal models.
End-to-end Voice Instruction Following
Excels at end-to-end voice instruction following, achieving results comparable to those with text input (see the loading sketch after this list).
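As a concrete starting point, here is a minimal loading sketch. The class names and the from_pretrained path are assumptions based on the transformers Qwen2.5-Omni integration, not official usage from this page; the quantized checkpoint may additionally require a GPTQ backend (e.g., optimum with gptqmodel), so verify the exact requirements against the model card.

```python
# Minimal loading sketch -- class names and the from_pretrained path are
# assumptions based on the transformers Qwen2.5-Omni integration; check the
# model card for the exact transformers version and GPTQ prerequisites.
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

MODEL_ID = "Qwen/Qwen2.5-Omni-7B-GPTQ-Int4"

# device_map="auto" places the quantized weights on the available GPU(s);
# torch_dtype="auto" keeps the dtypes stored in the checkpoint.
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
```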

Model Capabilities

Text Generation
Image Analysis
Speech Recognition
Speech Synthesis
Video Analysis

Use Cases

Real-time Interaction
Real-time Voice Chat
Supports real-time voice input and output, suitable for applications such as voice assistants; generated speech is natural and robust.
Video Analysis
Supports real-time analysis of and response to video content.
Achieves 72.4% accuracy on the Video-MME benchmark.
Speech Processing
Speech Recognition
Supports high-accuracy speech-to-text transcription.
Achieves a word error rate (WER) of 3.4% on the LibriSpeech test-other set.
Speech Synthesis
Supports generation of natural-sounding speech.
Achieves a WER of 8.7% on the Seed-TTS test-hard set (see the inference sketch below).
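To make the use cases above concrete, here is a hedged end-to-end inference sketch following the usage pattern published on the Qwen2.5-Omni model card: a multimodal conversation goes in, and both decoded text and a speech waveform come out. The process_mm_info helper comes from the qwen_omni_utils package distributed alongside the model, and the speaker argument, system prompt, and 24 kHz output rate are taken from that card; treat all of these as assumptions to verify rather than a fixed API.

```python
# End-to-end sketch, continuing from the loading snippet above. Assumes the
# qwen_omni_utils helper package and soundfile are installed; all names here
# follow the usage published on the Qwen2.5-Omni model card and should be
# verified against the current card.
import soundfile as sf
from qwen_omni_utils import process_mm_info

# The model card specifies a fixed system prompt to enable speech output.
SYSTEM_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating "
    "text and speech."
)

conversation = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": "question.wav"},  # hypothetical local file
        {"type": "text", "text": "Please answer the question in the recording."},
    ]},
]

# Render the chat template, then collect the audio/image/video inputs.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True,
).to(model.device)

# generate() returns token ids for the text reply plus a waveform from the
# Talker; speaker picks one of the built-in voices ("Chelsie" or "Ethan").
text_ids, audio = model.generate(**inputs, speaker="Chelsie")

print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
# The card documents 24 kHz speech output.
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```

Video input and chunked real-time interaction follow the same conversation pattern; the model card documents those variants.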