🚀 Qwen2.5-Omni
Qwen2.5-Omni is an end-to-end multimodal model that perceives diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner, providing strong support for multimodal interaction.
🚀 Quick Start
We provide simple examples showing how to use Qwen2.5-Omni with 🤗 Transformers. The Qwen2.5-Omni code has been merged into the latest Hugging Face transformers; we recommend building from source with the following commands:
pip uninstall transformers
pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
pip install accelerate
Otherwise, you may encounter the following error:
KeyError: 'qwen2_5_omni'
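If you want to confirm that the installed build actually registers the model type, here is a minimal sketch; the check below is only one possible way to verify it and is not part of the official instructions:

import transformers
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

# The preview branch installed above registers the "qwen2_5_omni" model type;
# a stock release that predates the preview does not, which is what triggers the KeyError.
print(transformers.__version__)
print("qwen2_5_omni" in CONFIG_MAPPING)  # should print True with the preview build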
We also provide a toolkit that helps you handle various types of audio and visual input more conveniently, as if you were calling an API. It supports base64, URLs, and interleaved audio, images, and videos. You can install it with the following command; make sure ffmpeg is installed on your system:
# It's highly recommended to use the `[decord]` feature for faster video loading
pip install qwen-omni-utils[decord] -U
If you are not on Linux, you may not be able to install decord from PyPI. In that case, you can use pip install qwen-omni-utils -U, which falls back to torchvision for video processing. You can still install decord from source so that decord is used when loading videos.
🤗 Transformers Usage
Here is a code snippet showing how to use the chat model with transformers and qwen_omni_utils:
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
# Default: load the model onto the available device(s)
model = Qwen2_5OmniForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-Omni-3B", torch_dtype="auto", device_map="auto")
# We recommend enabling flash_attention_2 for better acceleration and memory saving
# model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
# "Qwen/Qwen2.5-Omni-3B",
# torch_dtype="auto",
# device_map="auto",
# attn_implementation="flash_attention_2",
# )
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-3B")
conversation = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
],
},
{
"role": "user",
"content": [
{"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"},
],
},
]
# Set whether to use audio in video
USE_AUDIO_IN_VIDEO = True
# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)
# Inference: generate the output text and audio
text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)
sf.write(
"output.wav",
audio.reshape(-1).detach().cpu().numpy(),
samplerate=24000,
)
Minimum GPU memory requirements
Model | Precision | 15s Video | 30s Video | 60s Video |
---|---|---|---|---|
Qwen-Omni-3B | FP32 | 89.10 GB | Not Recommended | Not Recommended |
Qwen-Omni-3B | BF16 | 18.38 GB | 22.43 GB | 28.22 GB |
Qwen-Omni-7B | FP32 | 93.56 GB | Not Recommended | Not Recommended |
Qwen-Omni-7B | BF16 | 31.11 GB | 41.85 GB | 60.19 GB |
Note: The table above presents the theoretical minimum memory requirements for inference with transformers, and the BF16 figures were measured with attn_implementation="flash_attention_2"; in practice, however, actual memory usage is typically at least 1.2 times the theoretical value. For more information, see the linked resource here.
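Since the real footprint depends on your setup, it can help to measure it directly. The following sketch assumes a CUDA device and reuses the model and inputs from the example above to read the peak allocation reported by PyTorch:

import torch

# Reset the peak-memory statistics, run one generation, then read the peak allocation.
torch.cuda.reset_peak_memory_stats()
_ = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"Peak GPU memory: {peak_gb:.2f} GB")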
Video URL resource usage
Video URL compatibility largely depends on the versions of the third-party libraries, as detailed in the table below. If you prefer not to use the default backend, you can change it with FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord; a short example follows the table below.
Backend | HTTP | HTTPS |
---|---|---|
torchvision >= 0.19.0 | ✅ | ✅ |
torchvision < 0.19.0 | ❌ | ❌ |
decord | ✅ | ❌ |
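The backend is selected through an environment variable, so it is safest to set it before any video is loaded. A minimal sketch, forcing torchvision here purely as an example (decord works the same way):

import os

# Set the backend override before qwen_omni_utils processes any video.
os.environ["FORCE_QWENVL_VIDEO_READER"] = "torchvision"

from qwen_omni_utils import process_mm_info  # video loading should honor the override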
Batch Inference
When return_audio=False is set, the model can batch inputs consisting of mixed samples of various types, such as text, images, audio, and video. Here is an example:
# Sample messages for batch inference
# Conversation with video only
conversation1 = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
],
},
{
"role": "user",
"content": [
{"type": "video", "video": "/path/to/video.mp4"},
]
}
]
# Conversation with audio only
conversation2 = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
],
},
{
"role": "user",
"content": [
{"type": "audio", "audio": "/path/to/audio.wav"},
]
}
]
# Conversation with pure text
conversation3 = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
],
},
{
"role": "user",
"content": "who are you?"
}
]
# Conversation with mixed media
conversation4 = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
],
},
{
"role": "user",
"content": [
{"type": "image", "image": "/path/to/image.jpg"},
{"type": "video", "video": "/path/to/video.mp4"},
{"type": "audio", "audio": "/path/to/audio.wav"},
{"type": "text", "text": "What are the elements can you see and hear in these medias?"},
],
}
]
# Combine messages for batch processing
conversations = [conversation1, conversation2, conversation3, conversation4]
# Set whether to use audio in video
USE_AUDIO_IN_VIDEO = True
# Preparation for batch inference
text = processor.apply_chat_template(conversations, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversations, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)
# Batch inference
text_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, return_audio=False)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)
Usage Tips
Prompt for audio output
If you need audio output, the system prompt must be set to "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."; otherwise, the audio output may not work as expected.
{
"role": "system",
"content": [
{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
],
}
Use audio in video
During multimodal interaction, the video provided by the user often comes with audio (for example, a question about the video content, or sounds produced by events in the video). This information helps the model deliver a better interactive experience. We therefore provide the following options for deciding whether to use the audio contained in a video:
# First place: data preprocessing
audios, images, videos = process_mm_info(conversations, use_audio_in_video=True)
# Second place: the model processor
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt",
padding=True, use_audio_in_video=True)
# Third place: model inference
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
Note that during a multi-turn conversation, the use_audio_in_video parameter in these places must be set to the same value; otherwise, unexpected results may occur.
Use audio output or not
The model supports both text and audio outputs. If you do not need audio output, you can call model.disable_talker() after initializing the model. This option saves about ~2GB of GPU memory, but the return_audio option of the generate function can then only be set to False.
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-Omni-3B",
torch_dtype="auto",
device_map="auto"
)
model.disable_talker()
For a more flexible experience, we recommend deciding whether to return audio when calling the generate function. If return_audio is set to False, the model returns only text output, so you get the text response faster.
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-Omni-3B",
torch_dtype="auto",
device_map="auto"
)
...
text_ids = model.generate(**inputs, return_audio=False)
Change the voice type of the output audio
Qwen2.5-Omni supports changing the voice of the output audio. The "Qwen/Qwen2.5-Omni-3B" checkpoint supports the following two voice types:
Voice Type | Gender | Description |
---|---|---|
Chelsie | Female | A sweet, velvety voice with gentle warmth and bright clarity. |
Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe. |
You can specify the voice type with the speaker parameter of the generate function. If speaker is not specified, the default voice type is Chelsie.
text_ids, audio = model.generate(**inputs, speaker="Chelsie")
text_ids, audio = model.generate(**inputs, speaker="Ethan")
Use Flash-Attention 2 to speed up generation
First, make sure you have installed the latest version of Flash Attention 2:
pip install -U flash-attn --no-build-isolation
In addition, you need hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the flash attention repository. FlashAttention-2 can only be used when the model is loaded in torch.float16 or torch.bfloat16.
To load and run the model with FlashAttention-2, add attn_implementation="flash_attention_2" when loading the model:
import torch
from transformers import Qwen2_5OmniForConditionalGeneration
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-Omni-3B",
device_map="auto",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
✨ Key Features
- Omni and novel architecture: We propose the Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We also propose a novel position embedding, named TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio.
- Real-time voice and video chat: The architecture is designed for fully real-time interaction, supporting chunked input and immediate output.
- Natural and robust speech generation: Surpasses many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness in speech generation.
- Strong performance across modalities: Exhibits exceptional performance across all modalities when benchmarked against similarly sized single-modality models. Qwen2.5-Omni outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves performance comparable to Qwen2.5-VL-7B.
- Excellent end-to-end speech instruction following: Qwen2.5-Omni's performance on end-to-end speech instruction following rivals its effectiveness with text inputs, as demonstrated by benchmarks such as MMLU and GSM8K.
Model Architecture
Performance
We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates strong performance across all modalities when compared to similarly sized single-modality models and closed-source models such as Qwen2.5-VL-7B, Qwen2-Audio, and Gemini-1.5-pro. In tasks requiring the integration of multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art performance. Furthermore, in single-modality tasks, it excels in speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness).
Multimodality -> Text

OmniBench

Model | Speech | Sound Event | Music | Avg |
---|---|---|---|---|
Gemini-1.5-Pro | 42.67% | 42.26% | 46.23% | 42.91% |
MIO-Instruct | 36.96% | 33.58% | 11.32% | 33.80% |
AnyGPT (7B) | 17.77% | 20.75% | 13.21% | 18.04% |
video-SALMONN | 34.11% | 31.70% | 56.60% | 35.64% |
UnifiedIO2-xlarge | 39.56% | 36.98% | 29.25% | 38.00% |
UnifiedIO2-xxlarge | 34.24% | 36.98% | 24.53% | 33.98% |
MiniCPM-o | - | - | - | 40.50% |
Baichuan-Omni-1.5 | - | - | - | 42.90% |
Qwen2.5-Omni-3B | 52.14% | 52.08% | 52.83% | 52.19% |
Qwen2.5-Omni-7B | 55.25% | 60.00% | 52.83% | 56.13% |
Audio -> Text

Automatic Speech Recognition (ASR)

Librispeech

Model | dev-clean | dev-other | test-clean | test-other |
---|---|---|---|---|
SALMONN | - | - | 2.1 | 4.9 |
SpeechVerse | - | - | 2.1 | 4.4 |
Whisper-large-v3 | - | - | 1.8 | 3.6 |
Llama-3-8B | - | - | - | 3.4 |
Llama-3-70B | - | - | - | 3.1 |
Seed-ASR-Multilingual | - | - | 1.6 | 2.8 |
MiniCPM-o | - | - | 1.7 | - |
MinMo | - | - | 1.7 | 3.9 |
Qwen-Audio | 1.8 | 4.0 | 2.0 | 4.2 |
Qwen2-Audio | 1.3 | 3.4 | 1.6 | 3.6 |
Qwen2.5-Omni-3B | 2.0 | 4.1 | 2.2 | 4.5 |
Qwen2.5-Omni-7B | 1.6 | 3.5 | 1.8 | 3.4 |
Common Voice 15

Model | en | zh | yue | fr |
---|---|---|---|---|
Whisper-large-v3 | 9.3 | 12.8 | 10.9 | 10.8 |
MinMo | 7.9 | 6.3 | 6.4 | 8.5 |
Qwen2-Audio | 8.6 | 6.9 | 5.9 | 9.6 |
Qwen2.5-Omni-3B | 9.1 | 6.0 | 11.6 | 9.6 |
Qwen2.5-Omni-7B | 7.6 | 5.2 | 7.3 | 7.5 |

Fleurs

Model | zh | en |
---|---|---|
Whisper-large-v3 | 7.7 | 4.1 |
Seed-ASR-Multilingual | - | 3.4 |
Megrez-3B-Omni | 10.8 | - |
MiniCPM-o | 4.4 | - |
MinMo | 3.0 | 3.8 |
Qwen2-Audio | 7.5 | - |
Qwen2.5-Omni-3B | 3.2 | 5.4 |
Qwen2.5-Omni-7B | 3.0 | 4.1 |
Wenetspeech

Model | test-net | test-meeting |
---|---|---|
Seed-ASR-Chinese | 4.7 | 5.7 |
Megrez-3B-Omni | - | 16.4 |
MiniCPM-o | 6.9 | - |
MinMo | 6.8 | 7.4 |
Qwen2.5-Omni-3B | 6.3 | 8.1 |
Qwen2.5-Omni-7B | 5.9 | 7.7 |

Voxpopuli-V1.0-en

Model | Performance |
---|---|
Llama-3-8B | 6.2 |
Llama-3-70B | 5.7 |
Qwen2.5-Omni-3B | 6.6 |
Qwen2.5-Omni-7B | 5.8 |
Speech-to-Text Translation (S2TT)

CoVoST2

Model | en-de | de-en | en-zh | zh-en |
---|---|---|---|---|
SALMONN | 18.6 | - | 33.1 | - |
SpeechLLaMA | - | 27.1 | - | 12.3 |
BLSP | 14.1 | - | - | - |
MiniCPM-o | - | - | 48.2 | 27.2 |
MinMo | - | 39.9 | 46.7 | 26.0 |
Qwen-Audio | 25.1 | 33.9 | 41.5 | 15.7 |
Qwen2-Audio | 29.9 | 35.2 | 45.2 | 24.4 |
Qwen2.5-Omni-3B | 28.3 | 38.1 | 41.4 | 26.6 |
Qwen2.5-Omni-7B | 30.2 | 37.7 | 41.4 | 29.4 |
Speech Emotion Recognition (SER)

Meld

Model | Performance |
---|---|
WavLM-large | 0.542 |
MiniCPM-o | 0.524 |
Qwen-Audio | 0.557 |
Qwen2-Audio | 0.553 |
Qwen2.5-Omni-3B | 0.558 |
Qwen2.5-Omni-7B | 0.570 |

Vocal Sound Classification (VSC)

VocalSound

Model | Performance |
---|---|
CLAP | 0.495 |
Pengi | 0.604 |
Qwen-Audio | 0.929 |
Qwen2-Audio | 0.939 |
Qwen2.5-Omni-3B | 0.936 |
Qwen2.5-Omni-7B | 0.939 |

Music

GiantSteps Tempo

Model | Performance |
---|---|
Llark-7B | 0.86 |
Qwen2.5-Omni-3B | 0.88 |
Qwen2.5-Omni-7B | 0.88 |

MusicCaps

Model | Performance |
---|---|
LP-MusicCaps | 0.291 / 0.149 / 0.089 / 0.061 / 0.129 / 0.130 |
Qwen2.5-Omni-3B | 0.325 / 0.163 / 0.093 / 0.057 / 0.132 / 0.229 |
Qwen2.5-Omni-7B | 0.328 / 0.162 / 0.090 / 0.055 / 0.127 / 0.225 |
Audio Reasoning

MMAU

Model | Sound | Music | Speech | Avg |
---|---|---|---|---|
Gemini-Pro-V1.5 | 56.75 | 49.40 | 58.55 | 54.90 |
Qwen2-Audio | 54.95 | 50.98 | 42.04 | 49.20 |
Qwen2.5-Omni-3B | 70.27 | 60.48 | 59.16 | 63.30 |
Qwen2.5-Omni-7B | 67.87 | 69.16 | 59.76 | 65.60 |
Voice Chatting

VoiceBench

Model | AlpacaEval | CommonEval | SD-QA | MMSU |
---|---|---|---|---|
Ultravox-v0.4.1-LLaMA-3.1-8B | 4.55 | 3.90 | 53.35 | 47.17 |
MERaLiON | 4.50 | 3.77 | 55.06 | 34.95 |
Megrez-3B-Omni | 3.50 | 2.95 | 25.95 | 27.03 |
Lyra-Base | 3.85 | 3.50 | 38.25 | 49.74 |
MiniCPM-o | 4.42 | 4.15 | 50.72 | 54.78 |
Baichuan-Omni-1.5 | 4.50 | 4.05 | 43.40 | 57.25 |
Qwen2-Audio | 3.74 | 3.43 | 35.71 | 35.72 |
Qwen2.5-Omni-3B | 4.32 | 4.00 | 49.37 | 50.23 |
Qwen2.5-Omni-7B | 4.49 | 3.93 | 55.71 | 61.32 |

Model | OpenBookQA | IFEval | AdvBench | Avg |
---|---|---|---|---|
Ultravox-v0.4.1-LLaMA-3.1-8B | 65.27 | 66.88 | 98.46 | 71.45 |
MERaLiON | 27.23 | 62.93 | 94.81 | 62.91 |
Megrez-3B-Omni | 28.35 | 25.71 | 87.69 | 46.25 |
Lyra-Base | 72.75 | 36.28 | 59.62 | 57.66 |
MiniCPM-o | 78.02 | 49.25 | 97.69 | 71.69 |
Baichuan-Omni-1.5 | 74.51 | 54.54 | 97.31 | 71.14 |
Qwen2-Audio | 49.45 | 26.33 | 96.73 | 55.35 |
Qwen2.5-Omni-3B | 74.73 | 42.10 | 98.85 | 68.81 |
Qwen2.5-Omni-7B | 81.10 | 52.87 | 99.42 | 74.12 |
Image -> Text

Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
---|---|---|---|---|---|
MMMUval | 59.2 | 53.1 | 53.9 | 58.6 | 60.0 |
MMMU-Prooverall | 36.6 | 29.7 | - | 38.3 | 37.6 |
MathVistatestmini | 67.9 | 59.4 | 71.9 | 68.2 | 52.5 |
MathVisionfull | 25.0 | 20.8 | 23.1 | 25.1 | - |
MMBench-V1.1-ENtest | 81.8 | 77.8 | 80.5 | 82.6 | 76.0 |
MMVetturbo | 66.8 | 62.1 | 67.5 | 67.1 | 66.9 |
MMStar | 64.0 | 55.7 | 64.0 | 63.9 | 54.8 |
MMEsum | 2340 | 2117 | 2372 | 2347 | 2003 |
MuirBench | 59.2 | 48.0 | - | 59.2 | - |
CRPErelation | 76.5 | 73.7 | - | 76.4 | - |
RealWorldQAavg | 70.3 | 62.6 | 71.9 | 68.5 | - |
MME-RealWorlden | 61.6 | 55.6 | - | 57.4 | - |
MM-MT-Bench | 6.0 | 5.0 | - | 6.3 | - |
AI2D | 83.2 | 79.5 | 85.8 | 83.9 | - |
TextVQAval | 84.4 | 79.8 | 83.2 | 84.9 | - |
DocVQAtest | 95.2 | 93.3 | 93.5 | 95.7 | - |
ChartQAtest Avg | 85.3 | 82.8 | 84.9 | 87.3 | - |
OCRBench_V2en | 57.8 | 51.7 | - | 56.3 | - |
Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro |
---|---|---|---|---|---|
Refcocoval | 90.5 | 88.7 | 90.0 | 90.6 | 73.2 |
RefcocotextA | 93.5 | 91.8 | 92.5 | 93.2 | 72.9 |
RefcocotextB | 86.6 | 84.0 | 85.4 | 88.2 | 74.6 |
Refcoco+val | 85.4 | 81.1 | 84.2 | 88.2 | 62.5 |
Refcoco+textA | 91.0 | 87.5 | 89.1 | 89.0 | 63.9 |
Refcoco+textB | 79.3 | 73.2 | 76.9 | 75.9 | 65.0 |
Refcocog+val | 87.4 | 85.0 | 87.2 | 86.1 | 75.2 |
Refcocog+test | 87.9 | 85.1 | 87.2 | 87.0 | 76.2 |
ODinW | 42.4 | 39.2 | 37.3 | 55.0 | 36.7 |
PointGrounding | 66.5 | 46.2 | 67.3 | - | - |
Video (without audio) -> Text

Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
---|---|---|---|---|---|
Video-MMEw/o sub | 64.3 | 62.0 | 63.9 | 65.1 | 64.8 |
Video-MMEw sub | 72.4 | 68.6 | 67.9 | 71.6 | - |
MVBench | 70.3 | 68.7 | 67.2 | 69.6 | - |
EgoSchematest | 68.6 | 61.4 | 63.2 | 65.0 | - |
Zero-shot Speech Generation

Content Consistency

SEED

Model | test-zh | test-en | test-hard |
---|---|---|---|
Seed-TTS_ICL | 1.11 | 2.24 | 7.58 |
Seed-TTS_RL | 1.00 | 1.94 | 6.42 |
MaskGCT | 2.27 | 2.62 | 10.27 |
E2_TTS | 1.97 | 2.19 | - |
F5-TTS | 1.56 | 1.83 | 8.67 |
CosyVoice 2 | 1.45 | 2.57 | 6.83 |
CosyVoice 2-S | 1.45 | 2.38 | 8.08 |
Qwen2.5-Omni-3B_ICL | 1.95 | 2.87 | 9.92 |
Qwen2.5-Omni-3B_RL | 1.58 | 2.51 | 7.86 |
Qwen2.5-Omni-7B_ICL | 1.70 | 2.72 | 7.97 |
Qwen2.5-Omni-7B_RL | 1.42 | 2.32 | 6.54 |

Speaker Similarity

SEED

Model | test-zh | test-en | test-hard |
---|---|---|---|
Seed-TTS_ICL | 0.796 | 0.762 | 0.776 |
Seed-TTS_RL | 0.801 | 0.766 | 0.782 |
MaskGCT | 0.774 | 0.714 | 0.748 |
E2_TTS | 0.730 | 0.710 | - |
F5-TTS | 0.741 | 0.647 | 0.713 |
CosyVoice 2 | 0.748 | 0.652 | 0.724 |
CosyVoice 2-S | 0.753 | 0.654 | 0.732 |
Qwen2.5-Omni-3B_ICL | 0.741 | 0.635 | 0.748 |
Qwen2.5-Omni-3B_RL | 0.744 | 0.635 | 0.746 |
Qwen2.5-Omni-7B_ICL | 0.752 | 0.632 | 0.747 |
Qwen2.5-Omni-7B_RL | 0.754 | 0.641 | 0.752 |
Text -> Text

Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-7B | Qwen2.5-3B | Qwen2-7B | Llama3.1-8B | Gemma2-9B |
---|---|---|---|---|---|---|---|
MMLU-Pro | 47.0 | 40.4 | 56.3 | 43.7 | 44.1 | 48.3 | 52.1 |
MMLU-redux | 71.0 | 60.9 | 75.4 | 64.4 | 67.3 | 67.2 | 72.8 |
LiveBench0831 | 29.6 | 22.3 | 35.9 | 26.8 | 29.2 | 26.7 | 30.6 |
GPQA | 30.8 | 34.3 | 36.4 | 30.3 | 34.3 | 32.8 | 32.8 |
MATH | 71.5 | 63.6 | 75.5 | 65.9 | 52.9 | 51.9 | 44.3 |
GSM8K | 88.7 | 82.6 | 91.6 | 86.7 | 85.7 | 84.5 | 76.7 |
HumanEval | 78.7 | 70.7 | 84.8 | 74.4 | 79.9 | 72.6 | 68.9 |
MBPP | 73.2 | 70.4 | 79.2 | 72.7 | 67.2 | 69.6 | 74.9 |
MultiPL-E | 65.8 | 57.6 | 70.4 | 60.2 | 59.1 | 50.7 | 53.4 |
LiveCodeBench2305-2409 | 24.6 | 16.5 | 28.7 | 19.9 | 23.9 | 8.3 | 18.9 |
📄 License
This project is licensed under the qwen-research license. See the LICENSE file for details.
Citation
If you find our paper and code useful in your research, please consider giving us a star :star: and citing our work :pencil: :)
@article{Qwen2.5-Omni,
title={Qwen2.5-Omni Technical Report},
author={Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, Junyang Lin},
journal={arXiv preprint arXiv:2503.20215},
year={2025}
}