Qwen2-Audio: an open-source large audio-language model that accepts multiple audio inputs and performs audio analysis and text generation

Qwen 2 Audio Instruct Dynamic Fp8

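Developed by mlinmg

Qwen2-Audio is the latest release in the Qwen large audio-language model series. It accepts a variety of audio signals as input and, following spoken instructions, either performs audio analysis or responds directly in text.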

Audio-to-Text

Transformers

Language: English | License: Apache-2.0 | #MultimodalAudioUnderstanding #VoiceInteractionAssistant #AudioTextConversion

Downloads: 24

Released: 4/24/2025

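Model Overview

Qwen2-Audio supports two interaction modes, voice chat and audio analysis. It takes audio as input and produces text responses, which makes it suitable for a range of audio-understanding tasks.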

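Model Features

Multiple interaction modes

Supports two interaction modes, voice chat and audio analysis. Users can interact with the model through spoken or text instructions.

Audio understanding

Processes and analyzes a variety of audio inputs, including speech and ambient sound.

Text generation

Generates natural-language text responses from audio input, suited to dialogue and question answering.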

Model Capabilities

Audio understanding

Text generation

Voice interaction

Audio analysis

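Use Cases

Voice interaction

Voice chat

Users can talk to the model freely by voice, without typing any text.

The model generates a natural-language text response.

Audio analysis

Audio content understanding

The user provides audio together with a text instruction, and the model analyzes the audio and generates a response.

The model identifies the audio content and generates a description.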

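🚀 Qwen/Qwen2-Audio-7B-Instruct-FP8

Qwen2-Audio is a new series of large audio-language models. It accepts a variety of audio signals as input and, following spoken instructions, performs audio analysis or responds directly in text. The project offers two distinct audio interaction modes, voice chat and audio analysis, and has released the pretrained model Qwen2-Audio-7B and the chat model Qwen2-Audio-7B-Instruct.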

🚀 Quick Start

To launch the model with vLLM, run the following command:

vllm serve mlinmg/Qwen-2-Audio-Instruct-dynamic-fp8
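
Once the server is up, it can be queried over HTTP. The following is a minimal client sketch, not the author's method. It assumes that the server listens on `http://localhost:8000`, that it exposes an OpenAI-style `/v1/chat/completions` endpoint, and that it accepts `audio_url` content parts for this model. None of these details are stated above. `build_chat_payload` and `post_chat` are hypothetical helper names.

```python
import json
from urllib.request import Request, urlopen

MODEL_ID = "mlinmg/Qwen-2-Audio-Instruct-dynamic-fp8"


def build_chat_payload(audio_url, prompt=None, model=MODEL_ID, max_tokens=256):
    """Build an OpenAI-style chat request with one audio part and optional text."""
    # Assumption: the server understands {"type": "audio_url", ...} content parts.
    content = [{"type": "audio_url", "audio_url": {"url": audio_url}}]
    if prompt is not None:
        content.append({"type": "text", "text": prompt})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


def post_chat(payload, base_url="http://localhost:8000"):
    """Send the payload to the (assumed) chat completions endpoint and return the reply text."""
    request = Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a running server, `post_chat(build_chat_payload(audio_url, "What's that sound?"))` would return the model's text reply. Omitting the prompt sends audio only, which corresponds to voice chat.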

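✨ Key Features

Qwen2-Audio offers two audio interaction modes:

Voice chat: users can talk to Qwen2-Audio freely by voice, without typing any text.
Audio analysis: users can supply audio together with a text instruction for analysis during the interaction.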
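
In terms of the message format used in the usage examples, the two modes differ only in whether a text part accompanies the audio. A minimal sketch, where `make_user_turn` is a hypothetical helper and the `example.com` URLs are placeholders:

```python
def make_user_turn(audio_url, text=None):
    """Return one user message: audio only (voice chat) or audio plus text (audio analysis)."""
    content = [{"type": "audio", "audio_url": audio_url}]
    if text is not None:
        content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


# Voice chat: the spoken audio is the whole instruction.
voice_chat_turn = make_user_turn("https://example.com/question.wav")

# Audio analysis: the audio is accompanied by a text instruction.
audio_analysis_turn = make_user_turn("https://example.com/sound.wav", "What's that sound?")
```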

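📦 Installation

The Qwen2-Audio code has been integrated into the latest Hugging Face transformers. We recommend building from source with the following command; otherwise you may encounter the error KeyError: 'qwen2-audio':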

pip install git+https://github.com/huggingface/transformers
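
Before loading the model, it can be useful to check whether the installed transformers already includes Qwen2-Audio support. The probe below is a sketch only: `has_qwen2_audio` is a hypothetical helper name, and it tests nothing beyond whether the model class can be imported.

```python
def has_qwen2_audio():
    """Return True if the Qwen2-Audio model class can be imported from transformers."""
    try:
        from transformers import Qwen2AudioForConditionalGeneration  # noqa: F401
    except Exception:
        # Any import failure (package missing, version too old, broken backend)
        # is treated as "not supported".
        return False
    return True


print("Qwen2-Audio support available:", has_qwen2_audio())
```

If this prints `False`, installing transformers from source as described above is the likely remedy.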

💻 Usage Examples

Basic Usage

Voice Chat Inference

In voice chat mode, users can talk to Qwen2-Audio freely by voice, without typing any text:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele['audio_url']).read()), 
                    sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move every tensor in the batch to the model's device, not only input_ids.
inputs = inputs.to(model.device)

# max_new_tokens limits only the generated tokens, regardless of prompt length.
generate_ids = model.generate(**inputs, max_new_tokens=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
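
The nested loop that gathers audio in the example above can be factored into a small helper, so that URL collection can be tested without any network access or model. This is a sketch: `collect_audio_urls` is a hypothetical name and the `example.com` URLs are placeholders.

```python
def collect_audio_urls(conversation):
    """Return the audio URLs of a conversation in the order they appear."""
    urls = []
    for message in conversation:
        content = message["content"]
        if not isinstance(content, list):
            continue  # plain-string turns (e.g. assistant replies) carry no audio
        for part in content:
            if part.get("type") == "audio":
                urls.append(part["audio_url"])
    return urls


example = [
    {"role": "user", "content": [{"type": "audio", "audio_url": "https://example.com/first.wav"}]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [{"type": "audio", "audio_url": "https://example.com/second.wav"}]},
]
audio_urls = collect_audio_urls(example)
```

Each collected URL can then be downloaded and decoded with `librosa.load`, as in the example above. The order of the list matters because the audio clips have to line up with the audio entries in the conversation.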

Audio Analysis Inference

In audio analysis mode, users can supply audio together with a text instruction for analysis:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {'role': 'system', 'content': 'You are a helpful assistant.'}, 
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele['audio_url']).read()), 
                        sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move every tensor in the batch to the model's device, not only input_ids.
inputs = inputs.to(model.device)

# max_new_tokens limits only the generated tokens, regardless of prompt length.
generate_ids = model.generate(**inputs, max_new_tokens=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
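
The example above passes the earlier turns back in as history. To continue such a dialogue, the decoded response can be appended as an assistant turn, followed by the next user turn. A minimal sketch, assuming the same message format; `extend_conversation` is a hypothetical helper and the `example.com` URL is a placeholder:

```python
import copy


def extend_conversation(history, assistant_reply, audio_url=None, text=None):
    """Return a new conversation with the model's reply and the next user turn appended."""
    if audio_url is None and text is None:
        raise ValueError("the next user turn needs audio, text, or both")
    content = []
    if audio_url is not None:
        content.append({"type": "audio", "audio_url": audio_url})
    if text is not None:
        content.append({"type": "text", "text": text})
    updated = copy.deepcopy(history)  # leave the caller's history untouched
    updated.append({"role": "assistant", "content": assistant_reply})
    updated.append({"role": "user", "content": content})
    return updated


history = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://example.com/sound.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
]
next_conversation = extend_conversation(
    history,
    "It is the sound of glass shattering.",
    text="What can you do when you hear that?",
)
```

The returned list can be passed to `processor.apply_chat_template` in the same way as `conversation` in the example above.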

Advanced Usage

Batch Inference

Batch inference is also supported:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation1 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/f2641_0_throatclearing.wav"},
        {"type": "text", "text": "What can you hear?"},
    ]}
]

conversation2 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]

conversations = [conversation1, conversation2]

text = [processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False) for conversation in conversations]

audios = []
for conversation in conversations:
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audios.append(
                        librosa.load(
                            BytesIO(urlopen(ele['audio_url']).read()), 
                            sr=processor.feature_extractor.sampling_rate)[0]
                    )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move every tensor in the batch to the model's device, not only input_ids.
inputs = inputs.to(model.device)

# max_new_tokens limits only the generated tokens, regardless of prompt length.
generate_ids = model.generate(**inputs, max_new_tokens=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
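
In the batch example, the audio clips from all conversations are gathered into one flat list, in conversation order. The sketch below reproduces that flattening and also records how many clips each conversation contributes, as a sanity check before calling the processor. `flatten_batch_audio` is a hypothetical helper and the `example.com` URLs are placeholders.

```python
def flatten_batch_audio(conversations):
    """Return (flat list of audio URLs, number of audio clips per conversation)."""
    flat_urls, counts = [], []
    for conversation in conversations:
        count = 0
        for message in conversation:
            content = message["content"]
            if isinstance(content, list):
                for part in content:
                    if part.get("type") == "audio":
                        flat_urls.append(part["audio_url"])
                        count += 1
        counts.append(count)
    return flat_urls, counts


batch = [
    [
        {"role": "user", "content": [
            {"type": "audio", "audio_url": "https://example.com/a.mp3"},
            {"type": "text", "text": "What's that sound?"},
        ]},
        {"role": "assistant", "content": "It is the sound of glass shattering."},
        {"role": "user", "content": [
            {"type": "audio", "audio_url": "https://example.com/b.wav"},
            {"type": "text", "text": "What can you hear?"},
        ]},
    ],
    [
        {"role": "user", "content": [
            {"type": "audio", "audio_url": "https://example.com/c.flac"},
            {"type": "text", "text": "What does the person say?"},
        ]},
    ],
]
flat_urls, counts = flatten_batch_audio(batch)
print(counts)  # [2, 1]
```

The counts should sum to the length of the flat list, which is a cheap check that no clip was dropped or duplicated while building the batch.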

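📚 Documentation

For more details, please refer to our blog, GitHub repository, and technical report.

📄 License

This project is released under the Apache-2.0 license.

📖 Citation

If you find our work helpful, please cite the following: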

@article{Qwen2-Audio,
  title={Qwen2-Audio Technical Report},
  author={Chu, Yunfei and Xu, Jin and Yang, Qian and Wei, Haojie and Wei, Xipin and Guo, Zhifang and Leng, Yichong and Lv, Yuanjun and He, Jinzheng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2407.10759},
  year={2024}
}

@article{Qwen-Audio,
  title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
  author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2311.07919},
  year={2023}
}