Qwen2-Audio: an open-source large audio-language model that accepts multiple audio inputs and performs audio analysis and text generation

Qwen 2 Audio Instruct Dynamic Fp8

Developed by mlinmg

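Qwen2-Audio is the latest version of the Qwen large audio-language model series. It can process multiple audio signal inputs and either perform audio analysis or generate text responses directly from spoken instructions.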

Audio-to-text generation

Transformers

Language: English. Open-source license: Apache-2.0. #Multimodal audio understanding #Voice interaction assistant #Speech-to-text

Downloads: 24

Release date: 4/24/2025

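Model Overview

Qwen2-Audio supports two interaction modes, voice chat and audio analysis. It processes audio input and generates text responses, which makes it suitable for a wide range of audio understanding tasks.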

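Model Features

Multi-mode interaction

Supports two interaction modes, voice chat and audio analysis. Users can interact with the model through either spoken or text instructions.

Audio understanding

Processes, understands, and analyzes a variety of audio signals, including speech and environmental sounds.

Text generation

Generates natural-language text responses from audio input, which suits dialogue and question-answering scenarios.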

Model Capabilities

Audio understanding

Text generation

Voice interaction

Audio analysis

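Use Cases

Voice interaction

Voice chat

Users can talk to the model freely by voice, with no text input required.

Generates natural-language text responses

Audio analysis

Audio content understanding

The user provides audio together with a text instruction, and the model analyzes the audio and generates a response.

Identifies the audio content and generates a description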

🚀 Qwen/Qwen2-Audio-7B-Instruct-FP8

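Qwen2-Audio is the new series of Qwen large audio-language models. It accepts a variety of audio signal inputs and can perform audio analysis or respond directly in text to spoken instructions. It offers two distinct audio interaction modes:

Voice chat: users can freely hold voice conversations with Qwen2-Audio without any text input.
Audio analysis: users can provide audio together with text instructions during the conversation for analysis.

Two models have been released: Qwen2-Audio-7B, the pretrained model, and Qwen2-Audio-7B-Instruct, the chat model.

For more details, see the blog, GitHub, and report.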

🚀 Quick Start

Launching with vLLM

vllm serve mlinmg/Qwen-2-Audio-Instruct-dynamic-fp8
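Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below is a minimal example under stated assumptions: the server listens on the default http://localhost:8000, the served model name matches the one passed to vllm serve, and the vLLM build in use accepts audio_url content parts for this model. The helper names build_request and ask are hypothetical and are not part of vLLM.

```python
import json
from urllib.request import Request, urlopen

# Assumed values: adjust them to match the local deployment.
BASE_URL = "http://localhost:8000/v1"
MODEL = "mlinmg/Qwen-2-Audio-Instruct-dynamic-fp8"


def build_request(audio_url, question, max_tokens=256):
    """Build an OpenAI-style chat completion payload with one audio clip."""
    return {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


def ask(audio_url, question):
    """POST the payload to the running server and return the reply text."""
    body = json.dumps(build_request(audio_url, question)).encode("utf-8")
    req = Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Example call (requires the server to be up):
# print(ask("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3",
#           "What's that sound?"))
```

Only the standard library is used here, so the same payload can be sent from any OpenAI-compatible client if one is already installed.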

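Running Inference

The following shows how to run inference with Qwen2-Audio-7B-Instruct, which supports both voice chat and audio analysis modes. Conversations use the ChatML format, and this demo shows how to build the prompt with apply_chat_template.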
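To make the ChatML layout concrete, the sketch below renders a conversation by hand. It is a simplified illustration only: the real prompt comes from the model's own chat template through apply_chat_template, which handles audio parts with model-specific tokens. The render_chatml function and the `<audio>` placeholder are hypothetical and are not what the processor emits.

```python
AUDIO_PLACEHOLDER = "<audio>"  # hypothetical marker, not the model's real audio token


def render_chatml(conversation, add_generation_prompt=True):
    """Render a conversation in simplified ChatML (illustration only)."""
    parts = []
    for message in conversation:
        content = message["content"]
        if isinstance(content, str):
            body = content
        else:
            # List content: audio parts become a placeholder, text parts are kept.
            body = "".join(
                AUDIO_PLACEHOLDER if ele["type"] == "audio" else ele["text"]
                for ele in content
            )
        parts.append(f"<|im_start|>{message['role']}\n{body}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://example.com/clip.wav"},
        {"type": "text", "text": "What's that sound?"},
    ]},
]
print(render_chatml(conversation))
```

Each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers, and add_generation_prompt=True leaves an open assistant turn at the end for the model to complete.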

Basic Usage

Voice Chat Inference

In voice chat mode, users can freely hold voice conversations with Qwen2-Audio without any text input.

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele['audio_url']).read()), 
                    sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move every input tensor (not only input_ids) to the model's device.
inputs = inputs.to(model.device)

# max_new_tokens bounds only the reply; max_length would also count the prompt tokens.
generate_ids = model.generate(**inputs, max_new_tokens=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

Audio Analysis Inference

In audio analysis mode, users provide both audio and text instructions for analysis.

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {'role': 'system', 'content': 'You are a helpful assistant.'}, 
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele['audio_url']).read()), 
                        sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move every input tensor (not only input_ids) to the model's device.
inputs = inputs.to(model.device)

# max_new_tokens bounds only the reply; max_length would also count the prompt tokens.
generate_ids = model.generate(**inputs, max_new_tokens=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
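The audio-gathering loop in the examples above can be factored into small helpers. The sketch below is one possible refactoring; the function names collect_audio_urls and load_audios are hypothetical. It assumes, as the examples do, that audio parts carry an "audio_url" key and that plain-string turns contain no audio.

```python
def collect_audio_urls(conversation):
    """Return the audio URLs in a conversation, in the order they appear."""
    urls = []
    for message in conversation:
        content = message["content"]
        if not isinstance(content, list):
            continue  # plain-string turns (system, assistant) carry no audio
        for ele in content:
            if ele["type"] == "audio":
                urls.append(ele["audio_url"])
    return urls


def load_audios(conversation, sampling_rate):
    """Download and decode every clip at the given sampling rate."""
    # Imported lazily so collect_audio_urls works without these dependencies.
    from io import BytesIO
    from urllib.request import urlopen
    import librosa

    return [
        librosa.load(BytesIO(urlopen(url).read()), sr=sampling_rate)[0]
        for url in collect_audio_urls(conversation)
    ]
```

With these helpers, the loop reduces to audios = load_audios(conversation, processor.feature_extractor.sampling_rate). The order of the returned clips follows the order of the audio parts in the conversation, which is the order the original loop produces.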

Advanced Usage

Batch Inference

Batch inference is also supported.

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation1 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/f2641_0_throatclearing.wav"},
        {"type": "text", "text": "What can you hear?"},
    ]}
]

conversation2 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]

conversations = [conversation1, conversation2]

text = [processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False) for conversation in conversations]

audios = []
for conversation in conversations:
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audios.append(
                        librosa.load(
                            BytesIO(urlopen(ele['audio_url']).read()), 
                            sr=processor.feature_extractor.sampling_rate)[0]
                    )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move every input tensor (not only input_ids) to the model's device.
inputs = inputs.to(model.device)

# max_new_tokens bounds only the replies; max_length would also count the prompt tokens.
generate_ids = model.generate(**inputs, max_new_tokens=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

📦 Installation

The Qwen2-Audio code is included in the latest Hugging Face transformers. Building from source is recommended; install it with the following command.

pip install git+https://github.com/huggingface/transformers

Otherwise, you may encounter the following error.

KeyError: 'qwen2-audio'
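Before loading the model, it can help to confirm that the installed transformers build actually exposes the Qwen2-Audio class. The check below is a minimal sketch; the helper name has_qwen2_audio is hypothetical, and it tests only for the class used in the examples above rather than for a specific version number.

```python
import importlib


def has_qwen2_audio():
    """Return True if the installed transformers exposes the Qwen2-Audio class."""
    try:
        transformers = importlib.import_module("transformers")
    except ImportError:
        return False  # transformers is not installed at all
    try:
        getattr(transformers, "Qwen2AudioForConditionalGeneration")
    except Exception:
        # Missing attribute (older release) or a failed lazy import.
        return False
    return True


if not has_qwen2_audio():
    print("Qwen2-Audio support not found; install transformers from source.")
```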

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

If you find this project helpful, please cite it as follows.

@article{Qwen2-Audio,
  title={Qwen2-Audio Technical Report},
  author={Chu, Yunfei and Xu, Jin and Yang, Qian and Wei, Haojie and Wei, Xipin and Guo, Zhifang and Leng, Yichong and Lv, Yuanjun and He, Jinzheng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2407.10759},
  year={2024}
}

@article{Qwen-Audio,
  title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
  author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2311.07919},
  year={2023}
}