SeaLLMs-Audio-7B: an open-source audio-language model supporting five languages for audio analysis and voice interaction

Seallms Audio 7B

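Developed by SeaLLMs

SeaLLMs-Audio is a large audio-language model built for Southeast Asia. It supports five languages (Indonesian, Thai, Vietnamese, English, and Chinese) and handles audio analysis and voice interaction.

Audio-to-text

Safetensors

Multilingual | License: other | #Southeast Asian multilingual audio processing #Mixed audio-text input #Unified speech tasks

Downloads: 539

Release date: 3/13/2025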

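Model Overview

SeaLLMs-Audio is the multimodal (audio) extension of SeaLLMs, the family of large language models for Southeast Asian languages. It is the first large audio-language model (LALM) to support multiple Southeast Asian languages, covering Indonesian, Thai, and Vietnamese in addition to English and Chinese.

Model Features

Multilingual support

Mainly covers five languages: Indonesian, Thai, Vietnamese, English, and Chinese

Multimodal input

Accepts audio only, text only, or a mix of audio and text

Multi-task processing

Covers analysis tasks such as audio captioning, speech recognition, speech translation, emotion recognition, speech question answering, and speech summarization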

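Model Capabilities

Audio captioning

Speech recognition

Speech translation

Emotion recognition

Speech question answering

Speech summarization

Factual question answering

Math calculation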

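Use Cases

Voice interaction

Multilingual voice assistant

Supports voice interaction in several Southeast Asian languages

Provides a fluent spoken-dialogue experience

Audio analysis

Speech transcription

Transcribes speech in Southeast Asian languages into text

High-accuracy speech recognition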

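🚀 SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia

SeaLLMs-Audio is the multimodal (audio) extension of the SeaLLMs series. It supports several Southeast Asian languages and handles a range of audio-related tasks.

🚀 Quick Start

Our model is available on Hugging Face, and you can use it with either the `transformers` library or the `vllm` library. Below are some examples to get started.

Getting started with the `transformers` library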

from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
import librosa
import os

model = Qwen2AudioForConditionalGeneration.from_pretrained("SeaLLMs/SeaLLMs-Audio-7B", device_map="auto")
processor = AutoProcessor.from_pretrained("SeaLLMs/SeaLLMs-Audio-7B")

def response_to_audio(conversation, model=None, processor=None):
    # Render the conversation into a prompt string; audio is passed separately below.
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    # Load every audio clip referenced in the conversation at the model's sampling rate.
    sampling_rate = processor.feature_extractor.sampling_rate
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio" and ele.get("audio_url") is not None:
                    audios.append(librosa.load(ele["audio_url"], sr=sampling_rate)[0])
    if audios:
        inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True, sampling_rate=sampling_rate)
    else:
        inputs = processor(text=text, return_tensors="pt", padding=True)
    # Move all tensors to the device the model was loaded on.
    inputs = {k: v.to(model.device) for k, v in inputs.items() if v is not None}
    # Greedy decoding: with do_sample=False, no temperature is needed.
    generate_ids = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
    # Drop the prompt tokens and keep only the newly generated ones.
    generate_ids = generate_ids[:, inputs["input_ids"].size(1):]
    response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    return response

# Voice chat
os.system("wget -O fact_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/fact_en.wav")
os.system("wget -O general_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/general_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "fact_en.wav"},
    ]},
    {"role": "assistant", "content": "The most abundant gas in Earth's atmosphere is nitrogen. It makes up about 78 percent of the atmosphere by volume."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "general_en.wav"},
    ]},
]

response = response_to_audio(conversation, model=model, processor=processor)
print(response)

# Audio analysis
os.system("wget -O ASR_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/ASR_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "ASR_en.wav"},
        {"type": "text", "text": "Please write down what is spoken in the audio file."},
    ]},
]

response = response_to_audio(conversation, model=model, processor=processor)
print(response)

Inference with the `vllm` library

from vllm import LLM, SamplingParams
import librosa, os
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("SeaLLMs/SeaLLMs-Audio-7B")
llm = LLM(
    model="SeaLLMs/SeaLLMs-Audio-7B", trust_remote_code=True, gpu_memory_utilization=0.5,  
    enforce_eager=True,  device = "cuda",
    limit_mm_per_prompt={"audio": 5},
)

def response_to_audio(conversation, model=None, processor=None, temperature=0.1, repetition_penalty=1.1, top_p=0.9, max_new_tokens=4096):
    # Render the conversation into a prompt string; audio is passed separately below.
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    # Load every audio clip referenced in the conversation at the model's sampling rate.
    sampling_rate = processor.feature_extractor.sampling_rate
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio" and ele.get("audio_url") is not None:
                    audios.append(librosa.load(ele["audio_url"], sr=sampling_rate)[0])

    sampling_params = SamplingParams(
        temperature=temperature, max_tokens=max_new_tokens, repetition_penalty=repetition_penalty, top_p=top_p, top_k=20,
        stop_token_ids=[],
    )

    # Named `request` to avoid shadowing the built-in `input`.
    request = {"prompt": text}
    if audios:
        # Each audio item is passed as a (waveform, sampling_rate) tuple.
        request["multi_modal_data"] = {"audio": [(audio, sampling_rate) for audio in audios]}

    output = model.generate([request], sampling_params=sampling_params)[0]
    response = output.outputs[0].text
    return response

# Voice chat
os.system("wget -O fact_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/fact_en.wav")
os.system("wget -O general_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/general_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "fact_en.wav"},
    ]},
    {"role": "assistant", "content": "The most abundant gas in Earth's atmosphere is nitrogen. It makes up about 78 percent of the atmosphere by volume."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "general_en.wav"},
    ]},
]

response = response_to_audio(conversation, model=llm, processor=processor)
print(response)

# Audio analysis
os.system("wget -O ASR_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/ASR_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "ASR_en.wav"},
        {"type": "text", "text": "Please write down what is spoken in the audio file."},
    ]},
]

response = response_to_audio(conversation, model=llm, processor=processor)
print(response)
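
The vLLM setup above passes `limit_mm_per_prompt={"audio": 5}`, which is read here as a cap on the number of audio clips in a single prompt. Under that assumption, the sketch below counts the audio elements in a conversation so that an over-long history can be caught before calling `generate`. The names `MAX_AUDIO_PER_PROMPT`, `count_audio_clips`, and `check_audio_limit` are hypothetical helpers introduced for illustration, not part of vLLM or the model's API.

```python
MAX_AUDIO_PER_PROMPT = 5  # assumed to mirror limit_mm_per_prompt={"audio": 5}


def count_audio_clips(conversation):
    """Count audio elements that carry a non-empty audio_url (hypothetical helper)."""
    total = 0
    for message in conversation:
        content = message["content"]
        # Assistant turns in the examples are plain strings and hold no audio.
        if isinstance(content, list):
            total += sum(
                1
                for ele in content
                if ele.get("type") == "audio" and ele.get("audio_url") is not None
            )
    return total


def check_audio_limit(conversation, limit=MAX_AUDIO_PER_PROMPT):
    """Return the clip count, or raise if it exceeds the configured limit."""
    n = count_audio_clips(conversation)
    if n > limit:
        raise ValueError(f"conversation has {n} audio clips; limit is {limit}")
    return n


# Same shape as the voice-chat conversation used in the examples.
example = [
    {"role": "user", "content": [{"type": "audio", "audio_url": "fact_en.wav"}]},
    {"role": "assistant", "content": "The most abundant gas in Earth's atmosphere is nitrogen."},
    {"role": "user", "content": [{"type": "audio", "audio_url": "general_en.wav"}]},
]
print(check_audio_limit(example))  # 2
```

If the limit passed to `LLM(...)` is changed, the constant in this sketch would need to change with it.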

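✨ Key Features

Multilingual: the model mainly supports five languages: 🇮🇩 Indonesian, 🇹🇭 Thai, 🇻🇳 Vietnamese, 🇬🇧 English, and 🇨🇳 Chinese.
Multimodal input: the model accepts flexible input formats: audio only, text only, or audio combined with text.
Multi-task: the model supports audio analysis tasks, including audio captioning, automatic speech recognition, speech-to-text translation, speech emotion recognition, speech question answering, and speech summarization. It also handles voice chat tasks, including answering factual, mathematical, and other general questions.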
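
As a rough illustration of the three input forms, the sketch below builds conversations in the message layout used by the Quick Start examples. Only the `role`/`content` structure and the `audio`/`text` element types come from those examples; `make_user_turn` is a hypothetical helper, and the text-only question is a made-up prompt.

```python
def make_user_turn(audio_path=None, text=None):
    """Build one user message holding audio, text, or both (hypothetical helper)."""
    if audio_path is None and text is None:
        raise ValueError("provide audio_path, text, or both")
    content = []
    if audio_path is not None:
        content.append({"type": "audio", "audio_url": audio_path})
    if text is not None:
        content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


# Audio only: the spoken question itself is the prompt (voice chat).
audio_only = [make_user_turn(audio_path="general_en.wav")]

# Text only: an illustrative question, not taken from the model card.
text_only = [make_user_turn(text="What is the capital of Thailand?")]

# Audio plus text: the text instructs the model what to do with the audio.
audio_and_text = [
    make_user_turn(
        audio_path="ASR_en.wav",
        text="Please write down what is spoken in the audio file.",
    )
]
```

Each of these lists could then be passed as `conversation` to the `response_to_audio` function from the Quick Start section.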

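📦 Installation

The documentation does not give specific installation steps; see the Quick Start section above for how to use the model.

💻 Usage Examples

Basic usage

The code examples in the Quick Start section above show how to run voice chat and audio analysis with the `transformers` and `vllm` libraries. This is the basic way to use the model.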

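🔧 Technical Details

SeaLLMs-Audio is built on Qwen2-Audio-7B and Qwen2.5-7B-Instruct. We replaced the large language model (LLM) module in Qwen2-Audio-7B with Qwen2.5-7B-Instruct, then performed full-parameter fine-tuning on a large-scale audio dataset. The dataset contains 1.58 million multi-task conversations, 93% of which are single-turn. The tasks fall roughly into the following categories: automatic speech recognition (ASR), audio captioning (AC), speech-to-text translation (S2TT), question answering (QA), speech summarization (SS), speech question answering (SQA), chat, math, and fact and mixed tasks (mixed).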

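The distribution of the data across languages and tasks is shown in the following figures:

[Figure: Distribution of SeaLLMs-Audio training data across languages]

[Figure: Distribution of SeaLLMs-Audio training data across tasks]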

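The training data comes from multiple sources, including public datasets and in-house data. The public datasets include gigaspeech, gigaspeech2, common voice, AudioCaps, VoiceAssistant-400K, YODAS2, and Multitask-National-Speech-Corpus. We thank the authors of these datasets for their contributions to the community!

We trained the model on the dataset for 1 epoch, which took about 6 days on 32 A800 GPUs.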
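
For a sense of scale, the figures quoted in this section can be turned into approximate counts. The sketch below is back-of-the-envelope arithmetic only: the inputs are the rounded numbers stated above (1.58 million conversations, 93% single-turn, 32 GPUs, about 6 days), so the outputs are estimates, not reported results.

```python
# Rounded figures taken from the text above.
total_conversations = 1_580_000
single_turn_percent = 93
num_gpus = 32
training_days = 6  # "about 6 days", so GPU-hours below is approximate

# Integer arithmetic avoids floating-point rounding in the split.
single_turn = total_conversations * single_turn_percent // 100
multi_turn = total_conversations - single_turn
gpu_hours = num_gpus * training_days * 24

print(single_turn)  # 1469400
print(multi_turn)   # 110600
print(gpu_hours)    # 4608
```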

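📚 Detailed Documentation

Because there is no standard audio benchmark for evaluating audio LLMs on Southeast Asian languages, we manually created a benchmark called SeaBench-Audio. It consists of nine tasks:

Tasks with both audio and text input: audio captioning (AC), automatic speech recognition (ASR), speech-to-text translation (S2TT), speech emotion recognition (SER), speech question answering (SQA), and speech summarization (SS).
Tasks with audio-only input: factual, math, and general tasks.

We manually annotated 15 questions per task per language. For evaluation, qualified native speakers rated each response on a scale of 1 to 5, with 5 representing the highest quality.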
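
To make the scoring scheme concrete, here is a minimal sketch of how 1-to-5 human ratings could be averaged per model and language. The aggregation shown (a plain mean per model and language) is an assumption, since the text only states the rating scale; the model names and scores are invented placeholders, not benchmark results, and `average_scores` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings as (model, language, task, score). Not real results.
ratings = [
    ("model_a", "id", "ASR", 5),
    ("model_a", "id", "SQA", 4),
    ("model_a", "th", "ASR", 3),
    ("model_b", "id", "ASR", 2),
    ("model_b", "th", "ASR", 4),
]


def average_scores(rows):
    """Average the 1-5 ratings for each (model, language) pair."""
    buckets = defaultdict(list)
    for model, language, _task, score in rows:
        if not 1 <= score <= 5:
            raise ValueError(f"score {score} is outside the 1-5 scale")
        buckets[(model, language)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


averages = average_scores(ratings)
for (model, language), avg in sorted(averages.items()):
    print(model, language, avg)
```

Grouping by task as well would only require adding the task to the dictionary key.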

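Because no other audio LLM supports all three Southeast Asian languages, we compared SeaLLMs-Audio with related audio LLMs of similar size: Qwen2-Audio-7B-Instruct (Qwen2-Audio), MERaLiON-AudioLLM-Whisper-SEA-LION (MERaLiON), llama3.1-typhoon2-audio-8b-instruct (typhoon2-audio), and DiVA-llama-3-v0-8b (DiVA). All of these audio LLMs accept both audio and text as input. The results are shown in the figure below.

[Figure: Average scores of SeaLLMs-Audio and other audio LLMs on SeaBench-Audio]

The results show that SeaLLMs-Audio achieves state-of-the-art performance in all five languages, demonstrating its effectiveness for audio-related tasks in Southeast Asia.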

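📄 License

This project uses a custom ("other") license named seallms. See LICENSE for the full terms.

📋 Citation

If you find our project useful, we hope you will star our repository and cite our work as follows. Corresponding author: Wenxuan Zhang (wxzhang@sutd.edu.sg)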

@misc{SeaLLMs-Audio,
    author = {Chaoqun Liu and Mahani Aljunied and Guizhen Chen and Hou Pong Chan and Weiwen Xu and Yu Rong and Wenxuan Zhang},
    title = {SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia},
    year = {2025},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/DAMO-NLP-SG/SeaLLMs-Audio}},
}