SeaLLMs-Audio-7B: an open-source audio-language model supporting five languages for audio analysis and voice interaction

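SeaLLMs-Audio-7B

Developed by SeaLLMs

SeaLLMs-Audio is a large audio-language model for Southeast Asia. It supports five languages (Indonesian, Thai, Vietnamese, English, and Chinese) and handles tasks such as audio analysis and voice interaction.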

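Task: audio-to-text

Format: Safetensors

Multilingual. License: other. Tags: Southeast Asian multilingual audio processing, mixed audio-text input, unified speech tasks

Downloads: 539

Released: March 13, 2025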

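Model Overview

SeaLLMs-Audio is the multimodal (audio) extension of SeaLLMs, the family of large language models for Southeast Asian languages. As the first large audio-language model (LALM) to support multiple Southeast Asian languages, it covers Indonesian, Thai, and Vietnamese, along with English and Chinese.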

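Model Features

Multilingual support

Mainly covers five languages: Indonesian, Thai, Vietnamese, English, and Chinese

Multimodal input

Accepts flexible input formats: audio only, text only, or audio combined with text

Multi-task processing

Covers analysis tasks such as audio captioning, speech recognition, speech translation, emotion recognition, spoken question answering, and speech summarization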

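Model Capabilities

Audio captioning

Speech recognition

Speech translation

Emotion recognition

Spoken question answering

Speech summarization

Factual question answering

Math calculation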

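Use Cases

Voice interaction

Multilingual voice assistant

Supports voice interaction in several Southeast Asian languages

Aims to provide fluent spoken conversation

Audio analysis

Speech transcription

Transcribes speech in Southeast Asian languages into text

Targets accurate speech recognition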

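🚀 SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia

SeaLLMs-Audio is the multimodal (audio) extension of the SeaLLMs series. It supports multiple Southeast Asian languages and handles a range of audio-related tasks.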

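🚀 Quick Start

Our model is available on Hugging Face, and you can use it with either the transformers library or the vllm library. The examples below show how to get started.

Getting started with the `transformers` library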

from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
import librosa
import os

model = Qwen2AudioForConditionalGeneration.from_pretrained("SeaLLMs/SeaLLMs-Audio-7B", device_map="auto")
processor = AutoProcessor.from_pretrained("SeaLLMs/SeaLLMs-Audio-7B")

def response_to_audio(conversation, model=None, processor=None):
    # Render the conversation into the model's chat prompt format.
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    sampling_rate = processor.feature_extractor.sampling_rate
    # Load every audio clip referenced in the conversation, resampled to the model's rate.
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio" and ele["audio_url"] is not None:
                    audios.append(librosa.load(ele["audio_url"], sr=sampling_rate)[0])
    if audios:
        inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True, sampling_rate=sampling_rate)
    else:
        inputs = processor(text=text, return_tensors="pt", padding=True)
    # Move all tensors to the device the model was loaded on.
    inputs = {k: v.to(model.device) for k, v in inputs.items() if v is not None}
    # Greedy decoding; temperature has no effect when do_sample=False, so it is omitted.
    generate_ids = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
    # Drop the prompt tokens so that only the newly generated reply is decoded.
    generate_ids = generate_ids[:, inputs["input_ids"].size(1):]
    response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    return response

# Voice chat
os.system("wget -O fact_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/fact_en.wav")
os.system("wget -O general_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/general_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "fact_en.wav"},
    ]},
    {"role": "assistant", "content": "The most abundant gas in Earth's atmosphere is nitrogen. It makes up about 78 percent of the atmosphere by volume."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "general_en.wav"},
    ]},
]

response = response_to_audio(conversation, model=model, processor=processor)
print(response)

# Audio analysis
os.system("wget -O ASR_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/ASR_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "ASR_en.wav"},
        {"type": "text", "text": "Please write down what is spoken in the audio file."},
    ]},
]

response = response_to_audio(conversation, model=model, processor=processor)
print(response)
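
The examples fetch audio with `wget` through `os.system`, so a failed download only surfaces later as a load error inside `librosa`. A minimal sketch of a pre-flight check follows. `missing_audio_files` is a hypothetical helper, not part of the SeaLLMs-Audio release, and it assumes every `audio_url` is a local file path, as in the examples above.

```python
import os


def missing_audio_files(conversation):
    """Return local audio paths referenced in the conversation that do not exist."""
    missing = []
    for message in conversation:
        content = message["content"]
        # Assistant turns carry a plain string; only list-style content holds audio.
        if not isinstance(content, list):
            continue
        for element in content:
            if element.get("type") != "audio":
                continue
            path = element.get("audio_url")
            if path is not None and not os.path.exists(path):
                missing.append(path)
    return missing


# Hypothetical conversation pointing at a file that was never downloaded.
example_conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "this_file_does_not_exist.wav"},
        {"type": "text", "text": "Please write down what is spoken in the audio file."},
    ]},
    {"role": "assistant", "content": "A plain-text reply."},
]
print(missing_audio_files(example_conversation))
```

Calling the check before `response_to_audio` lets you stop early with a clear message when a download did not succeed.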

Inference with the `vllm` library

from vllm import LLM, SamplingParams
import librosa, os
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("SeaLLMs/SeaLLMs-Audio-7B")
llm = LLM(
    model="SeaLLMs/SeaLLMs-Audio-7B", trust_remote_code=True, gpu_memory_utilization=0.5,  
    enforce_eager=True,  device = "cuda",
    limit_mm_per_prompt={"audio": 5},
)

def response_to_audio(conversation, model=None, processor=None, temperature=0.1, repetition_penalty=1.1, top_p=0.9, max_new_tokens=4096):
    # Render the conversation into the model's chat prompt format.
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    sampling_rate = processor.feature_extractor.sampling_rate
    # Load every audio clip referenced in the conversation, resampled to the model's rate.
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio" and ele["audio_url"] is not None:
                    audios.append(librosa.load(ele["audio_url"], sr=sampling_rate)[0])

    sampling_params = SamplingParams(
        temperature=temperature, max_tokens=max_new_tokens, repetition_penalty=repetition_penalty, top_p=top_p, top_k=20,
        stop_token_ids=[],
    )

    # Each audio item is passed as a (waveform, sampling_rate) tuple.
    # Renamed from `input` to avoid shadowing the Python builtin.
    request = {"prompt": text}
    if audios:
        # Attach multimodal data only when the conversation actually contains audio.
        request["multi_modal_data"] = {"audio": [(audio, sampling_rate) for audio in audios]}

    output = model.generate([request], sampling_params=sampling_params)[0]
    response = output.outputs[0].text
    return response

# Voice chat
os.system("wget -O fact_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/fact_en.wav")
os.system("wget -O general_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/general_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "fact_en.wav"},
    ]},
    {"role": "assistant", "content": "The most abundant gas in Earth's atmosphere is nitrogen. It makes up about 78 percent of the atmosphere by volume."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "general_en.wav"},
    ]},
]

response = response_to_audio(conversation, model=llm, processor=processor)
print(response)

# Audio analysis
os.system("wget -O ASR_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/ASR_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "ASR_en.wav"},
        {"type": "text", "text": "Please write down what is spoken in the audio file."},
    ]},
]

response = response_to_audio(conversation, model=llm, processor=processor)
print(response)

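✨ Key Features

Multilingual: the model mainly supports 5 languages, namely 🇮🇩 Indonesian, 🇹🇭 Thai, 🇻🇳 Vietnamese, 🇬🇧 English, and 🇨🇳 Chinese.
Multimodal input: the model accepts flexible input formats, such as audio only, text only, and audio combined with text.
Multi-task: the model supports a wide range of tasks. These include audio analysis tasks such as audio captioning, automatic speech recognition, speech-to-text translation, speech emotion recognition, spoken question answering, and speech summarization. It also handles voice chat tasks, including answering factual, mathematical, and other general questions.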
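
The three input formats map onto the conversation structure used in the quick start. The sketch below builds one user turn for each case. `make_user_turn` is a hypothetical helper introduced here for illustration, and expressing a text-only turn as a list with a single `text` element is an assumption modeled on the audio-plus-text example, not something the model card spells out.

```python
def make_user_turn(audio_path=None, text=None):
    """Build one user message in the conversation format used in the quick start."""
    if audio_path is None and text is None:
        raise ValueError("Provide an audio path, a text prompt, or both.")
    content = []
    if audio_path is not None:
        # Audio elements carry a path under the "audio_url" key.
        content.append({"type": "audio", "audio_url": audio_path})
    if text is not None:
        content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


# Audio only: the spoken question is the whole prompt (voice chat).
audio_only = [make_user_turn(audio_path="general_en.wav")]

# Text only: no audio element at all.
text_only = [make_user_turn(text="What is the capital of Vietnam?")]

# Audio plus text: the text instructs the model what to do with the audio.
audio_and_text = [make_user_turn(
    audio_path="ASR_en.wav",
    text="Please write down what is spoken in the audio file.",
)]
```

Each of these lists can be passed as the `conversation` argument of the `response_to_audio` function from the quick start.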

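📦 Installation

The documentation does not give specific installation steps. See the Quick Start section above for how to use the model.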

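💻 Usage Examples

Basic usage

The code examples in the Quick Start section above show how to use the transformers and vllm libraries for voice chat and audio analysis. This is the basic way to use the model.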

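🔧 Technical Details

SeaLLMs-Audio is built on Qwen2-Audio-7B and Qwen2.5-7B-Instruct. We replaced the LLM module in Qwen2-Audio-7B with Qwen2.5-7B-Instruct and then performed full-parameter fine-tuning on a large-scale audio dataset. The dataset contains 1.58 million conversations covering multiple tasks, 93% of which are single-turn. The tasks fall roughly into the following categories: automatic speech recognition (ASR), audio captioning (AC), speech-to-text translation (S2TT), question answering (QA), speech summarization (SS), spoken question answering (SQA), chat, math, fact, and mixed tasks (mixed).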
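
To put the dataset figures in absolute terms, the short calculation below converts the stated 93% single-turn share into approximate conversation counts. Both inputs are the rounded figures from the text, so the results are estimates only.

```python
total_conversations = 1_580_000  # "1.58 million conversations" as stated in the text
single_turn_share = 0.93         # "93% are single-turn" as stated in the text

# Round to the nearest whole conversation; the inputs are themselves rounded.
single_turn = round(total_conversations * single_turn_share)
multi_turn = total_conversations - single_turn

print(f"single-turn: about {single_turn:,}")
print(f"multi-turn:  about {multi_turn:,}")
```

Under these figures, roughly 1.47 million conversations are single-turn and about 0.11 million are multi-turn.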

The distribution of the data across languages and tasks is shown below:

[Figure: distribution of the SeaLLMs-Audio training data across languages and across tasks]

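The training data comes from multiple sources, including public datasets and in-house data. The public datasets include gigaspeech, gigaspeech2, common voice, AudioCaps, VoiceAssistant-400K, YODAS2, and Multitask-National-Speech-Corpus. We thank the authors of these datasets for their contributions to the community!

We trained the model on the dataset for 1 epoch, which took about 6 days on 32 A800 GPUs.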
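
As a rough sense of the compute involved, the stated setup can be converted into GPU-hours. This is back-of-the-envelope arithmetic that treats "about 6 days" as exactly 6 full days of continuous use on all 32 GPUs, which is an assumption.

```python
num_gpus = 32        # A800 GPUs, as stated in the text
training_days = 6    # "about 6 days", taken here as exactly 6 (assumption)
hours_per_day = 24

gpu_hours = num_gpus * training_days * hours_per_day
print(f"approximately {gpu_hours:,} A800 GPU-hours")
```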

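📚 Detailed Documentation

Because there is no standard audio benchmark for evaluating audio LLMs in Southeast Asia, we manually created a benchmark called SeaBench-Audio. It comprises nine tasks:

Tasks with both audio and text input: audio captioning (AC), automatic speech recognition (ASR), speech-to-text translation (S2TT), speech emotion recognition (SER), spoken question answering (SQA), and speech summarization (SS).
Tasks with audio-only input: factuality, math, and general.

We manually annotated 15 questions per task per language. For evaluation, qualified native speakers rated each response on a scale of 1 to 5, with 5 representing the highest quality.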
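
The benchmark size and the scoring scheme can be sketched as follows. Nine tasks with 15 questions each gives 135 questions per language. The aggregation function is only an illustration of averaging 1 to 5 ratings per language and task: `average_scores` and the sample ratings are hypothetical, and the text does not describe how the authors aggregate scores.

```python
from collections import defaultdict
from statistics import mean

TASKS_PER_LANGUAGE = 9    # six audio+text tasks plus three audio-only tasks
QUESTIONS_PER_TASK = 15   # annotated per task per language
questions_per_language = TASKS_PER_LANGUAGE * QUESTIONS_PER_TASK


def average_scores(ratings):
    """Average human ratings per (language, task) pair.

    `ratings` is an iterable of (language, task, score) tuples, where
    score is an integer from 1 (worst) to 5 (best).
    """
    buckets = defaultdict(list)
    for language, task, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        buckets[(language, task)].append(score)
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical ratings, made up purely to exercise the function.
hypothetical_ratings = [
    ("id", "ASR", 5),
    ("id", "ASR", 4),
    ("th", "SQA", 3),
]
print(questions_per_language)
print(average_scores(hypothetical_ratings))
```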

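Because no existing audio LLM supports all three Southeast Asian languages, we compared SeaLLMs-Audio with related audio LLMs of similar size: Qwen2-Audio-7B-Instruct (Qwen2-Audio), MERaLiON-AudioLLM-Whisper-SEA-LION (MERaLiON), llama3.1-typhoon2-audio-8b-instruct (typhoon2-audio), and DiVA-llama-3-v0-8b (DiVA). All of these audio LLMs accept both audio and text as input. The results are shown in the figure below.

[Figure: average scores of SeaLLMs-Audio and other audio LLMs on SeaBench-Audio]

The results show that SeaLLMs-Audio achieves state-of-the-art performance in all five languages, demonstrating its effectiveness on audio-related tasks for Southeast Asia.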

📄 License

This project is released under a custom license named seallms (listed as "other"). See LICENSE for the full terms.

📋 Citation

If you find our project useful, please consider starring our repository and citing our work as follows. Corresponding author: Wenxuan Zhang (wxzhang@sutd.edu.sg)

@misc{SeaLLMs-Audio,
    author = {Chaoqun Liu and Mahani Aljunied and Guizhen Chen and Hou Pong Chan and Weiwen Xu and Yu Rong and Wenxuan Zhang},
    title = {SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia},
    year = {2025},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/DAMO-NLP-SG/SeaLLMs-Audio}},
}