Open-source SeamlessM4Tv2-Large speech encoder - cross-lingual and multilingual sequence-level audio classification

Seamless M4t V2 Large Speech Encoder

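Developed by WueNLP

The speech encoder module extracted from SeamlessM4Tv2-Large; it performs well on cross-lingual and multilingual sequence-level audio classification tasks.

Audio classification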

Transformers

Supports multiple languages #multilingual speech encoding #audio classification #cross-lingual processing

Downloads: 67

Released: 11/18/2024

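Model overview

This model is a multilingual speech encoder intended for audio classification tasks. It supports more than 100 languages.

Model features

Multilingual support

Speech encoding and classification for more than 100 languages

Audio classification

Performs well on cross-lingual and multilingual sequence-level audio classification tasks

Efficient processing

Operates on 16 kHz audio waveforms; audio at other sampling rates must be resampled first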
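The 16 kHz requirement means that recordings at other sampling rates, or with several channels, need preparation before they reach the feature extractor. The following is a minimal sketch of that step using only NumPy. The function name `to_mono_16k` is hypothetical, and linear interpolation is used purely to keep the example dependency-free: it applies no anti-aliasing filter, so a proper resampler such as `torchaudio.functional.resample` is preferable for real data.

```python
import numpy as np

TARGET_RATE = 16_000  # sampling rate the encoder is described as expecting


def to_mono_16k(waveform, orig_rate):
    """Downmix (channels, samples) audio to mono and resample it to 16 kHz.

    Illustrative only: linear interpolation has no anti-aliasing filter.
    """
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim == 2:
        # Average the channels to obtain a single mono track.
        waveform = waveform.mean(axis=0)
    if orig_rate == TARGET_RATE:
        return waveform
    n_out = int(round(waveform.shape[0] * TARGET_RATE / orig_rate))
    old_times = np.arange(waveform.shape[0]) / orig_rate
    new_times = np.arange(n_out) / TARGET_RATE
    return np.interp(new_times, old_times, waveform).astype(np.float32)
```

For example, one second of stereo audio recorded at 8 kHz becomes a single array of 16,000 samples.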

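Model capabilities

Audio feature extraction

Multilingual audio classification

Speech encoding

Use cases

Speech recognition

Multilingual speech classification

Classifies speech in many languages

Performs well on the SIB-Fleurs dataset

Speech processing

Speech feature extraction

Extracts useful features from speech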

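🚀 SeamlessM4Tv2-Large Speech Encoder

This project extracts the speech encoder from SeamlessM4Tv2-Large. The encoder performs well on cross-lingual and multilingual sequence-level audio classification tasks (see SIB-Fleurs for results).

All credit goes to the original SeamlessM4Tv2-Large team.

🚀 Quick start

The speech encoder extracted from SeamlessM4Tv2-Large can be used for cross-lingual and multilingual sequence-level audio classification tasks.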
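Sequence-level classification needs one vector per utterance, while an encoder produces one vector per frame. A common way to bridge the two is masked mean pooling, sketched below with NumPy. This is an illustration of the general technique, not a description of how this model's classification head pools internally. The function name `masked_mean_pool` is hypothetical, and the sketch assumes the mask already has the same number of frames as the hidden states, which may not hold for the feature extractor's attention mask if the encoder shortens the sequence.

```python
import numpy as np


def masked_mean_pool(hidden_states, frame_mask):
    """Average frame vectors per utterance, ignoring padded frames.

    hidden_states: (batch, frames, dim) array of encoder outputs.
    frame_mask:    (batch, frames) array, 1 for real frames and 0 for padding.
                   Assumed to be aligned with hidden_states.
    """
    hidden_states = np.asarray(hidden_states, dtype=np.float32)
    mask = np.asarray(frame_mask, dtype=np.float32)[..., None]
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for fully padded rows.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts
```

The result has shape (batch, dim) and can be passed to any classifier that expects fixed-size inputs.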

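✨ Key features

Multilingual support: covers many languages, including but not limited to English, Chinese, French, and German; see the language list at the top of the documentation.
Audio classification: performs well on cross-lingual and multilingual sequence-level audio classification tasks.

📦 Installation

The documentation gives no specific installation steps; install the transformers library in the usual way. The usage example below also imports torch, torchaudio, and datasets.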
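Since no installation steps are given, a quick check that the packages imported by the usage example are present can save a failed run. The sketch below uses only the standard library. The package list is taken from the imports in the usage example, and the helper name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec

# Packages imported by the usage example in this document.
REQUIRED = ("torch", "torchaudio", "transformers", "datasets")


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; try: pip install " + " ".join(missing))
    else:
        print("All packages used by the usage example are importable.")
```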

💻 Usage example

Basic usage

# It is best to run both the feature extractor and the model on a GPU!
from datasets import load_dataset
from transformers import (
    AutoModel,
    AutoModelForAudioClassification,
    AutoFeatureExtractor,
)
import torch
import torchaudio

device = "cuda:0"

feature_extractor = AutoFeatureExtractor.from_pretrained(
    "WueNLP/seamless-m4t-v2-large-speech-encoder", trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "WueNLP/seamless-m4t-v2-large-speech-encoder",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to(device)

audio, orig_freq = torchaudio.load(
    "https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav"
)
audio = torchaudio.functional.resample(
    audio, orig_freq=orig_freq, new_freq=16_000
)  # must be a 16 kHz waveform array
# return_attention_mask=True is needed for batched processing
audio_inputs = feature_extractor(audio, sampling_rate=16_000, return_attention_mask=True, return_tensors="pt", device=device)
audio_inputs = audio_inputs.to(device)
with torch.autocast(dtype=torch.bfloat16, device_type="cuda"):
    # bfloat16 tensors cannot be converted to NumPy directly, so cast to float32 first
    audio_hidden_states = model(**audio_inputs)[0].detach().float().cpu().numpy().squeeze()


# Instantiate a model for audio classification
model = AutoModelForAudioClassification.from_pretrained(
    "WueNLP/seamless-m4t-v2-large-speech-encoder",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    # SIB-Fleurs has 7 labels
    num_labels=7,
).to(device)
eng_Latn = load_dataset("wuenlp/sib-fleurs", "eng_Latn", split="train")
examples = [eng_Latn[i] for i in range(5)]
labels = torch.LongTensor([example["category"] for example in examples]).to(device)
batch = feature_extractor(
    # The [0] index is used because each instance usually has several utterances; the others are ignored here
    [example["audio"][0]["array"] for example in examples],
    sampling_rate=16000,
    device=device,
    return_attention_mask=True,
    return_tensors="pt",
).to(device)
batch["labels"] = labels
with torch.autocast(dtype=torch.bfloat16, device_type="cuda"):
    # The outputs include the loss and the logits
    outputs = model(**batch)
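The classification call above returns a loss and logits, which still have to be turned into predicted labels. The sketch below shows one way to do that with NumPy: a numerically stable softmax followed by an argmax. The helper name `predict_from_logits` is hypothetical. It assumes the logits have already been moved to a float32 NumPy array of shape (batch, num_labels), for example via `outputs.logits.float().cpu().numpy()` if the output object exposes a `logits` attribute. A classification head loaded on top of an encoder-only checkpoint is presumably untrained, so predictions are only meaningful after fine-tuning.

```python
import numpy as np


def predict_from_logits(logits):
    """Return (predicted_label_ids, probabilities) for a batch of logits."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs
```

The returned label ids index into the 7 SIB-Fleurs categories in whatever order the dataset defines them.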

📄 License

This project is released under the cc-by-nc-4.0 license.

📚 Detailed documentation

Citation

If you use this model, please cite the original SeamlessM4Tv2 paper.

@misc{communication2023seamlessmultilingualexpressivestreaming,
      title={Seamless: Multilingual Expressive and Streaming Speech Translation}, 
      author={Seamless Communication and Loïc Barrault and Yu-An Chung and Mariano Coria Meglioli and David Dale and Ning Dong and Mark Duppenthaler and Paul-Ambroise Duquenne and Brian Ellis and Hady Elsahar and Justin Haaheim and John Hoffman and Min-Jae Hwang and Hirofumi Inaguma and Christopher Klaiber and Ilia Kulikov and Pengwei Li and Daniel Licht and Jean Maillard and Ruslan Mavlyutov and Alice Rakotoarison and Kaushik Ram Sadagopan and Abinesh Ramakrishnan and Tuan Tran and Guillaume Wenzek and Yilin Yang and Ethan Ye and Ivan Evtimov and Pierre Fernandez and Cynthia Gao and Prangthip Hansanti and Elahe Kalbassi and Amanda Kallet and Artyom Kozhevnikov and Gabriel Mejia Gonzalez and Robin San Roman and Christophe Touret and Corinne Wong and Carleigh Wood and Bokai Yu and Pierre Andrews and Can Balioglu and Peng-Jen Chen and Marta R. Costa-jussà and Maha Elbayad and Hongyu Gong and Francisco Guzmán and Kevin Heffernan and Somya Jain and Justine Kao and Ann Lee and Xutai Ma and Alex Mourachko and Benjamin Peloquin and Juan Pino and Sravya Popuri and Christophe Ropers and Safiyyah Saleem and Holger Schwenk and Anna Sun and Paden Tomasello and Changhan Wang and Jeff Wang and Skyler Wang and Mary Williamson},
      year={2023},
      eprint={2312.05187},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2312.05187}, 
}

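Information table

Property	Details
Supported languages	af, am, ar, and many others (see the language list at the top of the documentation)
Tags	audio-to-audio, text-to-speech
Multilinguality	multilingual
Task category	audio classification
Library name	transformers
Model name	SeamlessM4Tv2-Large Speech Encoder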