Open-Source SeamlessM4Tv2-Large Speech Encoder - Sequence-Level Audio Classification Across Languages

Seamless M4t V2 Large Speech Encoder

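Developed by WueNLP

A speech encoder module extracted from SeamlessM4Tv2-Large that performs well on cross-lingual and multilingual sequence-level audio classification tasks.

Audio classification

Transformers

Multilingual #Multilingual speech encoding #Audio classification #Cross-lingual processing

Downloads: 67

Release date: 11/18/2024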

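Model Overview

This model is a multilingual speech encoder intended for audio classification tasks, with support for more than 100 languages.

Model Features

Multilingual support

Supports speech encoding and classification in more than 100 languages

Audio classification

Performs well on cross-lingual and multilingual sequence-level audio classification tasks

Efficient processing

Optimized for processing 16 kHz audio waveforms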
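
Since the encoder expects 16 kHz input, recordings at other rates need to be resampled first, and multi-channel recordings are usually mixed down to mono. The usage example later on this page resamples with `torchaudio.functional.resample`; the sketch below only illustrates the surrounding bookkeeping with NumPy. The helper names `to_mono` and `resampled_length` are hypothetical and not part of the model's API, and treating a stereo clip as one mono utterance is an assumption about how you want to feed the encoder.

```python
import numpy as np

TARGET_RATE = 16_000  # sampling rate the encoder expects, per this page


def to_mono(waveform: np.ndarray) -> np.ndarray:
    """Average a (channels, samples) array down to a 1-D mono signal."""
    if waveform.ndim == 1:
        return waveform
    return waveform.mean(axis=0)


def resampled_length(num_samples: int, orig_rate: int, target_rate: int = TARGET_RATE) -> int:
    """Expected number of samples after resampling, rounded up."""
    return -(-num_samples * target_rate // orig_rate)  # ceiling division


# One second of dummy 44.1 kHz stereo audio (hypothetical input).
stereo = np.ones((2, 44_100), dtype=np.float32)
mono = to_mono(stereo)
print(mono.shape, resampled_length(mono.shape[0], 44_100))
```

The same downmix on a `torch` tensor would be `audio.mean(dim=0)`; the length check is handy as a sanity test after resampling.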

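Model Capabilities

Speech feature extraction

Multilingual audio classification

Speech encoding

Use Cases

Speech recognition

Multilingual speech classification

Classifies speech in multiple languages

Performs well on the SIB-Fleurs dataset

Speech processing

Speech feature extraction

Extracts useful features from speech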
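
For sequence-level tasks, frame-level encoder features are commonly reduced to one vector per clip. A minimal sketch of masked mean pooling follows, using dummy NumPy arrays in place of real encoder output. It assumes a mask that is already aligned with the encoder's output frames; whether the model's `attention_mask` has the same length as its hidden states is not stated on this page. The function name `masked_mean_pool` is hypothetical.

```python
import numpy as np


def masked_mean_pool(hidden_states: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average frame vectors over time, ignoring padded frames.

    hidden_states: (batch, frames, hidden) float array
    mask:          (batch, frames) array, 1 for real frames and 0 for padding
    """
    mask = mask.astype(hidden_states.dtype)[..., None]  # (batch, frames, 1)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


# Dummy data standing in for encoder output: 2 clips, 4 frames, 3 features.
hidden = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])  # second clip has 2 padded frames
pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)
```

Mean pooling is only one option; it is shown here because it is simple and keeps padded frames from diluting the clip embedding.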

🚀 SeamlessM4Tv2-Large Speech Encoder

This repository contains the speech encoder carved out of SeamlessM4Tv2-Large. The encoder performs well on cross-lingual and multilingual sequence-level audio classification tasks (see the SIB-Fleurs results).

All credit goes to the original SeamlessM4Tv2-Large team.

🚀 Quick Start

This repository supports both AutoModel and AutoModelForAudioClassification (or AutoModelForSequenceClassification).

💻 Usage Examples

Basic Usage

# best to use both feature extractor and model with GPU!
from datasets import load_dataset
from transformers import (
    AutoModel,
    AutoModelForAudioClassification,
    AutoFeatureExtractor,
)
import torch
import torchaudio

device = "cuda:0"

feature_extractor = AutoFeatureExtractor.from_pretrained(
    "WueNLP/seamless-m4t-v2-large-speech-encoder", trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "WueNLP/seamless-m4t-v2-large-speech-encoder",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to(device)

audio, orig_freq = torchaudio.load(
    "https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav"
)
audio = torchaudio.functional.resample(
    audio, orig_freq=orig_freq, new_freq=16_000
)  # must be a 16 kHz waveform array
# return_attention_mask=True is needed for batching
audio_inputs = feature_extractor(
    audio,
    sampling_rate=16_000,
    return_attention_mask=True,
    return_tensors="pt",
    device=device,
)
audio_inputs = audio_inputs.to(device)
with torch.autocast(dtype=torch.bfloat16, device_type="cuda"):
    # cast to float32 before .numpy(), since NumPy has no bfloat16 dtype
    audio_hidden_states = (
        model(**audio_inputs)[0].detach().float().cpu().numpy().squeeze()
    )


# instantiate a model for AudioClassification
model = AutoModelForAudioClassification.from_pretrained(
    "WueNLP/seamless-m4t-v2-large-speech-encoder",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    # SIB-Fleurs has 7 labels
    num_labels=7,
).to(device)
eng_Latn = load_dataset("wuenlp/sib-fleurs", "eng_Latn", split="train")
examples = [eng_Latn[i] for i in range(5)]
labels = torch.LongTensor([example["category"] for example in examples]).to(device)
batch = feature_extractor(
    # [0] indexing here since there typically are multiple utterances per instance, we just ignore those
    [example["audio"][0]["array"] for example in examples],
    sampling_rate=16000,
    device=device,
    return_attention_mask=True,
    return_tensors="pt",
).to(device)
batch["labels"] = labels
with torch.autocast(dtype=torch.bfloat16, device_type="cuda"):
    # outputs comprises loss & logits
    outputs = model(**batch)
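
To turn the classifier's logits into predictions, a softmax followed by an argmax is the usual step. The sketch below uses dummy NumPy logits rather than real model output; with the code above, the input would presumably come from something like `outputs.logits.float().cpu().numpy()`, where the `logits` attribute name is an assumption based on the comment that the outputs comprise loss and logits. The classification head created with `num_labels=7` is presumably newly initialized, so predictions should only become meaningful after fine-tuning. The function name `predict_labels` is hypothetical.

```python
import numpy as np


def predict_labels(logits: np.ndarray):
    """Return (predicted class ids, class probabilities) for a batch of logits."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs


# Dummy logits for 2 clips and 7 classes (SIB-Fleurs has 7 labels).
dummy_logits = np.array([
    [0.1, 2.0, -1.0, 0.0, 0.3, -0.5, 0.2],
    [1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0],
])
pred_ids, probs = predict_labels(dummy_logits)
print(pred_ids)
```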

📚 Documentation

If you use this model, please cite the original SeamlessM4Tv2 paper.

@misc{communication2023seamlessmultilingualexpressivestreaming,
      title={Seamless: Multilingual Expressive and Streaming Speech Translation}, 
      author={Seamless Communication and Loïc Barrault and Yu-An Chung and Mariano Coria Meglioli and David Dale and Ning Dong and Mark Duppenthaler and Paul-Ambroise Duquenne and Brian Ellis and Hady Elsahar and Justin Haaheim and John Hoffman and Min-Jae Hwang and Hirofumi Inaguma and Christopher Klaiber and Ilia Kulikov and Pengwei Li and Daniel Licht and Jean Maillard and Ruslan Mavlyutov and Alice Rakotoarison and Kaushik Ram Sadagopan and Abinesh Ramakrishnan and Tuan Tran and Guillaume Wenzek and Yilin Yang and Ethan Ye and Ivan Evtimov and Pierre Fernandez and Cynthia Gao and Prangthip Hansanti and Elahe Kalbassi and Amanda Kallet and Artyom Kozhevnikov and Gabriel Mejia Gonzalez and Robin San Roman and Christophe Touret and Corinne Wong and Carleigh Wood and Bokai Yu and Pierre Andrews and Can Balioglu and Peng-Jen Chen and Marta R. Costa-jussà and Maha Elbayad and Hongyu Gong and Francisco Guzmán and Kevin Heffernan and Somya Jain and Justine Kao and Ann Lee and Xutai Ma and Alex Mourachko and Benjamin Peloquin and Juan Pino and Sravya Popuri and Christophe Ropers and Safiyyah Saleem and Holger Schwenk and Anna Sun and Paden Tomasello and Changhan Wang and Jeff Wang and Skyler Wang and Mary Williamson},
      year={2023},
      eprint={2312.05187},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2312.05187}, 
}

📄 License

This model is provided under the CC-BY-NC-4.0 license.

Property	Details
Model type	Speech encoder
Supported languages	af, am, ar, as, az, be, bn, bs, bg, ca, cs, zh, cy, da, de, el, en, et, fi, fr, or, om, ga, gl, gu, ha, he, hi, hr, hu, hy, ig, id, is, it, jv, ja, kn, ka, kk, mn, km, ky, ko, lo, ln, lt, lb, lg, lv, ml, mr, mk, mt, mi, my, nl, nb, ne, ny, oc, pa, ps, fa, pl, pt, ro, ru, sk, sl, sn, sd, so, es, sr, sv, sw, ta, te, tg, tl, th, tr, uk, ur, uz, vi, wo, xh, yo, ms, zu, ary, arz, yue, kea
Tags	speech-to-speech, text-to-speech
Multilinguality	Multilingual
Task category	Audio classification
Library name	transformers
Display name	SeamlessM4Tv2-Large Speech Encoder