Seamless M4t V2 Large Speech Encoder

Developed by WueNLP

Speech encoder module extracted from SeamlessM4Tv2-Large, which performs strongly on cross-lingual and multilingual sequence-level audio classification tasks.

Audio Classification

Transformers

Tags: Supports multiple languages, Multilingual speech encoding, Audio classification, Cross-language processing

Release Time: 11/18/2024

Model Overview

This model is the multilingual speech encoder extracted from SeamlessM4Tv2-Large, packaged for sequence-level audio classification tasks. It supports over 100 languages.

Model Features

Multilingual support

Supports speech encoding and classification for over 100 languages

Audio classification

Excels in cross-language and multilingual sequence-level audio classification tasks

Efficient processing

Expects 16 kHz audio waveforms as input
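Because the encoder expects 16 kHz input, recordings at other sample rates or with several channels need preparation first. The following is a minimal NumPy sketch of that step, not the repository's own preprocessing: the helper name `to_mono_16k` is hypothetical, and it resamples by plain linear interpolation, which is a crude stand-in for a proper resampler such as `torchaudio.functional.resample` (linear interpolation applies no anti-aliasing filter).

```python
import numpy as np

TARGET_RATE = 16_000  # sample rate the encoder expects


def to_mono_16k(waveform: np.ndarray, orig_rate: int) -> np.ndarray:
    """Downmix a (channels, samples) array to mono and resample to 16 kHz.

    Hypothetical helper for illustration. Linear interpolation is used for
    brevity; it does no anti-alias filtering, so prefer a real resampler
    outside of quick experiments.
    """
    waveform = np.atleast_2d(np.asarray(waveform, dtype=np.float32))
    mono = waveform.mean(axis=0)  # average the channels -> (samples,)
    if orig_rate == TARGET_RATE:
        return mono
    n_out = int(round(mono.shape[0] * TARGET_RATE / orig_rate))
    # Positions of the output samples expressed on the input sample grid.
    src_positions = np.linspace(0, mono.shape[0] - 1, num=n_out)
    resampled = np.interp(src_positions, np.arange(mono.shape[0]), mono)
    return resampled.astype(np.float32)


# One second of synthetic stereo audio at 44.1 kHz (a 440 Hz tone).
t = np.arange(44_100) / 44_100
stereo = np.stack([np.sin(2 * np.pi * 440 * t)] * 2)
mono_16k = to_mono_16k(stereo, orig_rate=44_100)
print(mono_16k.shape)  # (16000,)
```

The resulting one-dimensional array can then be passed to the feature extractor together with `sampling_rate=16000`.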

Model Capabilities

Audio feature extraction

Multilingual audio classification

Speech encoding

Use Cases

Speech recognition

Multilingual speech classification

Classifying speech in multiple languages

Performs strongly on the SIB-Fleurs dataset

Speech processing

Speech feature extraction

Extracting useful features from speech
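To turn frame-level encoder outputs into one vector per utterance, a common choice is masked mean pooling. The sketch below illustrates that idea with NumPy on toy data. It is an assumption for illustration, not a description of how this repository's classification head pools, and it presumes a mask at the same frame rate as the hidden states (the encoder may shorten the sequence relative to the feature extractor's attention mask). The name `masked_mean_pool` is hypothetical.

```python
import numpy as np


def masked_mean_pool(hidden_states: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average frame vectors over time, ignoring padded frames.

    hidden_states: (batch, frames, dim)
    mask:          (batch, frames), 1 for real frames and 0 for padding
    """
    mask = mask.astype(hidden_states.dtype)[..., None]  # (batch, frames, 1)
    summed = (hidden_states * mask).sum(axis=1)          # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)        # avoid division by zero
    return summed / counts


# Toy batch: the first utterance has two real frames and one padded frame.
hidden = np.array(
    [
        [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],
        [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
    ]
)
mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
print(pooled.tolist())  # [[2.0, 3.0], [4.0, 4.0]]
```

Pooled vectors of this kind can serve as fixed-size utterance embeddings for a downstream classifier.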

license: cc-by-nc-4.0
language:

af
am
ar
as
az
be
bn
bs
bg
ca
cs
zh
cy
da
de
el
en
et
fi
fr
or
om
ga
gl
gu
ha
he
hi
hr
hu
hy
ig
id
is
it
jv
ja
kn
ka
kk
mn
km
ky
ko
lo
ln
lt
lb
lg
lv
ml
mr
mk
mt
mi
my
nl
nb
ne
ny
oc
pa
ps
fa
pl
pt
ro
ru
sk
sl
sn
sd
so
es
sr
sv
sw
ta
te
tg
tl
th
tr
uk
ur
uz
vi
wo
xh
yo
ms
zu
ary
arz
yue
kea
tags:
- audio-to-audio
- text-to-speech
multilinguality:
- multilingual
task_categories:
- audio-classification
library_name: transformers
pretty_name: SeamlessM4Tv2-Large Speech Encoder

SeamlessM4Tv2-Large Speech Encoder

This repository carves out the speech encoder from SeamlessM4Tv2-Large, which performs strongly on cross- and multilingual sequence-level audio classification tasks (cf. results on SIB-Fleurs available here).

All credits go to the original SeamlessM4Tv2-Large Team.

Example Usage

You can use both AutoModel and AutoModelForAudioClassification (or AutoModelForSequenceClassification, if you prefer) with this repository:

# best to use both feature extractor and model with GPU!
from datasets import load_dataset
from transformers import (
    AutoModel,
    AutoModelForAudioClassification,
    AutoFeatureExtractor,
)
import torch
import torchaudio

device = "cuda:0"

feature_extractor = AutoFeatureExtractor.from_pretrained(
    "WueNLP/seamless-m4t-v2-large-speech-encoder", trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "WueNLP/seamless-m4t-v2-large-speech-encoder",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to(device)

audio, orig_freq = torchaudio.load(
    "https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav"
)
audio = torchaudio.functional.resample(
    audio, orig_freq=orig_freq, new_freq=16_000
)  # must be a 16 kHz waveform array
# return_attention_mask=True for batching
audio_inputs = feature_extractor(
    audio,
    sampling_rate=16_000,  # state the rate explicitly, as in the batched call below
    return_attention_mask=True,
    return_tensors="pt",
    device=device,
)
audio_inputs = audio_inputs.to(device)
with torch.autocast(dtype=torch.bfloat16, device_type="cuda"):
    # cast to float32 before converting: NumPy has no bfloat16 dtype
    audio_hidden_states = model(**audio_inputs)[0].detach().float().cpu().numpy().squeeze()


# instantiate a model for AudioClassification
model = AutoModelForAudioClassification.from_pretrained(
    "WueNLP/seamless-m4t-v2-large-speech-encoder",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    # SIB-Fleurs has 7 labels
    num_labels=7,
).to(device)
eng_Latn = load_dataset("wuenlp/sib-fleurs", "eng_Latn", split="train")
examples = [eng_Latn[i] for i in range(5)]
labels = torch.LongTensor([example["category"] for example in examples]).to(device)
batch = feature_extractor(
    # each instance typically holds several utterances; [0] keeps only the first
    [example["audio"][0]["array"] for example in examples],
    sampling_rate=16000,
    device=device,
    return_attention_mask=True,
    return_tensors="pt",
).to(device)
batch["labels"] = labels
with torch.autocast(dtype=torch.bfloat16, device_type="cuda"):
    # outputs comprises loss & logits
    outputs = model(**batch)
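The classification call above returns a loss and logits. As a minimal sketch of what to do with the logits, the snippet below converts them to class probabilities and predicted label indices and computes accuracy against gold labels. It runs on made-up numbers rather than real model output. With the real model, an array like this could presumably be obtained via `outputs.logits.float().cpu().numpy()`, assuming the output object exposes a `.logits` attribute as standard transformers classification outputs do. The function name `predict_from_logits` is hypothetical.

```python
import numpy as np


def predict_from_logits(logits: np.ndarray):
    """Return (predicted class indices, softmax probabilities)."""
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs


# Made-up logits for two utterances over the 7 SIB-Fleurs labels.
logits = np.array(
    [
        [0.1, 2.0, -1.0, 0.0, 0.3, -0.5, 0.2],
        [1.5, 0.0, 0.2, -0.3, 0.1, 3.0, 0.0],
    ]
)
labels = np.array([1, 0])  # made-up gold labels

preds, probs = predict_from_logits(logits)
accuracy = float((preds == labels).mean())
print(preds.tolist(), accuracy)  # [1, 5] 0.5
```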

Citation

If you use this model, please cite the original SeamlessM4Tv2 paper.

@misc{communication2023seamlessmultilingualexpressivestreaming,
      title={Seamless: Multilingual Expressive and Streaming Speech Translation}, 
      author={Seamless Communication and Loïc Barrault and Yu-An Chung and Mariano Coria Meglioli and David Dale and Ning Dong and Mark Duppenthaler and Paul-Ambroise Duquenne and Brian Ellis and Hady Elsahar and Justin Haaheim and John Hoffman and Min-Jae Hwang and Hirofumi Inaguma and Christopher Klaiber and Ilia Kulikov and Pengwei Li and Daniel Licht and Jean Maillard and Ruslan Mavlyutov and Alice Rakotoarison and Kaushik Ram Sadagopan and Abinesh Ramakrishnan and Tuan Tran and Guillaume Wenzek and Yilin Yang and Ethan Ye and Ivan Evtimov and Pierre Fernandez and Cynthia Gao and Prangthip Hansanti and Elahe Kalbassi and Amanda Kallet and Artyom Kozhevnikov and Gabriel Mejia Gonzalez and Robin San Roman and Christophe Touret and Corinne Wong and Carleigh Wood and Bokai Yu and Pierre Andrews and Can Balioglu and Peng-Jen Chen and Marta R. Costa-jussà and Maha Elbayad and Hongyu Gong and Francisco Guzmán and Kevin Heffernan and Somya Jain and Justine Kao and Ann Lee and Xutai Ma and Alex Mourachko and Benjamin Peloquin and Juan Pino and Sravya Popuri and Christophe Ropers and Safiyyah Saleem and Holger Schwenk and Anna Sun and Paden Tomasello and Changhan Wang and Jeff Wang and Skyler Wang and Mary Williamson},
      year={2023},
      eprint={2312.05187},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2312.05187}, 
}
