W2v-BERT 2.0
A speech encoder based on the Conformer architecture, pretrained on 4.5 million hours of unlabeled audio data covering more than 143 languages.
Release date: 12/19/2023
Model Overview
W2v-BERT 2.0 is a speech encoder built on the Conformer architecture and pretrained on large-scale multilingual audio data; it serves as a foundation model for downstream speech processing tasks.
Model Features
Large-scale multilingual pretraining
Pretrained on 4.5 million hours of unlabeled audio data, covering over 143 languages
Advanced architecture
Adopts the Conformer architecture, combining the strengths of convolutional and Transformer layers
Flexible applications
Can be fine-tuned as a foundation model for a variety of downstream speech processing tasks
Model Capabilities
Speech feature extraction
Multilingual speech processing
Audio embedding generation
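The embedding generation above can be sketched as simple frame arithmetic. The numbers below reflect the commonly reported setup for this model (16 kHz mono input, 10 ms mel-frame stride, pairs of frames stacked before the encoder, hidden size 1024); they are assumptions here, so check the model configuration before relying on them.

```python
# Sketch: expected shape of w2v-BERT 2.0 embeddings for an audio clip.
# Assumed preprocessing: 16 kHz audio -> log-mel frames every 10 ms,
# stacked in pairs -> one 1024-dim encoder state per 20 ms of audio.

SAMPLE_RATE = 16_000      # Hz (assumed)
MEL_STRIDE_MS = 10        # mel hop size in milliseconds (assumed)
STACK = 2                 # mel frames stacked per encoder step (assumed)
HIDDEN_SIZE = 1024        # encoder width (assumed)

def embedding_shape(num_samples: int) -> tuple[int, int]:
    """Return (num_frames, hidden_size) for a mono clip of num_samples."""
    mel_frames = num_samples * 1000 // (SAMPLE_RATE * MEL_STRIDE_MS)
    return mel_frames // STACK, HIDDEN_SIZE

# A 3-second clip -> roughly 150 encoder states of width 1024.
print(embedding_shape(3 * SAMPLE_RATE))  # → (150, 1024)
```

Under these assumptions, embedding sequence length scales linearly with clip duration at about 50 states per second.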
Use Cases
Speech recognition
Automatic Speech Recognition (ASR)
Achieves high-accuracy speech-to-text conversion through model fine-tuning
Supports speech recognition in multiple languages
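One common fine-tuning recipe for ASR is to add a CTC head on top of the encoder; the head then emits one label per encoder frame, and decoding collapses repeated labels and drops blanks. A minimal greedy-decoding sketch (the vocabulary and blank ID are illustrative, not the model's):

```python
# Sketch: greedy CTC decoding of per-frame label IDs, as produced by a
# CTC head fine-tuned on top of the speech encoder for ASR.
# The blank ID and character vocabulary below are toy illustrations.

BLANK = 0
VOCAB = {1: "h", 2: "e", 3: "l", 4: "o"}

def ctc_greedy_decode(frame_ids: list[int]) -> str:
    """Collapse consecutive repeats, then remove blank labels."""
    out = []
    prev = None
    for label in frame_ids:
        if label != prev and label != BLANK:
            out.append(VOCAB[label])
        prev = label
    return "".join(out)

# The blank between the two 3s keeps the double "l" from collapsing.
print(ctc_greedy_decode([1, 1, 2, 0, 3, 3, 0, 3, 4, 0]))  # → hello
```

The same collapse-then-drop-blanks rule applies regardless of vocabulary size, which is what lets one encoder serve many languages with language-specific CTC heads.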
Audio analysis
Audio classification
Utilizes extracted audio features for classification tasks
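A simple way to use the extracted frame embeddings for classification is to mean-pool them into a single clip vector and apply a linear classifier. A toy sketch (the embedding dimension, weights, and inputs are made-up values for illustration):

```python
# Sketch: audio classification on top of frame embeddings by
# mean-pooling frames into one clip vector, then a linear layer + argmax.
# All dimensions and weights here are toy values.

def mean_pool(frames: list[list[float]]) -> list[float]:
    """Average a sequence of frame embeddings into one clip vector."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]

def classify(clip_vec: list[float], weights: list[list[float]]) -> int:
    """Score each class with a dot product and return the argmax index."""
    scores = [sum(w * x for w, x in zip(row, clip_vec)) for row in weights]
    return scores.index(max(scores))

frames = [[0.2, 0.8], [0.4, 0.6]]   # 2 frames, 2-dim toy embeddings
W = [[1.0, 0.0], [0.0, 1.0]]        # 2 classes, identity-like weights
pooled = mean_pool(frames)          # [0.3, 0.7]
print(classify(pooled, W))          # → 1
```

In practice the pooled vector would be the encoder's 1024-dim output and the linear layer would be trained on labeled clips; the pooling-then-classify structure stays the same.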