Wav2Vec2-large-960h-lv60-self: Open-Source Speech Recognition Model

Wav2vec2 Large 960h Lv60 Self

Developed by Facebook

A large Wav2Vec2 model developed by Facebook, pre-trained and fine-tuned on 960 hours of Libri-Light and Librispeech audio data with a self-training objective. It reports 1.9/3.9 WER on the LibriSpeech clean/other test sets.

Speech Recognition
English
Open Source License: Apache-2.0
#High-precision speech recognition #Self-supervised pre-training #Low-resource adaptation

Downloads 56.00k

Release Time : 3/2/2022

Model Overview

A pre-trained model for automatic speech recognition (ASR) that learns speech representations from raw audio through self-supervised learning, then achieves high-precision speech-to-text conversion via fine-tuning.

Model Features

Self-supervised Pre-training

Learns speech representations in latent space through contrastive learning objectives, reducing reliance on labeled data

High-precision Recognition

Reports a WER of 1.9/3.9 (clean/other) on the LibriSpeech test sets

Low-resource Adaptation

Requires only a small amount of labeled data for fine-tuning: according to the paper, wav2vec 2.0 fine-tuned on one hour of labeled data outperforms the previous state of the art trained on the 100-hour subset

Model Capabilities

English speech recognition

16kHz audio processing

End-to-end speech-to-text

Use Cases

Speech Transcription

Automated Meeting Minutes

Automatically converts English meeting recordings into text transcripts

High-accuracy transcription, reducing manual documentation costs

Podcast Subtitle Generation

Automatically generates subtitles for English podcast content

Supports batch processing; the reported 1.9/3.9 WER corresponds to word accuracy above 96% on the LibriSpeech test sets

Assistive Technology

Hearing Impairment Assistance

Real-time speech-to-text conversion for hearing-impaired individuals

Low-latency real-time conversion

🚀 Wav2Vec2-Large-960h-Lv60 + Self-Training

This is a large model that has been pretrained and fine-tuned on 960 hours of Libri-Light and Librispeech speech audio sampled at 16kHz. It was trained with the Self-Training objective. When using this model, ensure that your speech input is also sampled at 16kHz.
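Because the model expects 16kHz input, it can help to check a recording's sampling rate before transcription. The following is a minimal sketch that uses only Python's standard `wave` module. The helper names `wav_sample_rate` and `needs_resampling` are illustrative and not part of the model's API, and the sketch assumes uncompressed PCM WAV files.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate (Hz) that the model card asks for


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True when the file's sampling rate differs from the expected one."""
    return wav_sample_rate(path) != expected_rate
```

If the check reports a different rate, resample the audio to 16kHz with an audio library of your choice before passing it to the processor.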

Model Information

Property	Details
Model Type	wav2vec2-large-960h-lv60
Training Data	Libri-Light and Librispeech (960 hours, 16kHz sampled speech audio)
License	apache-2.0
Tags	speech, audio, automatic-speech-recognition, hf-asr-leaderboard

Results

Task	Dataset	Test WER (%)
Automatic Speech Recognition	LibriSpeech (clean)	1.9
Automatic Speech Recognition	LibriSpeech (other)	3.9
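For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. The sketch below is a plain-Python illustration for a single transcript pair. The function name `word_error_rate` is hypothetical, and the sketch is not the tool used to produce the numbers above.

```python
def word_error_rate(reference, hypothesis):
    """WER for one transcript pair: (substitutions + deletions + insertions) / reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein dynamic program).
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)
```

The function returns a fraction, so a value of 0.019 corresponds to the 1.9 shown in the table.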

Paper and Authors

Paper
Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

Abstract

The paper shows for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
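The contrastive task mentioned in the abstract can be illustrated with a small sketch: for one masked time step, the context vector should be more similar to the true quantized latent than to a set of distractors. The code below is a simplified plain-Python illustration, not the original training code. The function names, the cosine similarity, and the default temperature of 0.1 are assumptions made for this example, and the quantization module and any further loss terms are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking the true target among all candidates."""
    candidates = [target] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    # log-sum-exp with the maximum subtracted for numerical stability
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_norm - logits[0]
```

The loss is close to zero when the context vector matches the true target and grows when it is closer to a distractor.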

The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.

🚀 Quick Start

To get started with this model, follow the steps below.

💻 Usage Examples

Basic Usage

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

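# extract input features (the model expects 16kHz audio)
sampling_rate = ds[0]["audio"]["sampling_rate"]
input_values = processor(ds[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt", padding="longest").input_values

# retrieve logits (gradients are not needed for inference)
with torch.no_grad():
    logits = model(input_values).logits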

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
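The argmax-and-decode step is greedy CTC decoding: consecutive repeated frame predictions are merged, blank tokens are dropped, and the remaining characters are joined into words. The sketch below illustrates the idea in plain Python. It is not the library's implementation, and the toy vocabulary, the blank id of 0, and the "|" word delimiter are assumptions chosen for this example rather than values read from the model's tokenizer.

```python
from itertools import groupby

# Hypothetical toy vocabulary for illustration only.
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}


def ctc_greedy_decode(frame_ids, vocab=TOY_VOCAB, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame ids, drop blanks, and map the rest to text."""
    # 1) merge runs of identical consecutive predictions
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2) remove the blank token, which only separates repeated characters
    tokens = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    # 3) join characters and turn the word delimiter into spaces
    return "".join(tokens).replace(word_delimiter, " ").strip()
```

Note that a blank between two identical ids is what allows a doubled letter, such as the two L's in "HELLO", to survive the collapsing step.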

Evaluation

This code snippet shows how to evaluate facebook/wav2vec2-large-960h-lv60-self on LibriSpeech test data. It loads the "clean" configuration; to evaluate on "other", pass "other" as the second argument to load_dataset.

from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

def map_to_pred(batch):
    # map() is not batched here, so each call receives a single example
    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest")
    input_values = inputs.input_values.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    # batch_decode returns a list; keep the single string so jiwer receives flat lists
    batch["transcription"] = processor.batch_decode(predicted_ids)[0]
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

# jiwer returns a fraction; multiply by 100 to compare with the percentages reported for this model
print("WER:", wer(result["text"], result["transcription"]))

Result (WER, %):

"clean"	"other"
1.9	3.9
