wav2vec2-large-960h Open-source Speech Recognition Model - Free Deployment for High-precision Speech Transcription

Wav2vec2 Large 960h

Developed by facebook

Wav2Vec2 is a speech recognition model developed by Facebook. It learns speech representations from raw audio through self-supervised learning and is fine-tuned on the LibriSpeech dataset to achieve high-accuracy speech transcription.

Speech Recognition

Transformers

Language: English

Open Source License: Apache-2.0

Tags: #Speech-to-Text #High-Accuracy Speech Recognition #Low-Resource Speech Processing

Downloads 77.59k

Release Time : 3/2/2022

Model Overview

This model is pre-trained and fine-tuned on 960 hours of LibriSpeech audio sampled at 16 kHz and is suitable for English speech recognition tasks.

Model Features

Self-Supervised Learning

Learns speech representations from raw audio, reducing reliance on large amounts of labeled data.

High-Accuracy Transcription

Achieves a word error rate (WER) of 2.8%/6.3% on the LibriSpeech "clean"/"other" test sets.

Low-Resource Adaptation

The wav2vec 2.0 approach performs well even with limited labeled data (the paper reports 4.8/8.2 WER with only ten minutes of labeled speech), making it suitable for resource-constrained scenarios.

Model Capabilities

English Speech Recognition

Audio Transcription

Speech Processing

Use Cases

Speech Transcription

Meeting Minutes

Automatically transcribes meeting recordings into text for easy archiving and retrieval.

High-accuracy transcription, with a word error rate of 2.8% on the LibriSpeech "clean" test set.

Voice Assistants

Used in the speech recognition module of voice assistants to enhance interaction.

Supports real-time speech recognition with fast response times.

Education

Language Learning

Helps language learners practice pronunciation and listening with instant feedback.

High-accuracy recognition of pronunciation errors, improving learning efficiency.

🚀 Wav2Vec2-Large-960h

A large model pretrained and fine-tuned on 960 hours of Librispeech for speech recognition

This is a large model pretrained and fine-tuned on 960 hours of LibriSpeech speech audio sampled at 16 kHz. When using this model, make sure that your speech input is also sampled at 16 kHz.
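Because the model expects 16 kHz input, it can help to check a file's sampling rate before transcription. The following is a minimal sketch using only the Python standard library; the helper names (`wav_sample_rate`, `needs_resampling`, `make_silent_wav`) are hypothetical and not part of any library, and the sketch assumes uncompressed WAV input.

```python
import io
import wave

EXPECTED_RATE = 16_000  # sampling rate (Hz) the model was trained on


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Return the sampling rate stored in a WAV file's header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(wav_bytes: bytes) -> bool:
    """True when the audio must be resampled before it reaches the model."""
    return wav_sample_rate(wav_bytes) != EXPECTED_RATE


def make_silent_wav(rate: int, seconds: float = 0.1) -> bytes:
    """Build a short mono 16-bit silent WAV in memory (demo fixture only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))
    return buffer.getvalue()


if __name__ == "__main__":
    for rate in (16_000, 44_100):
        clip = make_silent_wav(rate)
        print(rate, "needs resampling:", needs_resampling(clip))
```

If the check reports a mismatch, resample the audio with an audio library before passing it to the processor. With the `datasets` library, for instance, casting the audio column with `Audio(sampling_rate=16_000)` resamples on load.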

🔍 Details

Source: Facebook's Wav2Vec2
Paper: wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

📄 Abstract

We demonstrate for the first time that learning powerful representations from speech audio alone and then fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When reducing the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state-of-the-art on the 100-hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This shows the feasibility of speech recognition with limited amounts of labeled data.
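For readers who want the contrastive task in symbols, the following is a sketch of the objective as formulated in the wav2vec 2.0 paper. The notation is not defined on this page: c_t is the context network output at a masked time step t, q_t is the true quantized latent for that step, Q_t is the candidate set containing q_t and a number of distractors, and κ is a temperature. The paper's full training loss adds further terms that are omitted here.

```latex
\mathcal{L}_m = -\log
  \frac{\exp\left(\mathrm{sim}(\mathbf{c}_t, \mathbf{q}_t)/\kappa\right)}
       {\sum_{\tilde{\mathbf{q}} \in Q_t} \exp\left(\mathrm{sim}(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa\right)},
\qquad
\mathrm{sim}(\mathbf{a}, \mathbf{b}) =
  \frac{\mathbf{a}^{\top}\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
```

Minimizing this loss pushes the model to pick out the true quantized latent from among the distractors for each masked step.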

The original model can be found at https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.

🚀 Quick Start

💻 Usage Examples

Basic Usage

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# tokenize
input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
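The "take argmax and decode" step above is greedy CTC decoding. As a rough illustration of what decoding does with the per-frame ids, the sketch below collapses repeated ids, drops the blank symbol, and turns the word delimiter into spaces. The toy vocabulary and the function name `ctc_greedy_decode` are hypothetical; the sketch assumes the blank symbol is the padding token and that "|" marks word boundaries, so check the processor's tokenizer for the actual vocabulary.

```python
from itertools import groupby

# Hypothetical toy vocabulary for illustration only; the real vocabulary
# ships with the processor's tokenizer.
BLANK_ID = 0
ID_TO_TOKEN = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O", 6: "I"}


def ctc_greedy_decode(frame_ids, id_to_token=ID_TO_TOKEN,
                      blank_id=BLANK_ID, word_delimiter="|"):
    """Turn a sequence of per-frame argmax ids into text."""
    # 1. Merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop the blank symbol, which only separates repeated characters.
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # 3. Join characters and map the word delimiter to a space.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Per-frame ids as they might come out of torch.argmax(logits, dim=-1)[0].
frames = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 1, 2, 6, 0]
print(ctc_greedy_decode(frames))  # prints "HELLO HI"
```

The blank between the two runs of `L` is what lets the double letter in "HELLO" survive the collapsing step.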

Advanced Usage

# Evaluate the model on LibriSpeech's "clean" test data
# (load the "other" configuration instead to evaluate on the "other" test set)
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")

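def map_to_pred(batch):
    # with batched=True, batch["audio"] holds one decoded audio entry per example
    audio_arrays = [audio["array"] for audio in batch["audio"]]
    input_values = processor(audio_arrays, sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    # greedy CTC decoding: most likely token per frame, then detokenize
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

# drop the decoded audio from the result; the "text" column is kept for scoring
result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])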

print("WER:", wer(result["text"], result["transcription"]))

📊 Evaluation Results

LibriSpeech test set	"clean"	"other"
WER (%)	2.8	6.3
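For context on how these numbers are computed, word error rate is the word-level edit distance between the reference and the predicted transcript (substitutions, deletions, and insertions) divided by the number of reference words. The following is a minimal pure-Python sketch of that definition; the function name `word_error_rate` is hypothetical, and evaluation libraries such as `jiwer` may apply extra text normalization that this sketch does not.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1      # reference word missing
            insertion = current_row[j - 1] + 1  # extra hypothesis word
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the")
# over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

The function returns a fraction, so a return value of 0.028 corresponds to the 2.8% shown in the table.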

📦 Additional Information

📄 License

This project is licensed under the Apache 2.0 license.

🔖 Tags

speech

📈 Datasets

librispeech_asr
