Wav2vec2-large-100h-lv60-self Open-source Speech Recognition Model - Free Deployment for Accurate Speech Recognition

Wav2vec2 Large 100h Lv60 Self

Developed by Splend1dchan

Wav2Vec2-Large-100h-Lv60 is a large model pre-trained and fine-tuned on 100 hours of Libri-Light and Librispeech speech data with a self-training objective. It is suited to speech recognition on audio sampled at 16 kHz.

Speech Recognition

Transformers

English

Open Source License: Apache-2.0

#Self-supervised speech recognition #Low-resource speech processing #High-accuracy speech transcription

Downloads 17

Release Time : 4/12/2022

Model Overview

This model is an automatic speech recognition (ASR) model that learns speech representations from raw audio through self-supervised learning and achieves high-performance speech recognition with limited labeled data.

Model Features

Self-supervised Learning

Trained with self-training objectives, capable of learning effective speech representations with limited labeled data

Efficient Speech Recognition

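The wav2vec 2.0 paper reports a word error rate (WER) of 1.8/3.3 on the Librispeech clean/other test sets when all labeled data is used; no test WER is listed for this checkpoint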

Low-resource Adaptation

The wav2vec 2.0 paper reports 4.8/8.2 WER (clean/other) with only ten minutes of labeled data after pre-training on 53k hours of unlabeled audio

Model Capabilities

Speech recognition

Audio feature extraction

English speech transcription

Use Cases

Speech-to-text

Meeting Minutes

Automatically transcribe English meeting recordings into text records

Podcast Transcription

Automatically convert English podcast content into text transcripts

Voice Assistant

Voice Command Recognition

Recognize and understand English voice commands

🚀 Wav2Vec2-Large-100h-Lv60 + Self-Training

This is a direct state_dict transfer from fairseq to Hugging Face; the weights are identical.

This large model is pretrained and fine-tuned on 100 hours of Libri-Light and Librispeech with 16 kHz sampled speech audio. It was trained with the Self-Training objective. When using the model, ensure that your speech input is also sampled at 16 kHz.

Key Information

Property	Details
Datasets	librispeech_asr
Tags	speech, audio, automatic-speech-recognition, hf-asr-leaderboard
License	apache-2.0
Model Name	wav2vec2-large-100h-lv60
Task	Automatic Speech Recognition
Dataset for Evaluation	Librispeech (clean)
Metrics	Test WER (value: None)

Authors

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

Abstract

The authors show for the first time that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations, which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
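The contrastive task mentioned in the abstract can be illustrated with a small sketch. For one masked time step, the model's context vector should be more similar to the true quantized latent than to a set of distractors sampled from other time steps. The function names, the toy vectors, and the default temperature of 0.1 are assumptions made for illustration; this is not the fairseq training code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking the true target among all candidates.

    context:     context vector at a masked time step (hypothetical toy input)
    target:      true quantized latent for that time step
    distractors: quantized latents taken from other time steps
    temperature: scaling factor for the similarities (assumed default)
    """
    candidates = [target] + list(distractors)
    logits = [cosine_similarity(context, q) / temperature for q in candidates]
    # log-sum-exp with the maximum subtracted for numerical stability
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # index 0 is the true target
    return log_sum_exp - logits[0]
```

The loss is close to zero when the context vector aligns with the true target and grows when a distractor is the better match.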

🚀 Quick Start

To use this model, make sure your speech input is sampled at 16 kHz.

💻 Usage Examples

Basic Usage

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("Splend1dchan/wav2vec2-large-100h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("Splend1dchan/wav2vec2-large-100h-lv60-self")

# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# convert the raw waveform to model inputs, stating the 16 kHz sampling rate explicitly
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values

# retrieve logits (inference only, so gradients are not needed)
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
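The "take argmax and decode" step relies on greedy CTC decoding: repeated predictions are merged, blank tokens are dropped, and the remaining characters are joined into words. The processor handles this internally. The sketch below is a simplified illustration with a hypothetical four-token vocabulary, in which `<pad>` plays the role of the blank and `|` marks word boundaries; it is not the tokenizer's actual implementation.

```python
from itertools import groupby

def ctc_greedy_collapse(token_ids, blank_id=0):
    """Merge consecutive duplicate ids, then drop the blank id."""
    merged = [token_id for token_id, _ in groupby(token_ids)]
    return [token_id for token_id in merged if token_id != blank_id]

def ids_to_text(token_ids, vocab, blank_id=0, word_delimiter="|"):
    """Turn per-frame argmax ids into a transcription string."""
    chars = [vocab[token_id] for token_id in ctc_greedy_collapse(token_ids, blank_id)]
    return "".join(chars).replace(word_delimiter, " ").strip()

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I"}
frame_ids = [2, 2, 0, 3, 3, 1, 0, 2, 3]
print(ids_to_text(frame_ids, toy_vocab))  # HI HI
```

A blank between two identical ids keeps them separate, which is how CTC can emit doubled letters: `[2, 0, 2]` collapses to `[2, 2]` rather than `[2]`.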

Advanced Usage - Evaluation

This code snippet shows how to evaluate Splend1dchan/wav2vec2-large-100h-lv60-self on LibriSpeech's "clean" test data. To evaluate on the "other" test data, load the "other" configuration instead.

from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

# use "other" instead of "clean" to evaluate on the other test set
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("Splend1dchan/wav2vec2-large-100h-lv60-self").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("Splend1dchan/wav2vec2-large-100h-lv60-self")

def map_to_pred(batch):
    # map() runs without batched=True, so "batch" holds a single example
    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest")
    input_values = inputs.input_values.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    # batch_decode returns a list; keep the single string for this example
    batch["transcription"] = processor.batch_decode(predicted_ids)[0]
    return batch

# the dataset stores waveforms in the "audio" column, so that is the column to drop
result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])
print("WER:", wer(result["text"], result["transcription"]))
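The WER value printed by the evaluation script is a word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below computes it for a single reference/hypothesis pair in plain Python, as an illustration of the metric. The function name `word_error_rate` is hypothetical, and the sketch does no text normalization, so its results are not guaranteed to match jiwer on every input.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the distance between the reference words seen
    # so far and the first j hypothesis words
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)
```

For example, `word_error_rate("the cat sat", "the bat sat")` counts one substitution against three reference words. Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.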

📄 License

This project is licensed under the Apache-2.0 license.
