wav2vec2-large-tedlium Open-source Speech Recognition Model - Free and Precise English Speech-to-Text Conversion

Wav2vec2 Large Tedlium

Developed by sanchit-gandhi

Wav2Vec2 large speech recognition model fine-tuned on the TEDLIUM corpus, supporting English speech-to-text conversion

Speech Recognition, English. Open Source License: Apache-2.0. Tags: TED Talk Transcription, High-Accuracy Speech Recognition, English Speech Processing

Release Time: 7/4/2022

Model Overview

This model is a large Wav2Vec2 model fine-tuned on the TEDLIUM corpus, specifically designed for English speech recognition tasks.

Model Features

High-Accuracy Speech Recognition

Achieves 8.2% Word Error Rate (WER) on the TEDLIUM test set

Large-Scale Pretraining

Pretrained on 60,000 hours of LibriVox audio

Domain Adaptation

Fine-tuned on 452 hours of TED Talk data

Model Capabilities

English Speech Recognition

Long Audio Processing

16 kHz Sampling Rate Audio Processing

Use Cases

Speech Transcription

TED Talk Transcription

Convert TED Talk audio into text

8.4% WER (development set)

Educational Content Transcription

Convert educational lectures and speeches into text

🚀 Wav2Vec2-Large-Tedlium

A Wav2Vec2 large model fine-tuned on the TEDLIUM corpus for speech recognition.

This model is initialized from Facebook's Wav2Vec2 large LV-60k checkpoint, which was pre-trained on 60,000 hours of audiobooks from the LibriVox project. It was then fine-tuned on 452 hours of TED talks from the TEDLIUM corpus (Release 3). When using the model, ensure that your speech input is sampled at 16 kHz.

The model achieves a word error rate (WER) of 8.4% on the dev set and 8.2% on the test set. The training logs document the training and evaluation progress over 50k steps of fine-tuning.
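
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python illustration of that definition; the function name `word_error_rate` is hypothetical, and it is not the code used to produce the figures above (the evaluation snippet in this card uses `jiwer`).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("fox" -> "dog") plus one insertion ("jumps") over 4 reference words.
print(word_error_rate("the quick brown fox", "the quick brown dog jumps"))  # 0.5
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.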

For more information on how this model was fine-tuned, see this notebook.

🚀 Quick Start

Prerequisites

Ensure your speech input is sampled at 16 kHz.
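
If your audio was recorded at a different rate (44.1 kHz is common), it needs to be resampled first. The following is one possible sketch using NumPy and SciPy's polyphase resampler; the helper name `resample_to_16k` is hypothetical, and other tools (for example the resampling utilities in `torchaudio` or `librosa`) would serve equally well.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # sampling rate expected by the model


def resample_to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz (hypothetical helper, not part of the model card)."""
    if orig_sr == TARGET_SR:
        return audio.astype(np.float32)
    # Reduce the rate ratio to the smallest integer up/down factors.
    g = gcd(TARGET_SR, orig_sr)
    resampled = resample_poly(audio, TARGET_SR // g, orig_sr // g)
    return resampled.astype(np.float32)


# Example: one second of a 440 Hz tone recorded at 44.1 kHz.
t = np.arange(44_100) / 44_100
tone = np.sin(2 * np.pi * 440 * t)
tone_16k = resample_to_16k(tone, orig_sr=44_100)
print(tone_16k.shape)  # (16000,)
```

The resulting array can be passed to the processor in place of `ds[0]["audio"]["array"]` in the usage example.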

Transcribing Audio Files

The model can be used as a standalone acoustic model to transcribe audio files.

💻 Usage Examples

Basic Usage

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")
model = Wav2Vec2ForCTC.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")

# load dummy dataset
ds = load_dataset("sanchit-gandhi/tedlium_dummy", split="validation")

# process audio inputs (the processor raises an error if the rate is not 16 kHz)
sample = ds[0]["audio"]
input_values = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print("Target: ", ds[0]["text"])
print("Transcription: ", transcription[0])
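
The "take argmax and decode" step is greedy CTC decoding: the model emits one token id per audio frame, consecutive repeats are merged, and the blank token is dropped. The sketch below illustrates that collapse rule on plain integer ids. The function name `ctc_greedy_collapse` is hypothetical and `blank_id=0` is an assumption made for illustration; in practice `processor.batch_decode` handles this for you, along with mapping ids back to characters, and the actual blank id depends on the tokenizer's vocabulary.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive duplicate ids, then remove blanks (illustrative only)."""
    # Step 1: keep one id per run of identical consecutive frames.
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: drop the blank token, which only separates repeated symbols.
    return [token_id for token_id in merged if token_id != blank_id]


# Frames: blank, 1, 1, blank, 1, 2, 2, blank
print(ctc_greedy_collapse([0, 1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```

Note that the blank between the two runs of `1` is what allows the same symbol to appear twice in a row in the output.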

Evaluation

The following code snippet shows how to evaluate Wav2Vec2-Large-Tedlium on the TEDLIUM test data.

Advanced Usage

from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

tedlium_eval = load_dataset("LIUM/tedlium", "release3", split="test")
model = Wav2Vec2ForCTC.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")

def map_to_pred(batch):
    # with batched=True, batch["audio"] is a list of decoded audio dicts
    arrays = [audio["array"] for audio in batch["audio"]]
    input_values = processor(arrays, sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    batch["transcription"] = processor.batch_decode(predicted_ids)
    return batch

# drop the decoded audio column once the transcriptions have been computed
result = tedlium_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])
print("WER:", wer(result["text"], result["transcription"]))

📄 License

This project is licensed under the Apache-2.0 license.

Property	Details
Model Type	Wav2Vec2 large model fine-tuned on the TEDLIUM corpus
Training Data	452h of TED talks from the TEDLIUM corpus (Release 3)
Tags	speech
Datasets	LIUM/tedlium
