wav2vec2-2-bart-large-tedlium is an open-source speech recognition model that transcribes English speech to text.

Wav2vec2 2 Bart Large Tedlium

Developed by sanchit-gandhi

A sequence-to-sequence automatic speech recognition model trained on the TEDLIUM corpus, combining the Wav2Vec2 speech encoder and the Bart text decoder

Tags: Speech Recognition, English, TED Speech Transcription, Low Word Error Rate, Speech Encoder-Text Decoder

Downloads 111

Release Time : 6/29/2022

Model Overview

This model performs English speech recognition. It pairs Wav2Vec2 as the speech encoder with Bart as the text decoder and reaches a word error rate of 6.4% on the TEDLIUM test set.

Model Features

Hybrid Architecture

Combines the Wav2Vec2 speech encoder with the Bart text decoder in a single sequence-to-sequence model

High Performance

Achieves a word error rate (WER) of 6.4% on the TEDLIUM test set

Pretrained Initialization

The encoder and decoder are initialized with the pretrained weights of Wav2Vec2 LV-60k and Bart large respectively

Model Capabilities

English Speech Recognition

Long Audio Processing

High-quality Transcription

Use Cases

Meeting Minutes

TED Speech Transcription

Automatically convert TED speech audio into a written transcript

Reported word error rate of 6.4% on the TEDLIUM test set

Education

Lecture Recording Transcription

Convert academic lecture recordings into text for notes or subtitles

🚀 Wav2Vec2-2-Bart-Large-Tedlium

This model is a sequence-to-sequence (seq2seq) model for automatic speech recognition. It is trained on the TEDLIUM corpus and combines a speech encoder with a text decoder.

✨ Features

This model is a sequence-to-sequence (seq2seq) model trained on the TEDLIUM corpus (release 3).
It combines a speech encoder with a text decoder to perform automatic speech recognition. The encoder weights are initialised with the Wav2Vec2 LV-60k checkpoint from @facebook, and the decoder weights with the Bart large checkpoint from @facebook.
When using the model, make sure that your speech input is sampled at 16 kHz.
The model achieves a word error rate (WER) of 9.0% on the dev set and 6.4% on the test set. The training logs document training and evaluation progress over 50k steps of fine-tuning.

Property	Details
Language	en
Tags	automatic-speech-recognition
Datasets	LIUM/tedlium
License	cc-by-4.0
Dev WER (%)	9.0
Test WER (%)	6.4
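
The WER figures above count word-level substitutions, deletions and insertions against the number of reference words. The sketch below shows that calculation for a single utterance in plain Python. The function name `word_error_rate` is a hypothetical helper, not part of this model or of `jiwer`, and it applies no text normalisation, so its numbers will not necessarily match the reported scores.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A corpus-level score, such as the one reported for the test set, is usually obtained by summing the edit counts over all utterances and dividing by the total number of reference words, which is not the same as averaging per-utterance rates.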

🚀 Quick Start

This model can be used for automatic speech recognition. Make sure that the speech input is sampled at 16 kHz.
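
One way to check the 16 kHz requirement before running the model is to read the sampling rate from the audio file header. The following is a minimal sketch using only the Python standard library; it assumes uncompressed PCM WAV input, and the helper names `wav_sample_rate` and `needs_resampling` are illustrative rather than part of any library.

```python
import wave

TARGET_SR = 16_000  # sampling rate the model expects, in Hz


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_sr=TARGET_SR):
    """Return True when the file's sampling rate differs from the target rate."""
    return wav_sample_rate(path) != target_sr
```

If a file does need resampling and the audio is loaded through the `datasets` library, casting the column with `ds.cast_column("audio", Audio(sampling_rate=16_000))` resamples it on access.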

💻 Usage Examples

Basic Usage

from datasets import load_dataset
from transformers import AutoProcessor, SpeechEncoderDecoderModel
import torch

# load model and processor
processor = AutoProcessor.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium")
model = SpeechEncoderDecoderModel.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium")

# load dummy dataset
ds = load_dataset("sanchit-gandhi/tedlium_dummy", split="validation")

# process audio inputs (passing the sampling rate lets the processor reject non-16 kHz audio)
audio = ds[0]["audio"]
input_values = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt", padding="longest").input_values  # batch size 1

# run inference (greedy search), without tracking gradients
with torch.no_grad():
    generated = model.generate(input_values)

# decode token ids to text
decoded = processor.batch_decode(generated, skip_special_tokens=True)
print("Target: ", ds["text"][0])
print("Transcription: ", decoded[0])
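
The `padding="longest"` argument in the example above matters once several clips of different lengths are batched together: shorter clips are extended to the length of the longest one. The sketch below illustrates the idea in plain Python. It is not the processor's implementation; the helper name `pad_to_longest`, zero as the padding value, and the 1/0 mask convention are assumptions made for illustration.

```python
def pad_to_longest(batch, pad_value=0.0):
    """Pad every sequence in `batch` to the length of the longest one.

    Returns the padded sequences and a mask with 1 for real samples
    and 0 for padding.
    """
    max_len = max(len(seq) for seq in batch)
    padded, mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


if __name__ == "__main__":
    values, attention = pad_to_longest([[0.1, 0.2, 0.3], [0.5]])
    print(values)
    print(attention)
```

With a batch size of 1, as in the example above, there is nothing to pad and the input is passed through unchanged.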

Advanced Usage

from datasets import load_dataset
from transformers import AutoProcessor, SpeechEncoderDecoderModel
import torch
from jiwer import wer

tedlium_eval = load_dataset("LIUM/tedlium", "release3", split="test")

def filter_ds(text):
    return text != "ignore_time_segment_in_scoring"

# remove samples ignored from scoring (filter keeps rows for which the function returns True)
tedlium_eval = tedlium_eval.filter(filter_ds, input_columns=["text"])

model = SpeechEncoderDecoderModel.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium").to("cuda")
processor = AutoProcessor.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium")

# beam search settings
gen_kwargs = {
    "max_length": 200,
    "num_beams": 5,
    "length_penalty": 1.2,
}

def map_to_pred(example):
    # transcribe one example at a time (non-batched map)
    audio = example["audio"]
    input_values = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        generated = model.generate(input_values.to("cuda"), **gen_kwargs)
    decoded = processor.batch_decode(generated, skip_special_tokens=True)
    example["transcription"] = decoded[0]
    return example

# drop the raw audio column once it has been transcribed
result = tedlium_eval.map(map_to_pred, remove_columns=["audio"])
print("WER:", wer(result["text"], result["transcription"]))
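
The `length_penalty` entry in `gen_kwargs` affects how finished beams are ranked. A minimal sketch of the usual convention follows, in which a hypothesis's summed log-probability is divided by its length raised to the penalty. The function names and the candidate scores are made up for illustration, and the exact length used by a given library version may differ.

```python
def length_normalised_score(sum_logprobs, length, length_penalty=1.0):
    """Summed log-probability divided by length ** length_penalty."""
    if length <= 0:
        raise ValueError("length must be positive")
    return sum_logprobs / (length ** length_penalty)


def best_hypothesis(hypotheses, length_penalty=1.0):
    """Pick the text with the highest normalised score.

    `hypotheses` is a list of (text, sum_logprobs, length) tuples.
    """
    best = max(
        hypotheses,
        key=lambda h: length_normalised_score(h[1], h[2], length_penalty),
    )
    return best[0]


if __name__ == "__main__":
    # Hypothetical candidates: (text, summed log-probability, token count)
    candidates = [("short", -4.0, 4), ("a longer hypothesis", -6.0, 8)]
    print(best_hypothesis(candidates, length_penalty=0.0))
    print(best_hypothesis(candidates, length_penalty=1.2))
```

Because log-probabilities are negative, a larger positive penalty shrinks the magnitude of long hypotheses' scores more, so a value such as 1.2 tilts the ranking towards longer transcriptions.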

📄 License

This model is released under the cc-by-4.0 license.
