Wav2Vec2 Base 10K VoxPopuli FT En
A Wav2Vec2 base model pre-trained on a 10K unlabeled subset of the VoxPopuli corpus and fine-tuned on English transcription data, suitable for English speech recognition tasks.
Release Time: 3/2/2022
Model Overview
This is Facebook's Wav2Vec2 base model, pre-trained on a 10K unlabeled subset of the VoxPopuli corpus and then fine-tuned on English transcription data; it is intended primarily for English automatic speech recognition (ASR).
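A minimal inference sketch using the Transformers pipeline API, assuming the checkpoint is published on the Hugging Face Hub as facebook/wav2vec2-base-10k-voxpopuli-ft-en and that sample.wav is a 16 kHz English recording you supply yourself (both the model ID and the file path are assumptions here):

```python
from transformers import pipeline

# Load the checkpoint as an automatic-speech-recognition pipeline.
# "sample.wav" is a placeholder path for an English recording.
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-10k-voxpopuli-ft-en",
)

# The pipeline decodes the audio file, runs the model, and returns the transcript.
print(asr("sample.wav")["text"])
```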
Model Features
VoxPopuli Pre-training
Pre-trained on a 10K unlabeled subset of the large-scale multilingual VoxPopuli speech corpus
English Transcription Fine-tuning
Fine-tuned on English transcription data to optimize English speech recognition performance
End-to-End Speech Recognition
Maps raw audio waveforms directly to text output, with no hand-crafted intermediate features such as MFCCs (see the inference sketch below)
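To illustrate the end-to-end path, here is a sketch of manual inference with Wav2Vec2Processor and Wav2Vec2ForCTC: raw waveform in, greedy CTC-decoded text out. The Hub model ID and the example.wav path are assumptions, and torchaudio is used only to load and resample the local file.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_id = "facebook/wav2vec2-base-10k-voxpopuli-ft-en"  # assumed Hub ID
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Load a local recording (placeholder path), downmix to mono, and resample
# to the 16 kHz rate the model expects.
waveform, sample_rate = torchaudio.load("example.wav")
speech = torchaudio.functional.resample(waveform.mean(dim=0), sample_rate, 16_000)

# The processor only normalizes the raw waveform; no hand-crafted features.
inputs = processor(speech.numpy(), sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, frames, vocab)

# Greedy CTC decoding: take the most likely token per frame, then collapse
# repeats and blanks into the final character sequence.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```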
Model Capabilities
English speech recognition
Audio transcription
Automatic speech-to-text
Use Cases
Speech Transcription
Meeting Minutes
Automatically transcribe English meeting recordings into text records
Podcast Transcription
Convert English podcast episodes into searchable text (a long-form transcription sketch follows this list)
Assistive Technology
Speech-to-Text Tool
Provide real-time speech-to-text services for the hearing impaired
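For long recordings such as the meeting and podcast scenarios above, the pipeline can split the audio into overlapping chunks so memory use stays bounded. A sketch under the same assumptions (Hub model ID; episode.mp3 is a placeholder path):

```python
from transformers import pipeline

# Chunk the audio into 30-second windows with 5-second overlapping strides
# so words at chunk boundaries are not cut off.
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-10k-voxpopuli-ft-en",
    chunk_length_s=30,
    stride_length_s=(5, 5),
)

print(asr("episode.mp3")["text"])
```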