wav2vec2-base-10k-voxpopuli-ft-pl Open-source Speech Recognition Model

Wav2vec2 Base 10k Voxpopuli Ft Pl

Developed by facebook

Pre-trained on the 10K unlabeled subset of the VoxPopuli corpus and fine-tuned on transcribed Polish data

Speech Recognition

Transformers

Other

#Polish speech recognition #VoxPopuli fine-tuning #Multilingual pre-training

Downloads 203

Release Time: 3/2/2022

Model Overview

This model is a Polish fine-tune of Facebook's Wav2Vec2 base architecture. It targets Polish speech recognition and converts raw audio directly to text.

Model Features

Multilingual pre-training

Pre-trained on the VoxPopuli multilingual corpus, with cross-lingual representation capabilities

Polish optimization

Fine-tuned specifically for Polish speech characteristics to improve recognition accuracy

End-to-end recognition

Generates text output directly from raw audio input, without a separate hand-crafted feature extraction step

Model Capabilities

Polish speech recognition

Audio to text

Automatic speech transcription

Use Cases

Speech transcription

Automated meeting minutes

Automatically convert Polish meeting recordings into text transcripts

Voice assistants

Provide voice interaction capabilities for Polish-speaking users

Accessibility technology

Real-time caption generation

Provide real-time captions for audio content in Polish for hearing-impaired users

🚀 Wav2Vec2-Base-VoxPopuli-Finetuned

This is a fine-tuned model based on Facebook's Wav2Vec2. The base model was pre-trained on the 10K unlabeled subset of the VoxPopuli corpus and then fine-tuned on the transcribed Polish data (for more information, refer to Table 1 of the paper).

🚀 Quick Start

✨ Features

Based on the Wav2Vec2 architecture, suitable for automatic speech recognition tasks.
Fine-tuned on Polish transcribed data, optimized for the Polish language.

📦 Installation

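The original model card does not list installation steps. The usage example in this card imports transformers, datasets, torchaudio, and torch, so those packages need to be available.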
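As a rough aid, the sketch below checks whether the four packages imported by the usage example in this card can be found in the current environment. The helper name `missing_packages` and the `REQUIRED` tuple are illustrative assumptions, not part of the model card; each package can typically be installed with pip under the same name.

```python
import importlib.util

# Packages imported by the usage example in this card.
REQUIRED = ("torch", "torchaudio", "transformers", "datasets")


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All packages used by the usage example are importable.")
```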

💻 Usage Examples

Basic Usage

The following code demonstrates how to use the model for inference on a sample of the Common Voice dataset:

#!/usr/bin/env python3
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torchaudio
import torch

# load model & processor
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-pl")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-pl")

# load dataset
ds = load_dataset("common_voice", "pl", split="validation[:1%]")

# common voice does not match target sampling rate
common_voice_sample_rate = 48000
target_sample_rate = 16000

resampler = torchaudio.transforms.Resample(common_voice_sample_rate, target_sample_rate)


# define mapping fn to read in sound file and resample
def map_to_array(batch):
    speech, _ = torchaudio.load(batch["path"])
    speech = resampler(speech)
    batch["speech"] = speech[0]
    return batch


# load all audio files
ds = ds.map(map_to_array)

# run inference on the first 5 data samples
inputs = processor(ds[:5]["speech"], sampling_rate=target_sample_rate, return_tensors="pt", padding=True)

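# inference (gradients are not needed for decoding)
with torch.no_grad():
    logits = model(**inputs).logits

# greedy decoding: pick the most likely token id for each audio frame
predicted_ids = torch.argmax(logits, dim=-1)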

print(processor.batch_decode(predicted_ids))
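The last two steps of the example above (per-frame argmax followed by `batch_decode`) amount to greedy CTC decoding. The pure-Python sketch below illustrates the idea on a toy input: merge consecutive repeated ids, drop the blank id, then map the remaining ids to characters. The vocabulary, the blank id of 0, and the `|` word delimiter are illustrative assumptions that follow the usual Wav2Vec2 tokenizer convention; they are not read from this model's actual vocabulary file.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Greedy CTC decoding of one sequence of per-frame token ids."""
    # 1. Merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop the blank id, which only separates repeated characters.
    chars = [id_to_token[i] for i in collapsed if i != blank_id]
    # 3. Join characters and turn the word delimiter into spaces.
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary (not the real model vocabulary).
vocab = {0: "<pad>", 1: "|", 2: "t", 3: "a", 4: "k"}

# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would give for one clip.
frames = [2, 2, 0, 3, 3, 3, 0, 0, 4, 1, 1, 2, 0, 3, 4]

print(ctc_greedy_decode(frames, vocab))  # prints "tak tak"
```

Because repeats are merged before blanks are removed, a blank between two identical ids is what lets a doubled letter survive decoding.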

📚 Documentation

Paper: VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Authors: Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, Emmanuel Dupoux from Facebook AI

See the official website for more information.

📄 License

This model is released under the cc-by-nc-4.0 license.

Property	Details
Model Type	Wav2Vec2-Base-VoxPopuli-Finetuned
Training Data	Transcribed Polish data from VoxPopuli corpus
License	cc-by-nc-4.0
