wav2vec2-base-10k-voxpopuli-ft-fi: Open Source Model for Finnish Automatic Speech Recognition

Wav2vec2 Base 10k Voxpopuli Ft Fi

Developed by facebook

An automatic speech recognition model based on Facebook's Wav2Vec2 base model, pre-trained on a 10K unlabeled subset of the VoxPopuli corpus and fine-tuned on Finnish transcription data.

Speech Recognition

Transformers

Other #Finnish speech recognition #Multilingual pre-training #Low-resource optimization

Downloads 24

Release Time: 3/2/2022

Model Overview

This model is an automatic speech recognition (ASR) system for Finnish, capable of converting Finnish speech into text.

Model Features

Based on VoxPopuli Corpus

Pre-trained on the 10K unlabeled subset of the large-scale multilingual VoxPopuli speech corpus, so the model learns general speech representations before any Finnish fine-tuning.

Optimized for Finnish

Specifically fine-tuned for Finnish, improving recognition accuracy for Finnish speech.

End-to-End Speech Recognition

Directly generates text output from raw audio input, simplifying the speech recognition process.

Model Capabilities

Finnish speech recognition

Audio to text

Speech transcription

Use Cases

Speech Transcription

Automated Meeting Minutes

Automatically convert Finnish meeting recordings into text transcripts
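Meeting recordings are usually much longer than the short clips used in the usage example, so in practice the audio is often split into overlapping windows that are transcribed one at a time. The following is a minimal sketch of that idea, not something the model card prescribes: the function name `chunk_bounds`, the 20-second window, and the 2-second overlap are all illustrative assumptions.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_s=20.0, overlap_s=2.0):
    """Return (start, end) sample indices of overlapping windows.

    Hypothetical helper: the window and overlap lengths are assumptions
    chosen for illustration, not values taken from the model card.
    """
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the recording
        start += step
    return bounds


# A 50-second recording at 16 kHz is covered by three overlapping windows.
windows = chunk_bounds(50 * 16000)
print(windows)
```

Each window could then be passed to the processor and model separately. How the overlapping transcripts are merged is left open here.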

Voice Assistants

Provide speech recognition capabilities for Finnish voice assistants

Accessibility Technology

Real-time Captioning

Generate real-time captions for Finnish video content

🚀 Wav2Vec2-Base-VoxPopuli-Finetuned

This is a fine-tuned model based on Facebook's Wav2Vec2. The base model was pre-trained on the 10K unlabeled subset of the VoxPopuli corpus and then fine-tuned on the Finnish transcribed data (see Table 1 of the paper for details).

✨ Features

The model is designed for audio processing and automatic speech recognition.
It combines the Wav2Vec2 architecture with pre-training on the VoxPopuli corpus.

📚 Documentation

Paper

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Authors

Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, Emmanuel Dupoux from Facebook AI

More Information

See the official website here for additional details.

💻 Usage Examples

Basic Usage

The following code demonstrates how to use the model for inference on a sample of the Common Voice dataset:

#!/usr/bin/env python3
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torchaudio
import torch

# load model & processor
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-fi")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-fi")

# load dataset
ds = load_dataset("common_voice", "fi", split="validation[:1%]")

# common voice does not match target sampling rate
common_voice_sample_rate = 48000
target_sample_rate = 16000

resampler = torchaudio.transforms.Resample(common_voice_sample_rate, target_sample_rate)


# define mapping fn to read in sound file and resample
def map_to_array(batch):
    speech, _ = torchaudio.load(batch["path"])
    speech = resampler(speech)
    batch["speech"] = speech[0]
    return batch


# load all audio files
ds = ds.map(map_to_array)

# run inference on the first 5 data samples
inputs = processor(ds[:5]["speech"], sampling_rate=target_sample_rate, return_tensors="pt", padding=True)

# inference (gradients are not needed for decoding)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)

print(processor.batch_decode(predicted_ids))
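The last two steps of the example (taking the `argmax` over the logits and calling `batch_decode`) amount to greedy CTC decoding: pick the most likely token per audio frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that collapse on a toy input. The vocabulary, the helper name `ctc_greedy_collapse`, and the choice of id 0 as the blank are assumptions for illustration and are not read from the model's actual tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, blank_id=0):
    """Merge consecutive duplicate ids, then remove the blank id."""
    collapsed = [key for key, _ in groupby(ids)]
    return [i for i in collapsed if i != blank_id]


# Hypothetical toy vocabulary; the real one comes from the processor's tokenizer.
vocab = {0: "<pad>", 1: "m", 2: "o", 3: "i"}

# One predicted id per audio frame, as argmax over the logits would produce.
frame_ids = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]

token_ids = ctc_greedy_collapse(frame_ids)
text = "".join(vocab[i] for i in token_ids)
print(text)  # moi
```

Note that a blank between two identical ids keeps them separate, which is how CTC can emit doubled letters.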

📄 License

This model is released under the CC-BY-NC-4.0 license.
