wav2vec2-base-10k-voxpopuli-ft-sl Open-source Automatic Speech Recognition Model

Wav2vec2 Base 10k Voxpopuli Ft Sl

Developed by Facebook

Based on Facebook's Wav2Vec2 base model, pretrained on the 10K-hour unlabeled subset of the VoxPopuli corpus and fine-tuned on transcribed Slovenian speech for automatic speech recognition.

Speech Recognition

Transformers

Tags: Slovenian speech recognition, VoxPopuli pretraining, Multilingual support

Downloads 26

Release Time: 3/2/2022

Model Overview

This model is an automatic speech recognition system optimized for Slovenian, capable of converting speech to text.

Model Features

Multilingual pretraining

Pretrained on unlabeled speech from the multilingual VoxPopuli corpus, so its speech representations draw on many languages rather than Slovenian alone

Slovenian optimization

Specifically fine-tuned for Slovenian, improving recognition accuracy for this language

End-to-end model

Learns speech representations directly from raw audio, eliminating the need for manual feature extraction in traditional speech recognition pipelines
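
To make "directly from raw audio" more concrete, the sketch below estimates how many feature frames the convolutional front end produces for a given number of 16 kHz samples. The kernel and stride values are assumptions taken from the published Wav2Vec2 base configuration (they are not stated on this page), and the helper name `num_output_frames` is hypothetical.

```python
# Assumed Wav2Vec2 base feature-encoder layout: kernel widths and strides of
# its seven unpadded 1-D convolution layers. These values are not stated on
# this page; treat them as an assumption.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Return how many frames remain after the convolution stack.

    Each unpadded layer maps a length L to floor((L - kernel) / stride) + 1.
    Inputs too short for a layer yield zero frames.
    """
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        if length < kernel:
            return 0
        length = (length - kernel) // stride + 1
    return length


if __name__ == "__main__":
    one_second = 16000  # samples in one second of 16 kHz audio
    print(num_output_frames(one_second))
```

Under these assumptions, one second of 16 kHz audio maps to 49 frames, or roughly one frame every 20 ms, and the model predicts one token distribution per frame.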

Model Capabilities

Speech recognition

Audio-to-text conversion

Slovenian language processing

Use Cases

Speech transcription

Automated meeting minutes

Automatically convert Slovenian meeting recordings into written transcripts

Voice assistant development

Provide speech recognition capabilities for Slovenian voice assistants

Accessibility technology

Real-time caption generation

Generate real-time captions for Slovenian video content
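
Meeting recordings and video soundtracks are usually much longer than the short clips in the usage example further down, so one common approach is to transcribe them in overlapping windows. The following is a minimal sketch of the window arithmetic only. The helper `chunk_bounds`, the 20-second window, and the 2-second overlap are hypothetical choices for illustration, not values recommended by the model authors.

```python
from typing import List, Tuple

SAMPLE_RATE = 16000  # sampling rate the usage example feeds to the model


def chunk_bounds(
    num_samples: int,
    window_s: float = 20.0,   # hypothetical window length in seconds
    overlap_s: float = 2.0,   # hypothetical overlap between windows
    sample_rate: int = SAMPLE_RATE,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping windows."""
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    if window <= 0 or step <= 0:
        raise ValueError("window must be positive and longer than the overlap")

    bounds: List[Tuple[int, int]] = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the recording
        start += step
    return bounds


if __name__ == "__main__":
    # A 50-second recording at 16 kHz
    for start, end in chunk_bounds(50 * SAMPLE_RATE):
        print(start / SAMPLE_RATE, end / SAMPLE_RATE)
```

Each window could then be passed to the model separately. How the overlapping transcripts are merged is left open here, since it depends on the application.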

🚀 Wav2Vec2-Base-VoxPopuli-Finetuned

This is Facebook's Wav2Vec2 base model, pretrained on the 10K unlabeled subset of the VoxPopuli corpus and fine-tuned on transcribed Slovenian data (see Table 1 of the paper for more information).

✨ Features

Model Type: A fine-tuned version of Facebook's Wav2Vec2 base model.
Training Data: Pretrained on the 10K unlabeled subset of the VoxPopuli corpus and fine-tuned on Slovenian transcribed data.


Paper: VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Authors: Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, Emmanuel Dupoux from Facebook AI

See the official website for more information.

🚀 Quick Start

The example below runs inference on a small sample of the Slovenian Common Voice validation split.

💻 Usage Examples

Basic Usage

#!/usr/bin/env python3
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torchaudio
import torch

# load model & processor
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-sl")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-sl")

# load dataset
ds = load_dataset("common_voice", "sl", split="validation[:1%]")

# Common Voice audio is sampled at 48 kHz, but the model expects 16 kHz
common_voice_sample_rate = 48000
target_sample_rate = 16000

resampler = torchaudio.transforms.Resample(common_voice_sample_rate, target_sample_rate)


# define mapping fn to read in a sound file and resample it to 16 kHz
def map_to_array(batch):
    speech, _ = torchaudio.load(batch["path"])
    speech = resampler(speech)
    # keep the first channel as a NumPy array so the dataset can store it
    batch["speech"] = speech[0].numpy()
    return batch


# load all audio files
ds = ds.map(map_to_array)

# run inference on the first 5 data samples
inputs = processor(ds[:5]["speech"], sampling_rate=target_sample_rate, return_tensors="pt", padding=True)

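# inference (no gradients are needed)
with torch.no_grad():
    logits = model(**inputs).logits
# greedy decoding: pick the most likely token at each frame
predicted_ids = torch.argmax(logits, dim=-1)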

print(processor.batch_decode(predicted_ids))
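
The `argmax` followed by `batch_decode` above amounts to greedy CTC decoding: consecutive repeated ids are merged and the blank token is dropped. The sketch below illustrates that rule on a toy vocabulary. The vocabulary, the blank id of 0, and the helpers `ctc_greedy_collapse` and `ids_to_text` are made up for illustration and do not reflect the model's real tokenizer.

```python
from typing import Dict, List, Optional

BLANK_ID = 0  # hypothetical blank id for this toy example
TOY_VOCAB: Dict[int, str] = {1: "d", 2: "a", 3: "n", 4: " "}  # made-up vocabulary


def ctc_greedy_collapse(frame_ids: List[int], blank_id: int = BLANK_ID) -> List[int]:
    """Merge consecutive duplicate ids, then drop blanks.

    A blank between two identical ids keeps both, which is how CTC
    represents doubled letters.
    """
    collapsed: List[int] = []
    previous: Optional[int] = None
    for token_id in frame_ids:
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


def ids_to_text(ids: List[int], vocab: Dict[int, str] = TOY_VOCAB) -> str:
    """Map collapsed ids to characters using the toy vocabulary."""
    return "".join(vocab[i] for i in ids)


if __name__ == "__main__":
    # Per-frame argmax ids, as produced by torch.argmax over the logits
    frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
    print(ids_to_text(ctc_greedy_collapse(frames)))
```

With the toy frames above, the repeated ids collapse to `d`, `a`, `n`. A language-model-based decoder could replace the greedy step, but that is outside what this page describes.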

📄 License

This model is released under the CC-BY-NC-4.0 license.
