wav2vec2-base-voxpopuli-sv Swedish Speech Recognition Model - Open-source Model for Precise Swedish Speech Recognition

Wav2vec2 Base Voxpopuli Sv Swedish

Developed by KBLab

A Swedish speech recognition model fine-tuned using NST and Common Voice data, based on Facebook's VoxPopuli-sv base model.

Speech Recognition

Transformers

#Swedish speech recognition #Low WER #No language model dependency

Downloads 38

Release Time: 3/2/2022

Model Overview

This model is a Wav2vec 2.0 model for Swedish automatic speech recognition (ASR), fine-tuned on the NST Swedish ASR database and Common Voice dataset.

Model Features

High-performance Swedish recognition

Without a language model, achieves 5.62% WER on the NST + Common Voice test set (2% of total sentences) and 19.15% WER on the Common Voice test set.

Multi-dataset training

Fine-tuned using the NST Swedish ASR database and Common Voice dataset.

No language model required

Can be used directly without additional language model support.

Model Capabilities

Swedish speech recognition

16kHz audio processing

Use Cases

Speech-to-text

Swedish speech transcription

Convert Swedish speech content into text

Achieves a 5.62% word error rate on the NST + Common Voice test set without a language model

Voice assistant

Speech recognition component for Swedish voice assistant applications

🚀 Wav2vec 2.0 base-voxpopuli-sv-swedish

This is a fine-tuned version of Facebook's VoxPopuli-sv base model, trained on NST and Common Voice data. Without a language model, the word error rate (WER) is 5.62% on the NST + Common Voice test set (2% of total sentences) and 19.15% on the Common Voice test set.

🚀 Quick Start

When using this model, ensure that your speech input is sampled at 16kHz.
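
Before transcribing your own recordings, it can help to confirm their sample rate. The following is a minimal sketch using only Python's standard `wave` module; it assumes uncompressed PCM WAV input, and the helper names `wav_sample_rate` and `needs_resampling` are illustrative, not part of the model's API.

```python
import wave

# The rate the model expects, per the note above.
TARGET_SAMPLE_RATE = 16_000


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

If `needs_resampling` returns True, resample the audio to 16 kHz before passing it to the processor, for example with `torchaudio.transforms.Resample` as in the usage example.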

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
# Load a small slice of the Swedish Common Voice test split.
test_dataset = load_dataset("common_voice", "sv-SE", split="test[:2%]")
# Load the fine-tuned processor and model.
processor = Wav2Vec2Processor.from_pretrained("KBLab/wav2vec2-base-voxpopuli-sv-swedish")
model = Wav2Vec2ForCTC.from_pretrained("KBLab/wav2vec2-base-voxpopuli-sv-swedish")
# The resampler assumes 48 kHz source audio and converts it to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)
# Preprocess the dataset: read each audio file as an array and resample it.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch
test_dataset = test_dataset.map(speech_file_to_array_fn)
# Pad the first two examples into a single batch of model inputs.
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
# Greedy decoding: pick the most likely token at each frame, then decode to text.
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
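
The `torch.argmax` and `batch_decode` steps in the example amount to greedy CTC decoding, which is why no language model is needed. The sketch below illustrates the idea in plain Python: collapse repeated frame labels, drop the blank token, and turn the word delimiter into spaces. The toy vocabulary and the function name `ctc_greedy_decode` are hypothetical; the model's real vocabulary ships with its processor.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id, word_delimiter="|"):
    """Collapse repeated frame labels, drop blanks, and join tokens into text."""
    tokens = []
    previous_id = None
    for token_id in frame_ids:
        # Emit a token only when the label changes and is not the blank.
        if token_id != previous_id and token_id != blank_id:
            tokens.append(id_to_token[token_id])
        previous_id = token_id
    # The delimiter token marks word boundaries in the output.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary (assumption): id 0 is the blank, "|" separates words.
toy_vocab = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "j"}
# One predicted id per audio frame, as torch.argmax would produce for one utterance.
frame_ids = [2, 2, 0, 3, 3, 3, 0, 4, 1, 1, 2, 0, 3, 4]
print(ctc_greedy_decode(frame_ids, toy_vocab, blank_id=0))  # prints "hej hej"
```

A blank between two identical labels keeps them as separate characters, which is how doubled letters survive the collapse step.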

📚 Documentation

Dataset Information

Property	Details
Datasets	common_voice, NST Swedish ASR Database
Metrics	wer
Tags	audio, automatic-speech-recognition, speech, voxpopuli

Model Evaluation Results

The model named "Wav2vec 2.0 base VoxPopuli-sv swedish" has the following evaluation results:

For the NST Swedish ASR Database, the test WER is 5.62%.
For the Common Voice dataset (sv-SE), the test WER is 19.15%.
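
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. The following is a minimal sketch for a single sentence pair; the function name `word_error_rate` is illustrative, and this is not necessarily the exact evaluation script behind the figures above, which would typically aggregate errors over the whole test set.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous_row[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One dropped word out of four reference words.
print(word_error_rate("jag tycker om kaffe", "jag tycker kaffe"))  # prints 0.25
```

Multiply the result by 100 to compare it with the percentages reported above.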

📄 License

This model is released under the cc-by-nc-4.0 license.
