Wav2vec 2.0 large VoxRex Swedish (C)
This is a fine-tuned version of KB's VoxRex large model. It was fine-tuned on Swedish radio broadcasts, NST, and Common Voice data to improve automatic speech recognition for Swedish.
Quick Start
When using this model, make sure that your speech input is sampled at 16kHz.
Update 2022-01-10: Updated to the VoxRex-C version.
Update 2022-05-16: The related paper is now available: https://arxiv.org/abs/2205.03026.
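As noted above, the model expects 16 kHz input. The following is a minimal resampling sketch with torchaudio; "example.wav" is a placeholder path.

import torchaudio

# Load an audio file and resample it to the 16 kHz expected by the model.
speech, sampling_rate = torchaudio.load("example.wav")  # placeholder path
if sampling_rate != 16_000:
    speech = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16_000)(speech)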
Features
- Fine-tuned from KB's VoxRex large model.
- Utilizes Swedish radio broadcasts, NST, and Common Voice data.
- Achieves low Word Error Rate (WER) in speech recognition tasks for the Swedish language.
Installation
No specific installation steps are provided. The usage example below relies on the torch, torchaudio, transformers, and datasets Python packages.
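They can typically be installed with pip (a minimal sketch; exact versions are not pinned by the original card):

pip install torch torchaudio transformers datasets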
Usage Examples
Basic Usage
The model can be used directly (without a language model) as follows:
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load 2% of the Swedish Common Voice test split.
test_dataset = load_dataset("common_voice", "sv-SE", split="test[:2%]")

# Load the processor (feature extractor + tokenizer) and the model.
processor = Wav2Vec2Processor.from_pretrained("KBLab/wav2vec2-large-voxrex-swedish")
model = Wav2Vec2ForCTC.from_pretrained("KBLab/wav2vec2-large-voxrex-swedish")

# Common Voice audio is 48 kHz; the model expects 16 kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Read the audio files as arrays and resample them.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy CTC decoding: pick the most likely token at each frame.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
Documentation
Performance

*Chart shows performance without the additional 20k steps of Common Voice fine-tuning
Evaluation without a language model gives the following results (a minimal WER computation sketch follows the list):
- WER for the NST + Common Voice test set (2% of total sentences): 2.5%.
- WER for the Common Voice test set: 8.49% directly, and 7.37% with a 4-gram language model.
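As an illustration, WER compares the model's transcriptions against reference sentences; a sketch using the jiwer package follows. This is not necessarily the evaluation setup used for the numbers above, and the example sentences are made up.

import jiwer

# Hypothetical reference sentences and model transcriptions.
references = ["det här är ett exempel", "hon talar svenska"]
predictions = ["det här är ett exempel", "hon talar svensk"]

# Corpus-level word error rate (0.0 means a perfect match).
print(jiwer.wer(references, predictions))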
Training
This model has been fine-tuned for 120,000 updates on NST + Common Voice and then for an additional 20,000 updates on Common Voice only. The additional fine-tuning on Common Voice hurts performance somewhat on the NST + Common Voice test set and, unsurprisingly, improves it on the Common Voice test set. It seems to perform generally better though [citation needed].
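The training code is not part of this card. Purely as an illustrative sketch (not the authors' actual recipe), the two-stage schedule could be expressed with Hugging Face TrainingArguments as follows; only the step counts come from the description above, the output directories are hypothetical, and dataset preparation, the CTC data collator, and the Trainer call itself are omitted. Recent transformers versions also require the accelerate package for this.

from transformers import TrainingArguments

# Stage 1: 120,000 updates on NST + Common Voice (step count from the card;
# everything else is illustrative).
stage1_args = TrainingArguments(output_dir="voxrex-swedish-stage1", max_steps=120_000)

# Stage 2: 20,000 additional updates on Common Voice only.
stage2_args = TrainingArguments(output_dir="voxrex-swedish-stage2", max_steps=20_000)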

License
This model is licensed under CC0 1.0.
๐ Information Table
Property |
Details |
Model Type |
Wav2vec 2.0 large VoxRex Swedish (C) |
Training Datasets |
common_voice, NST_Swedish_ASR_Database, P4 |
Evaluation Metrics |
wer |
Tags |
audio, automatic - speech - recognition, speech, hf - asr - leaderboard |
License |
cc0 - 1.0 |
Citation
https://arxiv.org/abs/2205.03026
@misc{malmsten2022hearing,
  title={Hearing voices at the National Library -- a speech corpus and acoustic model for the Swedish language},
  author={Martin Malmsten and Chris Haffenden and Love Börjeson},
  year={2022},
  eprint={2205.03026},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}