wav2vec2-large-xlsr-53-swedish Open Source Model - Supports 16kHz, Accurately Identifies Swedish Speech

Wav2vec2 Large Xlsr 53 Swedish

Developed by KBLab

A Swedish automatic speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53; it expects audio input sampled at 16 kHz

Speech Recognition Other Open Source License: Apache-2.0 #Swedish speech recognition #Low word error rate (WER 14.3%) #XLSR-53 fine-tuning

Downloads 30.51k

Release Time : 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model for Swedish, built on the large XLSR-53 model and fine-tuned on the NST Swedish Dictation corpus and the Common Voice dataset.

Model Features

High-performance Swedish recognition

Achieves a 14.3% word error rate and 4.93% character error rate on the Common Voice Swedish test set

Multi-stage training

Trained in three stages: continued pre-training on Swedish radio speech, fine-tuning on NST Swedish Dictation together with Common Voice, and final fine-tuning on Common Voice alone

No language model required

Can be used directly without additional language model support

Model Capabilities

Swedish speech recognition

Audio-to-text conversion

Speech processing

Use Cases

Speech transcription

Broadcast content transcription

Automatically transcribe Swedish radio programs into text

Voice command recognition

Recognize Swedish voice commands

Speech assistive technology

Accessibility applications

Provide real-time captioning services for the hearing impaired

🚀 Wav2Vec2-Large-XLSR-53-Swedish

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Swedish and can be used for automatic speech recognition.

🚀 Quick Start

The model was fine-tuned from facebook/wav2vec2-large-xlsr-53 on NST Swedish Dictation and Common Voice. When using it, make sure that your speech input is sampled at 16 kHz.
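The 16 kHz requirement can be checked before any audio reaches the model. The following is a minimal sketch using only Python's standard `wave` module; it assumes uncompressed PCM WAV input, and the helper names `wav_sample_rate` and `needs_resampling` are illustrative, not part of the model's API.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # the input rate the model expects


def wav_sample_rate(path):
    """Return the sampling rate (in Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the file's sampling rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

If `needs_resampling` returns True, the audio should be resampled (for example with `torchaudio.transforms.Resample`, as in the usage examples) before it is passed to the processor.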

⚠️ Important Note

We recommend using our newer model wav2vec2-large-voxrex-swedish for the best performance.

✨ Features

Datasets: The model is trained on datasets such as common_voice and KTH/nst.
Metrics: It is evaluated using metrics like wer (Word Error Rate) and cer (Character Error Rate).
License: The model is released under the apache-2.0 license.

Property	Details
Model Type	Fine-tuned XLSR Wav2Vec2 for Swedish
Training Data	NST Swedish Dictation, Common Voice

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "sv-SE", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("KBLab/wav2vec2-large-xlsr-53-swedish")
model = Wav2Vec2ForCTC.from_pretrained("KBLab/wav2vec2-large-xlsr-53-swedish")

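# Resample from 48 kHz (the rate this script assumes for the Common Voice clips)
# to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)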

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()

    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
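The `torch.argmax` plus `batch_decode` step above is greedy CTC decoding: take the most likely token per audio frame, collapse consecutive repeats, drop the blank token, and join what remains. The sketch below illustrates that idea in plain Python. The vocabulary, the blank id of 0, and the `|` word delimiter are toy assumptions for illustration; the real mapping lives in the model's tokenizer, whose behaviour may differ in details.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the remaining ids to text."""
    tokens = []
    previous = None
    for idx in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if idx != previous and idx != blank_id:
            tokens.append(id_to_token[idx])
        previous = idx
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Toy vocabulary (hypothetical): id 0 is the CTC blank, "|" separates words.
toy_vocab = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "j", 5: "d", 6: "å"}

# Per-frame argmax ids for one clip, as torch.argmax(logits, dim=-1) would give.
frame_ids = [2, 2, 0, 3, 3, 4, 0, 1, 1, 5, 0, 6, 6]
print(ctc_greedy_decode(frame_ids, toy_vocab))  # hej då
```

Because the transcript is read straight off the per-frame predictions, no external language model is involved at this step.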

Advanced Usage

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "sv-SE", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("KBLab/wav2vec2-large-xlsr-53-swedish")
model = Wav2Vec2ForCTC.from_pretrained("KBLab/wav2vec2-large-xlsr-53-swedish")
model.to("cuda")

chars_to_ignore_regex = '[,?.!\\-;:"“]'
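# Resample from 48 kHz (the rate this script assumes for the Common Voice clips)
# to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)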

# Preprocessing the datasets.
# Strip punctuation, lowercase the reference text, and read the audio files as arrays.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()

    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and decode the predictions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)

    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
print("CER: {:2f}".format(100 * wer.compute(predictions=[" ".join(list(entry)) for entry in result["pred_strings"]], references=[" ".join(list(entry)) for entry in result["sentence"]])))

Evaluation Results

WER: 14.298610% CER: 4.925294%
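For readers unfamiliar with these metrics, the sketch below shows how such figures can be computed: the total edit distance between reference and prediction divided by the total reference length, over words for WER and over non-space characters for CER. It is a plain-Python illustration under that assumption, not the code behind the `wer` metric used in the script; the function names are illustrative, and only the punctuation pattern is taken from the evaluation script.

```python
import re

CHARS_TO_IGNORE = '[,?.!\\-;:"“]'  # same pattern as in the evaluation script


def normalize(text):
    """Strip the ignored punctuation and lowercase, as the script does."""
    return re.sub(CHARS_TO_IGNORE, "", text).lower()


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        current = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def error_rate(references, predictions, tokenize):
    """Total edit errors divided by total reference tokens."""
    errors = sum(edit_distance(tokenize(r), tokenize(p))
                 for r, p in zip(references, predictions))
    total = sum(len(tokenize(r)) for r in references)
    return errors / total


def word_error_rate(references, predictions):
    return error_rate(references, predictions, str.split)


def char_error_rate(references, predictions):
    # Whitespace is not counted, mirroring the script's space-joined characters.
    return error_rate(references, predictions, lambda s: list("".join(s.split())))


refs = [normalize("Hej, världen!")]
preds = ["hej värden"]
print(word_error_rate(refs, preds))  # 0.5 (one of two words is wrong)
print(char_error_rate(refs, preds))  # 0.1 (one of ten characters is missing)
```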

🔧 Technical Details

Training Process

First, the XLSR model was further pre-trained for 50 epochs on a corpus of 1000 hours of spoken Swedish from various radio stations. Second, it was fine-tuned on NST Swedish Dictation together with Common Voice. Finally, it was fine-tuned on the Common Voice dataset alone. The Fairseq scripts were used.

📄 License

This model is licensed under the apache-2.0 license.
