Wav2vec2 Swedish Common Voice Open-Source Speech Recognition Model

Wav2vec2 Swedish Common Voice

Developed by birgermoell

This speech recognition model is a version of facebook/wav2vec2-large-xlsr-53 fine-tuned on the Swedish Common Voice dataset, using 402 MB of training data.

Category: Speech Recognition. License: Apache-2.0 (open source). Tags: #Swedish speech recognition, #XLSR fine-tuning, #Low-resource optimization

Downloads 24

Release Time: 3/2/2022

Model Overview

This model performs Swedish automatic speech recognition (ASR) and expects audio input sampled at 16 kHz.

Model Features

Swedish Optimization

Specifically fine-tuned for Swedish, trained on the Common Voice Swedish dataset

Based on XLSR Model

Built upon the powerful wav2vec2-large-xlsr-53 base model

Lightweight Training

Fine-tuned using only 402MB of training data

Model Capabilities

Swedish speech recognition

16kHz audio processing

Use Cases

Speech-to-Text

Swedish Speech Transcription

Convert Swedish speech to text

Achieves a WER of 36.91% on the Common Voice test set

🚀 Wav2Vec2-Large-XLSR-53-Swedish

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Swedish, trained on the Common Voice dataset. The training data size is 402 MB. Make sure your speech input is sampled at 16 kHz when using this model.

📋 Model Information

Property	Details
Language	Swedish
Datasets	Common Voice
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License	apache-2.0
Model Name	XLSR Wav2Vec2 Swedish by Birger Moell
Task	Speech Recognition (automatic-speech-recognition)
Dataset Used	Common Voice sv-SE
Test WER	36.91

🚀 Quick Start

This is a Swedish fine-tuned version of the facebook/wav2vec2-large-xlsr-53 model, trained on the Common Voice dataset. The training data size is 402 MB. When using this model, make sure that your speech input is sampled at 16 kHz.
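Because the model expects 16 kHz input, it can help to check a file's sampling rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module. It assumes uncompressed PCM WAV files (Common Voice clips in other formats would need a different reader), and the helper names `wav_sample_rate` and `needs_resampling` are illustrative, not part of the model or of any library.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model card asks for, in Hz


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file's sampling rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

If `needs_resampling` returns True, the audio should be resampled (for example with `torchaudio.transforms.Resample`, as in the usage example) before it is passed to the processor.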

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "sv-SE", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("birgermoell/wav2vec2-swedish-common-voice")
model = Wav2Vec2ForCTC.from_pretrained("birgermoell/wav2vec2-swedish-common-voice")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
	speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
	logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
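The last two steps above, `torch.argmax` followed by `processor.batch_decode`, amount to greedy CTC decoding: pick the most likely token per audio frame, collapse consecutive repeats, drop the blank token, and turn word delimiters into spaces. The sketch below illustrates that idea in plain Python. The toy vocabulary, the blank id `0`, and the `|` delimiter are assumptions chosen for illustration; the real model ships its own vocabulary file.

```python
def ctc_greedy_decode(ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the rest to text."""
    tokens = []
    previous = None
    for idx in ids:
        # Emit a token only when the id changes and is not the blank
        if idx != previous and idx != blank_id:
            tokens.append(id_to_token[idx])
        previous = idx
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary, not the model's actual one
toy_vocab = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "j"}
# One predicted id per audio frame, as argmax over the logits would give
frame_ids = [2, 2, 0, 3, 3, 3, 0, 0, 4, 1, 1, 2, 0, 3, 4]

print(ctc_greedy_decode(frame_ids, toy_vocab))  # prints "hej hej"
```

The blank token is what lets CTC output a doubled letter: `[2, 0, 2]` decodes to "hh", while `[2, 2]` decodes to a single "h".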

📊 Evaluation

The model can be evaluated on the Swedish test data of Common Voice as follows:

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "sv-SE", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("birgermoell/wav2vec2-swedish-common-voice")
model = Wav2Vec2ForCTC.from_pretrained("birgermoell/wav2vec2-swedish-common-voice")
model.to("cuda")

chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
	batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
	speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Running inference.
# Transcribe each batch and store the decoded predictions
def evaluate(batch):
	inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

	with torch.no_grad():
		logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

	pred_ids = torch.argmax(logits, dim=-1)
	batch["pred_strings"] = processor.batch_decode(pred_ids)
	return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
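The reference sentences in the evaluation script are normalized before scoring: a fixed set of punctuation characters is removed and the text is lower-cased. This step can be tried on its own with the standard `re` module. The sketch below uses the same character class as the script, written here as a raw string; the function name `normalize_sentence` and the sample sentence are illustrative.

```python
import re

# Same character class as in the evaluation script, as a raw string
CHARS_TO_IGNORE = r'[\,\?\.\!\-\;\:\"\“]'


def normalize_sentence(sentence):
    """Strip the ignored punctuation and lower-case the sentence."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


print(normalize_sentence("Hej, hur mår du?"))  # prints "hej hur mår du"
```

Note that hyphens are deleted rather than replaced with a space, so a hyphenated word such as "e-post" becomes "epost".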

Test Result: 36.91 %

🔨 Training

The Common Voice train and validation datasets were used for training. The training script can be found here.
