Open-source model wav2vec2-large-xls-r-1b-Swedish - Free deployment for Swedish speech-to-text conversion

Wav2vec2 Large Xls R 1b Swedish

Developed by kingabzpro

This automatic speech recognition model is facebook/wav2vec2-xls-r-1b fine-tuned on the Common Voice Swedish dataset. It supports Swedish speech-to-text tasks.

Speech Recognition

Transformers

Open Source License: Apache-2.0 #Swedish speech recognition #Low word error rate #Multi-scenario robustness

Downloads 844

Release Time: 3/2/2022

Model Overview

An automatic speech recognition model for Swedish, built on the wav2vec2-xls-r-1b architecture and fine-tuned on the Common Voice 8.0 dataset.

Model Features

High-performance Swedish Recognition

Achieves a word error rate (WER) of 14.04% and a character error rate (CER) of 4.86% on the Common Voice Swedish test set when decoding with a language model.

Fine-tuned on Large Model

Fine-tuned on the 1-billion-parameter wav2vec2-xls-r-1b model, featuring powerful speech feature extraction capabilities.

Supports Language Model Integration

Can be combined with a language model to further improve recognition accuracy. In the reported evaluation results, WER drops from 18.44% without a language model to 14.04% with one, a reduction of about 4.4 percentage points.
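The size of the language-model gain can be expressed either as an absolute or as a relative reduction. The short sketch below works through both, using only the WER figures reported in the Evaluation Results section of this page; the variable names are illustrative.

```python
# WER values (in percent) as reported in the Evaluation Results section.
wer_without_lm = 18.44
wer_with_lm = 14.04

# Absolute reduction, in percentage points.
absolute_reduction = wer_without_lm - wer_with_lm

# Relative reduction, as a percentage of the no-LM error rate.
relative_reduction = absolute_reduction / wer_without_lm * 100

print(f"absolute: {absolute_reduction:.2f} points")
print(f"relative: {relative_reduction:.2f} %")
```

The two numbers answer different questions: the absolute figure says how many fewer words per hundred are wrong, while the relative figure says what share of the original errors the language model removes.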

Model Capabilities

Swedish speech recognition

Speech-to-text

Long audio processing (supports chunked processing)

Use Cases

Speech Transcription

Swedish Speech Content Transcription

Convert Swedish speech content into text format

Achieves 14.04% WER on the Common Voice test set (with a language model)

Voice Assistants

Swedish Voice Command Recognition

Used for command recognition in Swedish voice assistant systems

Achieves 29.69% WER on the Robust Speech Event dev data

🚀 wav2vec2-large-xls-r-1b-Swedish

This model is a fine-tuned version of facebook/wav2vec2-xls-r-1b on the common_voice dataset. It is designed for automatic speech recognition of Swedish.

📚 Documentation

Model Information

Property	Details
Model Type	wav2vec2-large-xls-r-1b-Swedish
Base Model	facebook/wav2vec2-xls-r-1b
Datasets	mozilla-foundation/common_voice_8_0
Metrics	wer, cer
License	apache-2.0
Tags	automatic-speech-recognition, robust-speech-event, hf-asr-leaderboard

Evaluation Results

This model achieves the following results on the evaluation set:

Without LM

Loss: 0.3370
Wer: 18.44
Cer: 5.75

With LM

Loss: 0.3370
Wer: 14.04
Cer: 4.86
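For readers unfamiliar with the two metrics, the sketch below shows how WER and CER are commonly defined: the Levenshtein edit distance between reference and hypothesis, counted over words or characters, divided by the reference length. This is an illustrative pure-Python version, not the implementation used by eval.py; the function names and the example sentences are made up for the illustration, and a non-empty reference is assumed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a fraction; assumes a non-empty reference."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical reference/hypothesis pair, for illustration only.
reference = "jag tycker om kaffe"
hypothesis = "jag tycker om kaffet"
print(wer(reference, hypothesis), cer(reference, hypothesis))
```

One wrong word out of four gives a high WER, while the same single-character slip barely moves the CER, which is why CER is typically much lower than WER for the same model.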

Evaluation Commands

To evaluate on mozilla-foundation/common_voice_8_0 with split test:

python eval.py --model_id kingabzpro/wav2vec2-large-xls-r-1b-Swedish --dataset mozilla-foundation/common_voice_8_0 --config sv-SE --split test

To evaluate on speech-recognition-community-v2/dev_data:

python eval.py --model_id kingabzpro/wav2vec2-large-xls-r-1b-Swedish --dataset speech-recognition-community-v2/dev_data --config sv --split validation --chunk_length_s 5.0 --stride_length_s 1.0
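The `--chunk_length_s 5.0 --stride_length_s 1.0` options in the second command split long recordings into overlapping windows. The sketch below illustrates one plausible way such windows are laid out, assuming a 16 kHz sample rate and the same stride on both sides of each chunk. The function `chunk_bounds` is hypothetical and simplified; it is not the code eval.py or the Transformers pipeline actually runs.

```python
def chunk_bounds(n_samples, sample_rate=16_000, chunk_length_s=5.0, stride_length_s=1.0):
    """Return (start, end) sample indices of overlapping chunks.

    Assumption: each chunk overlaps its neighbours by stride_length_s on both
    sides, so consecutive chunks advance by chunk - 2 * stride samples.
    """
    chunk = int(chunk_length_s * sample_rate)
    stride = int(stride_length_s * sample_rate)
    step = chunk - 2 * stride
    if step <= 0:
        raise ValueError("chunk length must exceed twice the stride length")

    bounds = []
    start = 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds


# A 12-second clip at 16 kHz.
for start, end in chunk_bounds(12 * 16_000):
    print(f"{start / 16_000:.1f}s -> {end / 16_000:.1f}s")
```

The overlap gives the model context on both sides of every chunk boundary, so predictions near the edges, which tend to be less reliable, can be discarded when the chunk outputs are stitched together.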

Inference With LM

import torch
import torchaudio.functional as F
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor

model_id = "kingabzpro/wav2vec2-large-xls-r-1b-Swedish"

# Stream one test sample; Common Voice requires accepting its terms, hence the auth token.
sample_iter = iter(load_dataset("mozilla-foundation/common_voice_8_0", "sv-SE", split="test", streaming=True, use_auth_token=True))
sample = next(sample_iter)

# Common Voice audio is 48 kHz; the model expects 16 kHz input.
resampled_audio = F.resample(torch.tensor(sample["audio"]["array"]), 48_000, 16_000).numpy()

model = AutoModelForCTC.from_pretrained(model_id)
# Decoding with the LM needs the LM-backed processor (pyctcdecode and kenlm installed).
processor = AutoProcessor.from_pretrained(model_id)

input_values = processor(resampled_audio, sampling_rate=16_000, return_tensors="pt").input_values

with torch.no_grad():
    logits = model(input_values).logits

# batch_decode on the LM-backed processor returns an object whose .text holds the transcriptions.
transcription = processor.batch_decode(logits.numpy()).text
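For comparison with the "Without LM" figures, decoding a CTC model without a language model typically reduces to greedy decoding: take the most likely token per frame, collapse consecutive repeats, then drop the blank token. The sketch below shows that collapse step in plain Python. The vocabulary and frame sequence are hypothetical, and the real tokenizer also handles word delimiters and special tokens that are omitted here.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions and remove blanks (greedy CTC decoding)."""
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if idx != prev and idx != blank_id:
            chars.append(id_to_char[idx])
        prev = idx
    return "".join(chars)


# Hypothetical vocabulary; id 0 plays the role of the CTC blank token.
vocab = {0: "<pad>", 1: "h", 2: "e", 3: "j", 4: " "}

# Per-frame argmax ids, as would be obtained from logits.argmax(dim=-1).
frames = [1, 1, 0, 2, 2, 0, 3, 3, 3]
print(ctc_greedy_decode(frames, vocab))
```

A language-model decoder replaces this per-frame argmax with a beam search that also scores candidate word sequences, which is where the WER improvement reported on this page comes from.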

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 7.5e-05
train_batch_size: 64
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 4
total_train_batch_size: 256
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 1000
num_epochs: 50
mixed_precision_training: Native AMP
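Two of these settings follow from simple arithmetic: the total batch size is the per-step batch size times the gradient accumulation steps, and a linear scheduler ramps the learning rate up over the warmup steps and then decays it linearly to zero. The sketch below illustrates both. The total of 2250 optimizer steps is an assumption inferred from the training results table (500 steps correspond to about 11.11 epochs, so roughly 45 steps per epoch over 50 epochs), a single training device is assumed, and `linear_schedule_lr` is a hypothetical helper rather than the trainer's own code.

```python
TRAIN_BATCH_SIZE = 64
GRADIENT_ACCUMULATION_STEPS = 4

# Effective batch size per optimizer update (assuming a single device).
total_train_batch_size = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS


def linear_schedule_lr(step, peak_lr=7.5e-5, warmup_steps=1000, total_steps=2250):
    """Linear warmup to peak_lr, then linear decay to zero.

    total_steps=2250 is an assumption inferred from the results table,
    not a value stated in the hyperparameter list.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


print(total_train_batch_size)
for step in (0, 500, 1000, 1500, 2000, 2250):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the warmup phase covers nearly half of training, so the peak learning rate is reached only around step 1000.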

Training results

Training Loss	Epoch	Step	Validation Loss	WER (fraction)	CER (fraction)
3.1562	11.11	500	0.4830	0.3729	0.1169
0.5655	22.22	1000	0.3553	0.2381	0.0743
0.3376	33.33	1500	0.3359	0.2179	0.0696
0.2419	44.44	2000	0.3232	0.1844	0.0575

Framework versions

Transformers 4.17.0.dev0
Pytorch 1.10.2+cu102
Datasets 1.18.2.dev0
Tokenizers 0.11.0

Model Index

Name: wav2vec2-large-xls-r-1b-Swedish
Results:
- Task:
    Type: automatic-speech-recognition
    Name: Speech Recognition
  Dataset:
    Name: Common Voice sv-SE
    Type: mozilla-foundation/common_voice_8_0
    Args: sv-SE
  Metrics:
  - Type: wer
    Value: 14.04
    Name: Test WER With LM
  - Type: cer
    Value: 4.86
    Name: Test CER With LM
- Task:
    Type: automatic-speech-recognition
    Name: Automatic Speech Recognition
  Dataset:
    Name: Robust Speech Event - Dev Data
    Type: speech-recognition-community-v2/dev_data
    Args: sv
  Metrics:
  - Type: wer
    Value: 29.69
    Name: Test WER
  - Type: cer
    Value: 12.59
    Name: Test CER
