wav2vec2-large-xls-r-300m-bg Open-source Speech Recognition Model

Developed by anuragshas

An automatic speech recognition model fine-tuned on the Common Voice 8 Bulgarian dataset based on facebook/wav2vec2-xls-r-300m

Speech Recognition

Transformers

License: Apache-2.0

Tags: Bulgarian speech recognition, low word error rate, multi-scenario adaptation

Downloads 1,469

Release Time : 3/2/2022

Model Overview

This is an optimized automatic speech recognition (ASR) model for Bulgarian, based on the XLS-R-300M architecture and fine-tuned on the Mozilla Common Voice 8 dataset.

Model Features

Multi-dataset Evaluation

Comprehensively evaluated on Common Voice 8 and Robust Speech Challenge datasets

High Performance

Achieved 21.195% WER (with language-model decoding) and 4.786% CER on the Common Voice 8 test set

Optimized Training

Trained for 50 epochs, over which validation loss fell to about 0.25 and validation WER to about 0.30

Model Capabilities

Bulgarian speech recognition

Audio-to-text conversion

Long audio processing (supports chunk processing)

Use Cases

Speech Transcription

Voice Memo Transcription

Convert Bulgarian voice memos into searchable text

Expect a word error rate of about 21% (21.195% on the Common Voice 8 test set with language-model decoding)

Voice Assistant

Provide speech recognition capabilities for Bulgarian voice assistants

Speech Analysis

Speech Content Analysis

Analyze Bulgarian speech content to extract key information

🚀 XLS-R-300M - Bulgarian

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the MOZILLA-FOUNDATION/COMMON_VOICE_8_0 - BG dataset. It is designed for automatic speech recognition of Bulgarian speech.

🚀 Quick Start

Evaluation Commands

To evaluate on mozilla-foundation/common_voice_8_0 with the test split:

python eval.py --model_id anuragshas/wav2vec2-large-xls-r-300m-bg --dataset mozilla-foundation/common_voice_8_0 --config bg --split test

To evaluate on speech-recognition-community-v2/dev_data:

python eval.py --model_id anuragshas/wav2vec2-large-xls-r-300m-bg --dataset speech-recognition-community-v2/dev_data --config bg --split validation --chunk_length_s 5.0 --stride_length_s 1.0
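The `--chunk_length_s 5.0` and `--stride_length_s 1.0` flags above control how long recordings are split before recognition. The sketch below is a simplified illustration of overlapping chunking, assuming the stride is applied on both sides of each chunk so that consecutive windows advance by `chunk_length_s - 2 * stride_length_s`. The function name `chunk_windows` is hypothetical, and this is not the evaluation script's actual code.

```python
def chunk_windows(total_s, chunk_length_s=5.0, stride_length_s=1.0):
    """Return (start, end) times in seconds of overlapping chunks.

    Simplified illustration: each chunk carries stride_length_s of
    context on both sides, so windows advance by the remaining middle part.
    """
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must exceed twice stride_length_s")
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# A 12-second recording with the flag values used above.
windows = chunk_windows(12.0)
print(windows)
# [(0.0, 5.0), (3.0, 8.0), (6.0, 11.0), (9.0, 12.0)]
```

The overlapping regions give the model acoustic context at chunk boundaries; the predictions from those regions are typically discarded when the chunk outputs are merged.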

Inference With LM

import torch
import torchaudio.functional as F
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor

model_id = "anuragshas/wav2vec2-large-xls-r-300m-bg"

# Stream a single test sample; use_auth_token=True needs an authenticated Hugging Face login.
sample_iter = iter(load_dataset("mozilla-foundation/common_voice_8_0", "bg", split="test", streaming=True, use_auth_token=True))
sample = next(sample_iter)

# Resample from 48 kHz to the 16 kHz the model expects.
resampled_audio = F.resample(torch.tensor(sample["audio"]["array"]), 48_000, 16_000).numpy()

# Loading the processor with its language-model decoder requires pyctcdecode and kenlm.
model = AutoModelForCTC.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
input_values = processor(resampled_audio, sampling_rate=16_000, return_tensors="pt").input_values

with torch.no_grad():
    logits = model(input_values).logits

# Decode the logits with the language model; .text holds one transcription per input.
transcription = processor.batch_decode(logits.numpy()).text
# => "и надутият му ката блоонкурем взе да се събира"
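For contrast with the LM-based decoding above, a CTC model's output can also be decoded greedily: take the most likely token per frame, collapse consecutive repeats, and drop the blank token. The sketch below shows that collapse rule on a toy example. The vocabulary, the frame predictions, and the function name `ctc_greedy_decode` are all hypothetical and are not taken from this model's tokenizer.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse repeated frame predictions, then drop CTC blanks."""
    tokens = []
    previous = None
    for idx in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != previous and idx != blank_id:
            tokens.append(id_to_token[idx])
        previous = idx
    return "".join(tokens)


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "д", 2: "а", 3: " "}
# Hypothetical per-frame argmax ids, as would come from logits.argmax(-1).
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 1, 0, 2]

print(ctc_greedy_decode(frames, toy_vocab))
# да да
```

Greedy decoding uses no linguistic context, which is one reason a language model can lower the word error rate on the same logits.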

✨ Features

Fine-Tuned Model: Based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m), fine-tuned on the MOZILLA-FOUNDATION/COMMON_VOICE_8_0 - BG dataset for Bulgarian speech recognition.
Multiple Evaluation Metrics: Evaluated on multiple datasets using WER (Word Error Rate) and CER (Character Error Rate).
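As a minimal sketch of the two metrics named above: WER is the word-level edit distance between reference and hypothesis divided by the number of reference words, and CER is the same ratio computed over characters. The helper names and the example sentences below are illustrative assumptions. Evaluation scripts usually rely on a metrics library and may normalize text differently (for example, in how spaces and punctuation are counted).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative pair: one wrong word out of three, one extra character out of eleven.
reference = "това е тест"
hypothesis = "това е текст"
print(f"WER = {wer(reference, hypothesis):.3f}")  # WER = 0.333
print(f"CER = {cer(reference, hypothesis):.3f}")  # CER = 0.091
```

A single-character mistake counts as a whole word error in WER but only as one character in CER, which is why CER is typically much lower than WER for the same output.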

📚 Documentation

Model Performance

This model achieves the following results on the evaluation set:

Loss: 0.2473
Wer: 0.3002

Model Index

Property	Details
Model Name	XLS-R-300M - Bulgarian
Task	Automatic Speech Recognition
Datasets	mozilla-foundation/common_voice_8_0, speech-recognition-community-v2/dev_data, speech-recognition-community-v2/eval_data
Metrics	WER, CER

Training and Evaluation

Training Hyperparameters

The following hyperparameters were used during training:

learning_rate: 7.5e-05
train_batch_size: 32
eval_batch_size: 16
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 1000
num_epochs: 50.0
mixed_precision_training: Native AMP
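To make the `lr_scheduler_type: linear` and `lr_scheduler_warmup_steps: 1000` settings concrete, the sketch below computes the learning rate at a given step under a linear warmup followed by a linear decay to zero. The total of 5750 steps is an assumption, estimated from roughly 115 steps per epoch over 50 epochs; the source does not state it. The function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step, base_lr=7.5e-05, warmup_steps=1000, total_steps=5750):
    """Learning rate under linear warmup followed by linear decay to zero.

    total_steps=5750 is an estimate, not a value given in the source.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: ramp linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 500, 1000, 3375, 5750):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate peaks at 7.5e-05 at step 1000 and is half that value at step 3375, midway through the decay phase.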

Training Results

Training Loss	Epoch	Step	Validation Loss	Wer
3.1589	3.48	400	3.0830	1.0
2.8921	6.96	800	2.6605	0.9982
1.3049	10.43	1200	0.5069	0.5707
1.1349	13.91	1600	0.4159	0.5041
1.0686	17.39	2000	0.3815	0.4746
0.999	20.87	2400	0.3541	0.4343
0.945	24.35	2800	0.3266	0.4132
0.9058	27.83	3200	0.2969	0.3771
0.8672	31.3	3600	0.2802	0.3553
0.8313	34.78	4000	0.2662	0.3380
0.8068	38.26	4400	0.2528	0.3181
0.7796	41.74	4800	0.2537	0.3073
0.7621	45.22	5200	0.2503	0.3036
0.7611	48.7	5600	0.2477	0.2991

Framework Versions

Transformers 4.17.0.dev0
Pytorch 1.10.2+cu102
Datasets 1.18.2.dev0
Tokenizers 0.11.0

Evaluation Results

Dataset	Split	WER % (Without LM)	WER % (With LM)
mozilla-foundation/common_voice_8_0	test	30.07	21.195
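The effect of the language model can be read off the two WER figures in the table. A short worked calculation, using only those two reported values:

```python
wer_without_lm = 30.07   # WER (%) on the Common Voice 8 test split, no LM
wer_with_lm = 21.195     # WER (%) on the same split, with LM

absolute_reduction = wer_without_lm - wer_with_lm
relative_reduction = 100 * absolute_reduction / wer_without_lm

print(f"{absolute_reduction:.3f} points, {relative_reduction:.1f}% relative")
# 8.875 points, 29.5% relative
```

In other words, LM decoding removes a little under a third of the word errors on this test set.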

🔧 Technical Details

The model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m). Fine-tuning updates the pretrained model's parameters on the MOZILLA-FOUNDATION/COMMON_VOICE_8_0 - BG dataset using the hyperparameters listed under Training Hyperparameters: a learning rate of 7.5e-05 with 1000 linear warmup steps, a training batch size of 32, and the Adam optimizer.

📄 License

The model is released under the Apache-2.0 license.
