Open-source wav2vec2-xlsr-1b-finnish-lm-v2 model - Free deployment facilitates Finnish automatic speech recognition

Wav2vec2 Xlsr 1b Finnish Lm V2

Developed by aapot

A fine-tuned version of Facebook's wav2vec2-xls-r-1b model for Finnish automatic speech recognition tasks, trained on 275.6 hours of annotated Finnish speech data

Speech Recognition

Transformers

License: Apache-2.0 #Finnish speech recognition #Low word error rate (4.09%) #Large model fine-tuning


Release Time: 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model for Finnish speech-to-text conversion. It combines an acoustic model with a KenLM language model and achieves a 4.09% word error rate on the Common Voice 7.0 test set.

Model Features

High-performance Finnish recognition

Achieves a 4.09% word error rate and 0.88% character error rate on the Common Voice 7.0 test set

Large-scale pre-training foundation

Fine-tuned from the 1-billion-parameter wav2vec2-xls-r-1b model, which was pre-trained on 436,000 hours of multilingual data

Integrated language model

Includes a 5-gram KenLM language model trained on Finnish text, which lowers the word error rate on the Common Voice 7.0 test set from 9.73% to 4.09%

Diverse training data

Fine-tuned on 275.6 hours of Finnish speech from several sources, including Common Voice, parliamentary sessions, and broadcasts

Model Capabilities

Finnish speech recognition

Short audio transcription (up to 20 seconds)

Speech decoding with language model

Use Cases

Speech-to-text

Meeting transcription

Automatically converts Finnish meeting recordings into text records

Suitable for formal speech with relatively high accuracy

Voice assistant

Provides speech recognition capabilities for Finnish voice assistants

Accuracy may drop on informal or dialectal speech, which is underrepresented in the training data

Media processing

Broadcast subtitle generation

Automatically generates subtitles for Finnish broadcast programs

Performs well with standard broadcast speech

🚀 Wav2Vec2 XLS-R for Finnish ASR

This acoustic model is designed for Finnish Automatic Speech Recognition (ASR). It is a fine-tuned version of facebook/wav2vec2-xls-r-1b, trained with 275.6 hours of Finnish transcribed speech data. Wav2Vec2 XLS-R was introduced in this paper and first released at this page. This repository also includes a Finnish KenLM language model for use in the decoding phase with the acoustic model.

🚀 Quick Start

To use this model for Finnish ASR, check the run-finnish-asr-models.ipynb notebook in this repository for a detailed example.
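For orientation, here is a minimal sketch of what inference might look like with the Hugging Face `transformers` speech recognition pipeline. It is not taken from the notebook. It assumes that `transformers`, `torch`, `pyctcdecode`, and `kenlm` are installed (the last two are needed for decoding with the bundled language model) and that ffmpeg is available for reading audio files. The helper names `build_asr_pipeline` and `transcribe` and the file name `example_fi.wav` are hypothetical.

```python
MODEL_ID = "aapot/wav2vec2-xlsr-1b-finnish-lm-v2"


def build_asr_pipeline(model_id: str = MODEL_ID):
    """Create a speech recognition pipeline for the given model id."""
    # Imported lazily so that this module loads even without transformers installed.
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=model_id)


def transcribe(audio_path: str, asr=None) -> str:
    """Transcribe one audio file and return the recognized text."""
    if asr is None:
        asr = build_asr_pipeline()
    result = asr(audio_path)  # the pipeline returns a dict with a "text" key
    return result["text"]


# Example call (hypothetical file name; downloads the model on first use):
# print(transcribe("example_fi.wav"))
```

Passing a prebuilt pipeline through the `asr` argument avoids reloading the 1-billion-parameter model for every file.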

✨ Features

Fine-tuned for Finnish: Specifically tailored for Finnish ASR, leveraging 275.6 hours of Finnish transcribed speech data.
Multilingual Pretrained Base: Built on the large-scale multilingual pretrained model Wav2Vec2 XLS-R, pretrained on 436k hours of unlabeled speech across 128 languages.
Language Model Included: Comes with a Finnish KenLM language model for decoding.


📚 Documentation

Model description

Wav2Vec2 XLS-R is a large-scale multilingual pretrained speech model from Facebook AI. It is pretrained on 436k hours of unlabeled speech from sources including VoxPopuli, MLS, CommonVoice, BABEL, and VoxLingua107, using the wav2vec 2.0 objective in 128 languages. You can learn more about the pretrained model from this blog and this paper. This model is a fine-tuned version of the 1-billion-parameter variant for Finnish ASR.

Intended uses & limitations

Intended use: This model can be used for Finnish ASR (speech-to-text) tasks.
Limitations:
- Audio length: The model was fine-tuned on audio samples of at most 20 seconds, so it likely works best on short audio of similar length. For very long audio files, you can try the audio chunking method introduced in this blog post if you encounter out-of-memory errors.
- Domain generalization: A large part of the fine-tuning data comes from the Finnish Parliament dataset, so the model may not generalize well to other domains, such as everyday spoken Finnish with dialects.
- Gender bias: The datasets are dominated by adult male voices, so the model may perform worse on children's and women's speech.
- Language model generalization: The Finnish KenLM language model used in decoding was trained on text from the audio transcriptions and a subset of Finnish Wikipedia. It may not generalize well to other language styles, such as everyday spoken language with dialects. Training your own KenLM language model for your specific domain may be beneficial.
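To illustrate the audio length limitation, the sketch below computes overlapping windows of at most 20 seconds over a long recording, so that each window resembles the fine-tuning data. This is a simplified stand-in and not the method from the blog post. The function name `chunk_bounds` and the 2-second overlap are assumptions chosen for illustration. The 16 kHz sampling rate is the input rate wav2vec 2.0 models expect.

```python
def chunk_bounds(num_samples, sample_rate=16_000, chunk_s=20.0, overlap_s=2.0):
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    chunk = int(chunk_s * sample_rate)           # window length in samples
    step = chunk - int(overlap_s * sample_rate)  # hop between window starts
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:  # the last window reached the end of the audio
            break
        start += step
    return bounds


# A 50-second recording at 16 kHz is covered by three windows.
windows = chunk_bounds(50 * 16_000)
```

Each window can then be transcribed separately. Merging the overlapping transcripts is not shown here. The `transformers` speech recognition pipeline also accepts `chunk_length_s` and `stride_length_s` arguments that handle chunking and merging internally.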

Training data

This model was fine-tuned with 275.6 hours of Finnish transcribed speech data from the following datasets:

Dataset	Hours	% of total hours
Common Voice 7.0 Finnish train + evaluation + other splits	9.70 h	3.52 %
Finnish parliament session 2	0.24 h	0.09 %
VoxPopuli Finnish	21.97 h	7.97 %
CSS10 Finnish	10.32 h	3.74 %
Aalto Finnish Parliament ASR Corpus	228.00 h	82.73 %
Finnish Broadcast Corpus	5.37 h	1.95 %

The datasets were filtered to include audio samples of maximum 20 seconds in length.
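The percentages in the table follow directly from the hour counts, and the 20-second filter is a simple duration check. The sketch below reproduces both. The helper names `shares` and `filter_by_duration` and the `duration_s` field are hypothetical and not part of the original training code.

```python
HOURS = {
    "Common Voice 7.0 Finnish train + evaluation + other splits": 9.70,
    "Finnish parliament session 2": 0.24,
    "VoxPopuli Finnish": 21.97,
    "CSS10 Finnish": 10.32,
    "Aalto Finnish Parliament ASR Corpus": 228.00,
    "Finnish Broadcast Corpus": 5.37,
}


def shares(hours):
    """Return each dataset's share of the total hours, in percent (2 decimals)."""
    total = sum(hours.values())
    return {name: round(100 * h / total, 2) for name, h in hours.items()}


def filter_by_duration(samples, max_s=20.0):
    """Keep samples whose (hypothetical) 'duration_s' field is at most max_s seconds."""
    return [s for s in samples if s["duration_s"] <= max_s]


TOTAL_HOURS = round(sum(HOURS.values()), 2)
SHARES = shares(HOURS)
```

The two parliament corpora together account for roughly 83% of the data, which is the basis for the domain generalization caveat above.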

Training procedure

Event: This model was trained during the Robust Speech Challenge Event organized by Hugging Face.
Hardware: Training was done on a Tesla V100 GPU, sponsored by OVHcloud.
Training script: The training script was provided by Hugging Face and is available here. Only the data loading was modified for custom datasets.
KenLM language model training: The training of the 5-gram KenLM language model followed the blog post tutorial provided by Hugging Face. The training data included text transcriptions of the audio training data and 100k random samples of the cleaned Finnish Wikipedia (August 2021) dataset.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 32
eval_batch_size: 8
seed: 42
optimizer: 8-bit Adam with betas=(0.9, 0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 10
mixed_precision_training: Native AMP
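As an illustration of how the `linear` scheduler interacts with the 500 warmup steps, the sketch below computes the learning rate at a given step: it rises linearly from 0 to the peak of 5e-05 during warmup and then decays linearly to 0. The function name `linear_lr` is hypothetical. The total of 30,000 steps is an assumption chosen for illustration, since the exact number of training steps is not stated (the log below ends at step 29,500).

```python
def linear_lr(step, total_steps, base_lr=5e-5, warmup_steps=500):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


ASSUMED_TOTAL_STEPS = 30_000  # illustrative assumption, not from the source
schedule = [linear_lr(s, ASSUMED_TOTAL_STEPS) for s in (0, 250, 500, 15_250, 30_000)]
```

Under these assumptions the rate peaks at step 500 and is halved again around step 15,250, midway through the decay phase.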

The pretrained facebook/wav2vec2-xls-r-1b model was initialized with the following hyperparameters:

attention_dropout: 0.094
hidden_dropout: 0.047
feat_proj_dropout: 0.04
mask_time_prob: 0.082
layerdrop: 0.041
activation_dropout: 0.055
ctc_loss_reduction: "mean"

Training results

Training Loss	Epoch	Step	Validation Loss	WER
0.7778	0.17	500	0.2851	0.3572
0.5506	0.34	1000	0.1595	0.2130
0.6569	0.5	1500	0.1458	0.2046
0.5997	0.67	2000	0.1374	0.1975
0.542	0.84	2500	0.1390	0.1956
0.4815	1.01	3000	0.1266	0.1813
0.6982	1.17	3500	0.1441	0.1965
0.4522	1.34	4000	0.1232	0.1822
0.4655	1.51	4500	0.1209	0.1702
0.4069	1.68	5000	0.1149	0.1688
0.4226	1.84	5500	0.1121	0.1560
0.3993	2.01	6000	0.1091	0.1557
0.406	2.18	6500	0.1115	0.1553
0.4098	2.35	7000	0.1144	0.1560
0.3995	2.51	7500	0.1028	0.1476
0.4101	2.68	8000	0.1129	0.1511
0.3636	2.85	8500	0.1025	0.1517
0.3534	3.02	9000	0.1068	0.1480
0.3836	3.18	9500	0.1072	0.1459
0.3531	3.35	10000	0.0928	0.1367
0.3649	3.52	10500	0.1042	0.1426
0.3645	3.69	11000	0.0979	0.1433
0.3685	3.85	11500	0.0947	0.1346
0.3325	4.02	12000	0.0991	0.1352
0.3497	4.19	12500	0.0919	0.1358
0.3303	4.36	13000	0.0888	0.1272
0.3323	4.52	13500	0.0888	0.1277
0.3452	4.69	14000	0.0894	0.1279
0.337	4.86	14500	0.0917	0.1289
0.3114	5.03	15000	0.0942	0.1313
0.3099	5.19	15500	0.0902	0.1239
0.3079	5.36	16000	0.0871	0.1256
0.3293	5.53	16500	0.0861	0.1263
0.3123	5.7	17000	0.0876	0.1203
0.3093	5.86	17500	0.0848	0.1226
0.2903	6.03	18000	0.0914	0.1221
0.297	6.2	18500	0.0841	0.1185
0.2797	6.37	19000	0.0858	0.1165
0.2878	6.53	19500	0.0874	0.1161
0.2974	6.7	20000	0.0835	0.1173
0.3051	6.87	20500	0.0835	0.1178
0.2941	7.04	21000	0.0852	0.1155
0.258	7.21	21500	0.0832	0.1132
0.2778	7.37	22000	0.0829	0.1110
0.2751	7.54	22500	0.0822	0.1069
0.2887	7.71	23000	0.0819	0.1103
0.2509	7.88	23500	0.0787	0.1055
0.2501	8.04	24000	0.0807	0.1076
0.2399	8.21	24500	0.0784	0.1052
0.2539	8.38	25000	0.0772	0.1075
0.248	8.55	25500	0.0772	0.1055
0.2689	8.71	26000	0.0763	0.1027
0.2855	8.88	26500	0.0756	0.1035
0.2421	9.05	27000	0.0771	0.0998
0.2497	9.22	27500	0.0756	0.0971
0.2367	9.38	28000	0.0741	0.0974
0.2473	9.55	28500	0.0739	0.0982
0.2396	9.72	29000	0.0756	0.0991
0.2602	9.89	29500	0.0737	0.0975

Framework versions

Transformers 4.17.0.dev0
PyTorch 1.10.2+cu102
Datasets 1.18.3
Tokenizers 0.11.0

Evaluation results

Evaluation was done with the Common Voice 7.0 Finnish test split. To evaluate this model, run the eval.py script in this repository:

python3 eval.py --model_id aapot/wav2vec2-xlsr-1b-finnish-lm-v2 --dataset mozilla-foundation/common_voice_7_0 --config fi --split test

This model (the first row of the table) achieves the following WER (Word Error Rate) and CER (Character Error Rate) results compared to other models:

Model	WER (with LM)	WER (without LM)	CER (with LM)	CER (without LM)
aapot/wav2vec2-xlsr-1b-finnish-lm-v2	4.09	9.73	0.88	1.65
aapot/wav2vec2-xlsr-1b-finnish-lm	5.65	13.11	1.20	2.23
aapot/wav2vec2-xlsr-300m-finnish-lm	8.16	17.92	1.97	3.36
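For readers unfamiliar with the two metrics, the sketch below shows how WER and CER are commonly defined: the Levenshtein edit distance between reference and hypothesis, counted over words or characters, divided by the reference length. This is a generic illustration and not the code in eval.py, whose text normalization and aggregation may differ. The table reports the values as percentages, while these functions return fractions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction of the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a fraction of the reference character count."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

A single misspelled word in a three-word reference gives a WER of one third, while the CER for the same pair is much lower because only one character differs.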

🔧 Technical Details

The model is based on the Wav2Vec2 XLS-R architecture. The fine-tuning process involves adjusting the pretrained model's parameters using Finnish transcribed speech data. The use of the KenLM language model in the decoding phase helps improve the accuracy of speech recognition.
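The effect of the language model can be quantified from the evaluation table by computing the relative error reduction between decoding without and with the LM. The sketch below does this for the three listed models. The helper name `relative_reduction` and the dictionary layout are hypothetical.

```python
# WER values (in percent) copied from the evaluation table: (with LM, without LM).
WER_RESULTS = {
    "aapot/wav2vec2-xlsr-1b-finnish-lm-v2": (4.09, 9.73),
    "aapot/wav2vec2-xlsr-1b-finnish-lm": (5.65, 13.11),
    "aapot/wav2vec2-xlsr-300m-finnish-lm": (8.16, 17.92),
}


def relative_reduction(with_lm, without_lm):
    """Relative error reduction in percent when going from no LM to LM decoding."""
    return 100 * (without_lm - with_lm) / without_lm


REDUCTIONS = {
    model: round(relative_reduction(with_lm, without_lm), 1)
    for model, (with_lm, without_lm) in WER_RESULTS.items()
}
```

For this model, LM decoding removes about 58% of the word errors made by the acoustic model alone on the Common Voice 7.0 test split.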

📄 License

This model is released under the Apache-2.0 license.

Team Members

Aapo Tanskanen, Hugging Face profile, LinkedIn profile
Rasmus Toivanen, Hugging Face profile, LinkedIn profile

Feel free to contact us for more details 🤗
