Open-source acoustic model wav2vec2-xlsr-300m-finnish - Accurately implement automatic speech recognition for Finnish

Wav2vec2 Xlsr 300m Finnish

Developed by aapot

An acoustic model fine-tuned for Finnish automatic speech recognition based on facebook/wav2vec2-xls-r-300m, trained with 275.6 hours of annotated Finnish speech data

Speech Recognition

Transformers

OtherOpen Source License:Apache-2.0 #Finnish speech recognition #Multilingual pre-training fine-tuning #Parliament scenario optimization

Downloads 96

Release Time : 3/2/2022

Model Overview

This model is suitable for Finnish speech-to-text tasks, being a fine-tuned version of the Wav2Vec2 XLS-R pre-trained model, supporting Finnish automatic speech recognition.

Model Features

Multilingual Pre-training Foundation

Fine-tuned from the Wav2Vec2 XLS-R pre-trained model supporting 128 languages, with robust speech representation capabilities

Efficient Fine-tuning

Utilized 275.6 hours of annotated Finnish data for targeted fine-tuning, optimizing Finnish recognition performance

Supports Language Model Enhancement

Can be combined with KenLM language model to further improve transcription accuracy

Model Capabilities

Finnish speech recognition

Short audio transcription (up to 20 seconds)

Speech-to-text

Use Cases

Speech Transcription

Parliament Meeting Minutes

Transcribing Finnish parliamentary meeting audio content

Performs well on parliamentary datasets

Daily Speech Transcription

Converting Finnish daily conversations to text

Effective for standard pronunciation, limited dialect recognition

🚀 Wav2Vec2 XLS-R for Finnish ASR

This acoustic model is a fine - tuned version of facebook/wav2vec2-xls-r-300m for Finnish Automatic Speech Recognition (ASR), providing high - quality speech - to - text conversion for the Finnish language.

✨ Features

Fine - Tuned for Finnish: Based on the pre - trained facebook/wav2vec2-xls-r-300m, fine - tuned with 275.6 hours of Finnish transcribed speech data.
Multilingual Pretrained Base: Wav2Vec2 XLS - R is pretrained on 436k hours of unlabeled speech in 128 languages, including VoxPopuli, MLS, CommonVoice, BABEL, and VoxLingua107.
Available with LM Version: There is a version with KenLM language model used in the decoding phase, which can produce better transcriptions: [Finnish - NLP/wav2vec2-xlsr-300m-finnish-lm](https://huggingface.co/Finnish - NLP/wav2vec2-xlsr-300m-finnish-lm).

📦 Installation

No installation steps are provided in the original document, so this section is skipped.

💻 Usage Examples

No code examples are provided in the original document, so this section is skipped.

📚 Documentation

Intended uses & limitations

Intended Use: This model can be used for Finnish ASR (speech - to - text) task.
How to Use: Check the [run - finnish - asr - models.ipynb](https://huggingface.co/aapot/wav2vec2-xlsr-300m-finnish/blob/main/run - finnish - asr - models.ipynb) notebook in this repository for a detailed example on how to use this model.
Limitations and bias:
- Audio Length: This model was fine - tuned with audio samples of maximum length 20 seconds, so it most likely works best for short audios of similar length. However, you can try it with longer audios. If you encounter out - of - memory errors with very long audio files, you can use the audio chunking method introduced in [this blog post](https://huggingface.co/blog/asr - chunking).
- Data Domain: A vast majority of the data used for fine - tuning was from the Finnish Parliament dataset, so this model may not generalize well to very different domains like common daily spoken Finnish with dialects. Also, audios of the datasets tend to be adult male - dominated, so this model may not work as well for speeches of children and women.

Training data

This model was fine - tuned with 275.6 hours of Finnish transcribed speech data from the following datasets:

Dataset	Hours	% of total hours
[Common Voice 7.0 Finnish train + evaluation + other splits](https://huggingface.co/datasets/mozilla - foundation/common_voice_7_0)	9.70 h	3.52 %
Finnish parliament session 2	0.24 h	0.09 %
VoxPopuli Finnish	21.97 h	7.97 %
CSS10 Finnish	10.32 h	3.74 %
[Aalto Finnish Parliament ASR Corpus](http://urn.fi/urn:nbn:fi:lb - 2021051903)	228.00 h	82.73 %
[Finnish Broadcast Corpus](http://urn.fi/urn:nbn:fi:lb - 2016042502)	5.37 h	1.95 %

Datasets were filtered to include audio samples with a maximum length of 20 seconds.

Training procedure

This model was trained during the [Robust Speech Challenge Event](https://discuss.huggingface.co/t/open - to - the - community - robust - speech - recognition - challenge/13614) organized by Hugging Face. Training was done on a Tesla V100 GPU, sponsored by OVHcloud. The training script was provided by Hugging Face and is available [here](https://github.com/huggingface/transformers/blob/main/examples/research_projects/robust - speech - event/run_speech_recognition_ctc_bnb.py). Only the data loading was modified for custom datasets.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e - 04
train_batch_size: 32
eval_batch_size: 32
seed: 42
optimizer: 8 - bit Adam with betas=(0.9, 0.999) and epsilon = 1e - 08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 10
mixed_precision_training: Native AMP

The pretrained facebook/wav2vec2 - xls - r - 300m model was initialized with the following hyperparameters:

attention_dropout: 0.094
hidden_dropout: 0.047
feat_proj_dropout: 0.04
mask_time_prob: 0.082
layerdrop: 0.041
activation_dropout: 0.055
ctc_loss_reduction: "mean"

Training results

Training Loss	Epoch	Step	Validation Loss	Wer
0.973	0.17	500	0.5750	0.6844
0.713	0.34	1000	0.3356	0.4518
0.6563	0.5	1500	0.3007	0.4039
0.642	0.67	2000	0.2619	0.3674
0.6203	0.84	2500	0.2488	0.3558
0.6016	1.01	3000	0.2795	0.3835
0.5423	1.17	3500	0.2652	0.3310
0.5639	1.34	4000	0.2479	0.3462
0.586	1.51	4500	0.2409	0.3295
0.5169	1.68	5000	0.2728	0.3352
0.5176	1.84	5500	0.2254	0.3149
0.4983	2.01	6000	0.2169	0.3009
0.4982	2.18	6500	0.2215	0.3079
0.4898	2.35	7000	0.2174	0.3023
0.4922	2.51	7500	0.2217	0.3081
0.5025	2.68	8000	0.2002	0.2710
0.4745	2.85	8500	0.1935	0.2783
0.4377	3.02	9000	0.1859	0.2742
0.4511	3.18	9500	0.2038	0.2786
0.4411	3.35	10000	0.1863	0.2651
0.4501	3.52	10500	0.1948	0.2605
0.4557	3.69	11000	0.1872	0.2695
0.4493	3.85	11500	0.1888	0.2632
0.4047	4.02	12000	0.1818	0.2559
0.4319	4.19	12500	0.1896	0.2648
0.4162	4.36	13000	0.1953	0.2595
0.4046	4.52	13500	0.1864	0.2606
0.4195	4.69	14000	0.1843	0.2467
0.4146	4.86	14500	0.1686	0.2450
0.378	5.03	15000	0.1731	0.2401
0.3792	5.19	15500	0.1676	0.2325
0.3855	5.36	16000	0.1740	0.2326
0.4029	5.53	16500	0.1674	0.2345
0.386	5.7	17000	0.1735	0.2280
0.3811	5.86	17500	0.1692	0.2258
0.3607	6.03	18000	0.1797	0.2279
0.3604	6.2	18500	0.1651	0.2206
0.3362	6.37	19000	0.1627	0.2199
0.3611	6.53	19500	0.1652	0.2172
0.3671	6.7	20000	0.1564	0.2140
0.3769	6.87	20500	0.1525	0.2101
0.3539	7.04	21000	0.1639	0.2096
0.3225	7.21	21500	0.1611	0.2087
0.3323	7.37	22000	0.1633	0.2008
0.3327	7.54	22500	0.1692	0.1975
0.3456	7.71	23000	0.1555	0.1991
0.3058	7.88	23500	0.1590	0.1959
0.3034	8.04	24000	0.1531	0.1973
0.2925	8.21	24500	0.1583	0.1978
0.2967	8.38	25000	0.1546	0.1906
0.2974	8.55	25500	0.1540	0.1869
0.3131	8.71	26000	0.1534	0.1850
0.3306	8.88	26500	0.1482	0.1844
0.2842	9.05	27000	0.1490	0.1854
0.2879	9.22	27500	0.1463	0.1799
0.27	9.38	28000	0.1454	0.1798
0.2874	9.55	28500	0.1504	0.1787
0.2757	9.72	29000	0.1512	0.1784
0.3017	9.89	29500	0.1484	0.1800

Framework versions

Transformers 4.17.0.dev0
Pytorch 1.10.2+cu102
Datasets 1.18.3
Tokenizers 0.11.0

Evaluation results

Evaluation was done with the [Common Voice 7.0 Finnish test split](https://huggingface.co/datasets/mozilla - foundation/common_voice_7_0). To evaluate this model, run the eval.py script in this repository:

python3 eval.py --model_id aapot/wav2vec2-xlsr-300m-finnish --dataset mozilla-foundation/common_voice_7_0 --config fi --split test

This model (the third row of the table) achieves the following WER (Word Error Rate) and CER (Character Error Rate) results compared to other models:

	WER (with LM)	WER (without LM)	CER (with LM)	CER (without LM)
aapot/wav2vec2-xlsr-1b-finnish-lm-v2	4.09	9.73	0.88	1.65
aapot/wav2vec2-xlsr-1b-finnish-lm	5.65	13.11	1.20	2.23
aapot/wav2vec2-xlsr-300m-finnish-lm	8.16	17.92	1.97	3.36

Team Members

Aapo Tanskanen, Hugging Face profile, LinkedIn profile
Rasmus Toivanen, Hugging Face profile, LinkedIn profile

Feel free to contact us for more details 🤗

📄 License

This project is licensed under the Apache - 2.0 license.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご