Wav2Vec2 XLS-R for Finnish ASR
This acoustic model is designed for Finnish Automatic Speech Recognition (ASR). It is a fine-tuned version of facebook/wav2vec2-xls-r-1b, trained on 275.6 hours of transcribed Finnish speech. Wav2Vec2 XLS-R was introduced in this paper and first released at this page.
Note: A version that uses a KenLM language model in the decoding phase produces better transcriptions: Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2
Features
- Multilingual Pretraining: Wav2Vec2 XLS-R is a large-scale multilingual pretrained speech model, pretrained on 436k hours of unlabeled speech in 128 languages drawn from VoxPopuli, MLS, CommonVoice, BABEL, and VoxLingua107.
- Fine-Tuned for Finnish: The model is fine-tuned specifically for Finnish ASR on six Finnish speech datasets.
Documentation
Model description
Wav2Vec2 XLS-R is Facebook AI's large-scale multilingual pretrained model for speech. It uses the wav2vec 2.0 objective and is pretrained on a vast amount of unlabeled speech data from multiple sources. This particular model is a fine-tuned variant of the 1-billion-parameter version for Finnish ASR.
You can read more about the pretrained model from this blog and this paper.
Intended uses & limitations
How to use
Check the run-finnish-asr-models.ipynb notebook in this repository for a detailed example of how to use this model.
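As a rough illustration of what such usage might look like, the sketch below transcribes one audio file with greedy CTC decoding through the Transformers library. The model ID is the one used in the evaluation command of this card; the `transcribe` helper, the use of `librosa` for loading, and the 16 kHz resampling step are assumptions made for illustration and are not taken from the notebook.

```python
MODEL_ID = "aapot/wav2vec2-xlsr-1b-finnish-v2"  # ID as used in this card's eval command
SAMPLE_RATE = 16_000  # wav2vec 2.0 checkpoints expect 16 kHz mono audio


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file with greedy CTC decoding (no language model)."""
    # Heavy dependencies are imported lazily, only when the function is called.
    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)
    model.eval()

    # Load the file and resample it to 16 kHz mono.
    speech, _ = librosa.load(audio_path, sr=SAMPLE_RATE, mono=True)
    inputs = processor(speech, sampling_rate=SAMPLE_RATE, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    # Take the most likely token per frame; decoding collapses repeats and blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe("sample.wav")`, where `sample.wav` is a hypothetical file name, would download the checkpoint on first use and return the transcription as a string.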
Limitations and bias
- Audio Length: The model was fine-tuned on audio samples of at most 20 seconds, so it likely works best on short audio clips of similar length. For very long audio files, you can use the audio chunking method introduced in [this blog post](https://huggingface.co/blog/asr-chunking) if you encounter out-of-memory errors.
- Domain Generalization: A large portion of the fine-tuning data comes from the Finnish Parliament dataset, so the model may not generalize well to other domains, such as everyday spoken Finnish with dialects.
- Gender and Age Bias: The audio in the datasets comes mostly from adult males, so the model may perform worse on speech from children and women.
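To make the chunking idea concrete, here is a minimal sketch of how a long recording could be split into overlapping windows no longer than the 20-second fine-tuning limit. The `chunk_spans` helper, its parameter names, and the 2-second stride are illustrative assumptions; this is a simplification of the approach described in the blog post, not the library's implementation.

```python
def chunk_spans(n_samples, sample_rate=16_000, chunk_s=20.0, stride_s=2.0):
    """Return (start, end) sample indices of overlapping windows covering the audio.

    Each window is at most `chunk_s` seconds long. Neighbouring windows overlap
    by 2 * `stride_s` seconds, so the predictions near each window edge can be
    discarded when the pieces are stitched back together.
    """
    chunk = int(chunk_s * sample_rate)
    stride = int(stride_s * sample_rate)
    if chunk <= 2 * stride:
        raise ValueError("chunk_s must be larger than twice stride_s")
    if n_samples <= 0:
        return []

    step = chunk - 2 * stride  # how far each window advances
    spans = []
    start = 0
    while True:
        end = min(start + chunk, n_samples)
        spans.append((start, end))
        if end == n_samples:
            break
        start += step
    return spans
```

In the Transformers ASR pipeline, the corresponding settings are the `chunk_length_s` and `stride_length_s` arguments, which handle the splitting and stitching internally.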
Training data
This model was fine-tuned with 275.6 hours of Finnish transcribed speech data from the following datasets:
Dataset | Hours (share of total) |
---|---|
[Common Voice 7.0 Finnish train + evaluation + other splits](https://huggingface.co/datasets/mozilla-foundation/common_voice_7_0) | 9.70 h (3.52%) |
Finnish parliament session 2 | 0.24 h (0.09%) |
VoxPopuli Finnish | 21.97 h (7.97%) |
CSS10 Finnish | 10.32 h (3.74%) |
[Aalto Finnish Parliament ASR Corpus](http://urn.fi/urn:nbn:fi:lb-2021051903) | 228.00 h (82.73%) |
[Finnish Broadcast Corpus](http://urn.fi/urn:nbn:fi:lb-2016042502) | 5.37 h (1.95%) |
The datasets were filtered to include audio samples with a maximum length of 20 seconds.
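The length filter can be sketched as follows. This is an illustration only: the sample layout (an `audio` sequence plus a `sampling_rate`) and the `filter_by_duration` helper are hypothetical, and the preprocessing code actually used for this model is not shown in this card.

```python
MAX_DURATION_S = 20.0  # maximum clip length stated in this card


def filter_by_duration(samples, max_duration_s=MAX_DURATION_S):
    """Keep only samples whose audio is at most `max_duration_s` seconds long.

    Each sample is assumed to be a dict with an "audio" sequence of raw
    samples and an integer "sampling_rate" in Hz.
    """
    kept = []
    for sample in samples:
        duration_s = len(sample["audio"]) / sample["sampling_rate"]
        if duration_s <= max_duration_s:
            kept.append(sample)
    return kept
```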
Training procedure
This model was trained during the [Robust Speech Challenge Event](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614) organized by Hugging Face. Training was done on a Tesla V100 GPU, sponsored by OVHcloud.
The training script was provided by Hugging Face and is available [here](https://github.com/huggingface/transformers/blob/main/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_bnb.py). Only the data loading was modified for custom datasets.
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- optimizer: 8-bit Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 10
- mixed_precision_training: Native AMP
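For intuition, the linear schedule with 500 warmup steps can be written as a small function: the learning rate rises linearly from 0 to the peak of 5e-05 during warmup, then decays linearly to 0 at the final step. The total number of training steps is not stated in this card; the 29,700 used below is an assumption, chosen only because the last logged step in the training results is 29,500.

```python
def linear_lr(step, peak_lr=5e-05, warmup_steps=500, total_steps=29_700):
    """Learning rate at `step` for linear warmup followed by linear decay to zero.

    `total_steps` is an assumed value, not one reported in the model card.
    """
    if step < warmup_steps:
        # Warmup phase: scale up linearly from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale down linearly from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```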
The pretrained facebook/wav2vec2-xls-r-1b model was initialized with the following hyperparameters:
- attention_dropout: 0.094
- hidden_dropout: 0.047
- feat_proj_dropout: 0.04
- mask_time_prob: 0.082
- layerdrop: 0.041
- activation_dropout: 0.055
- ctc_loss_reduction: "mean"
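These names match configuration fields of the Wav2Vec2 model class in Transformers. A hedged sketch of passing them when loading the pretrained checkpoint is shown below; the `load_for_finetuning` helper is hypothetical, and task-specific settings that CTC fine-tuning would also need, such as the vocabulary size and padding token, are left to the caller.

```python
CONFIG_OVERRIDES = {
    "attention_dropout": 0.094,
    "hidden_dropout": 0.047,
    "feat_proj_dropout": 0.04,
    "mask_time_prob": 0.082,
    "layerdrop": 0.041,
    "activation_dropout": 0.055,
    "ctc_loss_reduction": "mean",
}


def load_for_finetuning(model_id="facebook/wav2vec2-xls-r-1b", **extra_config):
    """Load the pretrained checkpoint with the overrides applied to its config.

    Extra keyword arguments (for example vocab_size) take precedence over
    the values in CONFIG_OVERRIDES.
    """
    # Imported lazily: calling this function downloads the model weights.
    from transformers import Wav2Vec2ForCTC

    config_kwargs = {**CONFIG_OVERRIDES, **extra_config}
    return Wav2Vec2ForCTC.from_pretrained(model_id, **config_kwargs)
```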
Training results
Training Loss | Epoch | Step | Validation Loss | WER |
---|---|---|---|---|
0.7778 | 0.17 | 500 | 0.2851 | 0.3572 |
0.5506 | 0.34 | 1000 | 0.1595 | 0.2130 |
0.6569 | 0.5 | 1500 | 0.1458 | 0.2046 |
0.5997 | 0.67 | 2000 | 0.1374 | 0.1975 |
0.542 | 0.84 | 2500 | 0.1390 | 0.1956 |
0.4815 | 1.01 | 3000 | 0.1266 | 0.1813 |
0.6982 | 1.17 | 3500 | 0.1441 | 0.1965 |
0.4522 | 1.34 | 4000 | 0.1232 | 0.1822 |
0.4655 | 1.51 | 4500 | 0.1209 | 0.1702 |
0.4069 | 1.68 | 5000 | 0.1149 | 0.1688 |
0.4226 | 1.84 | 5500 | 0.1121 | 0.1560 |
0.3993 | 2.01 | 6000 | 0.1091 | 0.1557 |
0.406 | 2.18 | 6500 | 0.1115 | 0.1553 |
0.4098 | 2.35 | 7000 | 0.1144 | 0.1560 |
0.3995 | 2.51 | 7500 | 0.1028 | 0.1476 |
0.4101 | 2.68 | 8000 | 0.1129 | 0.1511 |
0.3636 | 2.85 | 8500 | 0.1025 | 0.1517 |
0.3534 | 3.02 | 9000 | 0.1068 | 0.1480 |
0.3836 | 3.18 | 9500 | 0.1072 | 0.1459 |
0.3531 | 3.35 | 10000 | 0.0928 | 0.1367 |
0.3649 | 3.52 | 10500 | 0.1042 | 0.1426 |
0.3645 | 3.69 | 11000 | 0.0979 | 0.1433 |
0.3685 | 3.85 | 11500 | 0.0947 | 0.1346 |
0.3325 | 4.02 | 12000 | 0.0991 | 0.1352 |
0.3497 | 4.19 | 12500 | 0.0919 | 0.1358 |
0.3303 | 4.36 | 13000 | 0.0888 | 0.1272 |
0.3323 | 4.52 | 13500 | 0.0888 | 0.1277 |
0.3452 | 4.69 | 14000 | 0.0894 | 0.1279 |
0.337 | 4.86 | 14500 | 0.0917 | 0.1289 |
0.3114 | 5.03 | 15000 | 0.0942 | 0.1313 |
0.3099 | 5.19 | 15500 | 0.0902 | 0.1239 |
0.3079 | 5.36 | 16000 | 0.0871 | 0.1256 |
0.3293 | 5.53 | 16500 | 0.0861 | 0.1263 |
0.3123 | 5.7 | 17000 | 0.0876 | 0.1203 |
0.3093 | 5.86 | 17500 | 0.0848 | 0.1226 |
0.2903 | 6.03 | 18000 | 0.0914 | 0.1221 |
0.297 | 6.2 | 18500 | 0.0841 | 0.1185 |
0.2797 | 6.37 | 19000 | 0.0858 | 0.1165 |
0.2878 | 6.53 | 19500 | 0.0874 | 0.1161 |
0.2974 | 6.7 | 20000 | 0.0835 | 0.1173 |
0.3051 | 6.87 | 20500 | 0.0835 | 0.1178 |
0.2941 | 7.04 | 21000 | 0.0852 | 0.1155 |
0.258 | 7.21 | 21500 | 0.0832 | 0.1132 |
0.2778 | 7.37 | 22000 | 0.0829 | 0.1110 |
0.2751 | 7.54 | 22500 | 0.0822 | 0.1069 |
0.2887 | 7.71 | 23000 | 0.0819 | 0.1103 |
0.2509 | 7.88 | 23500 | 0.0787 | 0.1055 |
0.2501 | 8.04 | 24000 | 0.0807 | 0.1076 |
0.2399 | 8.21 | 24500 | 0.0784 | 0.1052 |
0.2539 | 8.38 | 25000 | 0.0772 | 0.1075 |
0.248 | 8.55 | 25500 | 0.0772 | 0.1055 |
0.2689 | 8.71 | 26000 | 0.0763 | 0.1027 |
0.2855 | 8.88 | 26500 | 0.0756 | 0.1035 |
0.2421 | 9.05 | 27000 | 0.0771 | 0.0998 |
0.2497 | 9.22 | 27500 | 0.0756 | 0.0971 |
0.2367 | 9.38 | 28000 | 0.0741 | 0.0974 |
0.2473 | 9.55 | 28500 | 0.0739 | 0.0982 |
0.2396 | 9.72 | 29000 | 0.0756 | 0.0991 |
0.2602 | 9.89 | 29500 | 0.0737 | 0.0975 |
Framework versions
- Transformers 4.17.0.dev0
- Pytorch 1.10.2+cu102
- Datasets 1.18.3
- Tokenizers 0.11.0
Evaluation results
Evaluation was done with the [Common Voice 7.0 Finnish test split](https://huggingface.co/datasets/mozilla-foundation/common_voice_7_0).
To evaluate this model, run the eval.py script in this repository:
python3 eval.py --model_id aapot/wav2vec2-xlsr-1b-finnish-v2 --dataset mozilla-foundation/common_voice_7_0 --config fi --split test
This model (the first row of the table) achieves the following WER (Word Error Rate) and CER (Character Error Rate) results compared to other models:
Model | WER (with LM) | WER (without LM) | CER (with LM) | CER (without LM) |
---|---|---|---|---|
aapot/wav2vec2-xlsr-1b-finnish-lm-v2 | 4.09 | 9.73 | 0.88 | 1.65 |
aapot/wav2vec2-xlsr-1b-finnish-lm | 5.65 | 13.11 | 1.20 | 2.23 |
aapot/wav2vec2-xlsr-300m-finnish-lm | 8.16 | 17.92 | 1.97 | 3.36 |
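For reference, WER and CER are both edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the reference into the hypothesis, divided by the reference length in words or characters. The sketch below is a plain-Python illustration with hypothetical helper names; it is not the code in eval.py, and it applies no text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```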
Team Members
- Aapo Tanskanen, Hugging Face profile, LinkedIn profile
- Rasmus Toivanen, Hugging Face profile, LinkedIn profile
Feel free to contact us for more details.
Technical Details
The model is based on the Wav2Vec2 XLS-R architecture, a multilingual pretrained model for speech. Fine-tuning adapts the pretrained parameters to Finnish ASR using the hyperparameters and datasets listed above. The 8-bit Adam optimizer and native AMP mixed-precision training help keep training efficient on a Tesla V100 GPU.
License
The model is licensed under the Apache 2.0 license.

