Wav2vec2-xls-r-1b for Finnish ASR
This acoustic model is designed for Finnish automatic speech recognition (ASR). It is a fine-tuned version of facebook/wav2vec2-xls-r-1b, trained on 275.6 hours of transcribed Finnish speech. The Wav2Vec2 XLS-R model was introduced in this paper and first released at this page. This repository also includes a Finnish KenLM language model for use with the acoustic model in the decoding phase.
Note: This model is identical to the aapot/wav2vec2-xlsr-1b-finnish-lm-v2 model, which has been transferred to the Finnish-NLP Hugging Face organization.
Features
- Fine-tuned for Finnish: Specifically optimized for Finnish ASR tasks.
- Multilingual pretraining: Based on Wav2Vec2 XLS-R, a large-scale multilingual pretrained model.
- Language model included: Comes with a Finnish KenLM language model for decoding.
Documentation
Model description
Wav2Vec2 XLS-R is a large-scale multilingual pretrained speech model from Facebook AI. It is pretrained on 436k hours of unlabeled speech from sources including VoxPopuli, MLS, CommonVoice, BABEL, and VoxLingua107, using the wav2vec 2.0 objective across 128 languages. You can find more details about the pretrained model in this blog and this paper. This model is a fine-tuned version of the 1-billion-parameter variant for Finnish ASR.
Intended uses & limitations
How to use
For a detailed example of using this model, check the [run-finnish-asr-models.ipynb](https://huggingface.co/Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2/blob/main/run-finnish-asr-models.ipynb) notebook in this repository.
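For quick experimentation outside the notebook, the following is a minimal sketch and not the notebook's own code. It assumes the `transformers` library is installed, along with `pyctcdecode` and `kenlm` so that the bundled language model can be used during decoding; without them the pipeline is expected to fall back to plain CTC decoding. The function name `transcribe` is hypothetical.

```python
MODEL_ID = "Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2"


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file with the Hugging Face ASR pipeline."""
    # Imported lazily so that defining this helper does not itself
    # require transformers to be installed.
    from transformers import pipeline

    # Downloads the acoustic model (and the language model, if usable)
    # on first call.
    asr = pipeline("automatic-speech-recognition", model=model_id)
    result = asr(audio_path)
    return result["text"]
```

Calling `transcribe("sample.wav")` with a path to a local recording (the file name here is a placeholder) should return the transcript as a string. Reading audio from a file path typically requires `ffmpeg` to be available on the system.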
Limitations and bias
- Audio length: This model was fine-tuned on audio samples with a maximum length of 20 seconds, so it likely performs best on clips of similar length. You can still try it on longer audio. If you encounter out-of-memory errors with very long audio files, you can use the audio chunking method described in [this blog post](https://huggingface.co/blog/asr-chunking).
- Domain generalization: A large portion of the fine-tuning data came from the Finnish Parliament dataset, so the model may not generalize well to other domains, such as everyday spoken Finnish with dialects. The datasets are also male-dominated, so the model may not work as well for children's and women's speech.
- Language model generalization: The Finnish KenLM language model used in decoding was trained on text from the audio transcriptions and a subset of Finnish Wikipedia. It may therefore not generalize well to other language varieties, such as everyday spoken language with dialects. It may be beneficial to train your own KenLM language model for your specific domain.
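To illustrate the chunking idea mentioned under audio length, here is a minimal sketch that computes overlapping chunk boundaries for a long waveform. It is not the blog post's implementation. The 16 kHz sampling rate matches what wav2vec 2.0 models expect, while the 20-second chunk size (chosen to mirror the fine-tuning limit) and the 2-second overlap are illustrative assumptions; the name `chunk_bounds` is hypothetical.

```python
from typing import List, Tuple


def chunk_bounds(num_samples: int, sample_rate: int = 16_000,
                 chunk_s: float = 20.0,
                 overlap_s: float = 2.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping chunks."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk_s must be positive and larger than overlap_s")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last chunk reached the end of the audio
        start += step  # consecutive chunks share overlap_s seconds
    return bounds
```

In practice the `transformers` ASR pipeline can do this internally through its `chunk_length_s` and `stride_length_s` call arguments, which also take care of merging the overlapping predictions; the sketch above only shows how the windows are laid out.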
Training data
This model was fine-tuned with 275.6 hours of Finnish transcribed speech data from the following datasets:
Dataset | Hours | % of total hours |
---|---|---|
[Common Voice 7.0 Finnish train + evaluation + other splits](https://huggingface.co/datasets/mozilla-foundation/common_voice_7_0) | 9.70 h | 3.52 % |
Finnish parliament session 2 | 0.24 h | 0.09 % |
VoxPopuli Finnish | 21.97 h | 7.97 % |
CSS10 Finnish | 10.32 h | 3.74 % |
[Aalto Finnish Parliament ASR Corpus](http://urn.fi/urn:nbn:fi:lb-2021051903) | 228.00 h | 82.73 % |
[Finnish Broadcast Corpus](http://urn.fi/urn:nbn:fi:lb-2016042502) | 5.37 h | 1.95 % |
The datasets were filtered to include audio samples with a maximum length of 20 seconds.
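The percentage column in the table above follows directly from the hour counts. As a small worked example, the sketch below recomputes the total and each share, and shows one way the 20-second length filter could be expressed. The helper name `keep_sample` is hypothetical, and treating the limit as inclusive is an assumption.

```python
# Hours per dataset, copied from the table above.
HOURS = {
    "Common Voice 7.0 Finnish": 9.70,
    "Finnish parliament session 2": 0.24,
    "VoxPopuli Finnish": 21.97,
    "CSS10 Finnish": 10.32,
    "Aalto Finnish Parliament ASR Corpus": 228.00,
    "Finnish Broadcast Corpus": 5.37,
}

total_hours = sum(HOURS.values())

# Share of each dataset in percent, rounded to two decimals as in the table.
shares = {name: round(100 * h / total_hours, 2) for name, h in HOURS.items()}

MAX_SECONDS = 20.0


def keep_sample(duration_s: float) -> bool:
    """Keep only clips no longer than the fine-tuning length limit."""
    return duration_s <= MAX_SECONDS
```

The recomputed shares show that the Aalto Finnish Parliament ASR Corpus alone contributes more than four fifths of the fine-tuning audio, which is the basis for the domain generalization caveat.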
Training procedure
This model was trained during the [Robust Speech Challenge Event](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614) organized by Hugging Face. The training was conducted on a Tesla V100 GPU sponsored by OVHcloud.
The training script was provided by Hugging Face and is available [here](https://github.com/huggingface/transformers/blob/main/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_bnb.py). Only the data loading was modified for custom datasets.
For the KenLM language model training, the [blog post tutorial](https://huggingface.co/blog/wav2vec2-with-ngram) provided by Hugging Face was followed. The training data for the 5-gram KenLM consisted of text transcriptions of the audio training data and 100k random samples from the cleaned Finnish Wikipedia (August 2021) dataset.
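As a rough illustration of how such a language model corpus could be assembled, the sketch below combines transcripts with a random sample of Wikipedia texts. This is not the authors' preprocessing code: the function name `build_lm_corpus`, the seed, and the absence of any text normalization are all assumptions.

```python
import random
from typing import Iterable, List


def build_lm_corpus(transcripts: Iterable[str], wiki_texts: List[str],
                    n_wiki: int = 100_000, seed: int = 0) -> List[str]:
    """Combine training transcripts with a random sample of Wikipedia texts."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    sample = rng.sample(wiki_texts, min(n_wiki, len(wiki_texts)))
    lines = [t.strip() for t in transcripts] + [s.strip() for s in sample]
    # KenLM reads plain text with one sentence per line; drop empty lines.
    return [line for line in lines if line]
```

The resulting lines would be written to a text file, one per line, and passed to KenLM. The tutorial linked above builds the model with KenLM's `lmplz -o 5` command, reading the text on standard input and writing an ARPA file to standard output.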
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- optimizer: 8-bit Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 10
- mixed_precision_training: Native AMP
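To make the scheduler settings concrete, the following sketch computes the learning rate at a given step for a linear schedule with warmup, using the listed peak rate of 5e-05 and 500 warmup steps. The total of 30,000 steps used in the example is an assumption, rounded from the roughly 29,500 steps logged over 10 epochs in the training results; `linear_lr` is a hypothetical name.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-5,
              warmup_steps: int = 500) -> float:
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Example with an assumed total of 30,000 optimizer steps.
TOTAL_STEPS = 30_000
peak = linear_lr(500, TOTAL_STEPS)
halfway = linear_lr(15_250, TOTAL_STEPS)
```

Under these assumptions the rate peaks at step 500 and has dropped to half its peak by step 15,250, the midpoint of the decay phase.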
The pretrained facebook/wav2vec2-xls-r-1b model was initialized with the following hyperparameters:
- attention_dropout: 0.094
- hidden_dropout: 0.047
- feat_proj_dropout: 0.04
- mask_time_prob: 0.082
- layerdrop: 0.041
- activation_dropout: 0.055
- ctc_loss_reduction: "mean"
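One way these values could be applied when loading the base checkpoint is sketched below. It is an illustration under assumptions, not the training script's code: it relies on `from_pretrained` forwarding extra keyword arguments to the model configuration, and it leaves out tokenizer-dependent settings such as the vocabulary size and padding token that an actual fine-tuning run would also need. The name `load_base_model` is hypothetical.

```python
BASE_MODEL = "facebook/wav2vec2-xls-r-1b"

# Configuration overrides, copied from the list above.
CONFIG_OVERRIDES = {
    "attention_dropout": 0.094,
    "hidden_dropout": 0.047,
    "feat_proj_dropout": 0.04,
    "mask_time_prob": 0.082,
    "layerdrop": 0.041,
    "activation_dropout": 0.055,
    "ctc_loss_reduction": "mean",
}


def load_base_model():
    """Load the pretrained checkpoint with the overrides applied."""
    # Imported lazily; calling this downloads the 1B-parameter checkpoint.
    from transformers import AutoModelForCTC

    return AutoModelForCTC.from_pretrained(BASE_MODEL, **CONFIG_OVERRIDES)
```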
Training results
Training Loss | Epoch | Step | Validation Loss | Wer |
---|---|---|---|---|
0.7778 | 0.17 | 500 | 0.2851 | 0.3572 |
0.5506 | 0.34 | 1000 | 0.1595 | 0.2130 |
0.6569 | 0.5 | 1500 | 0.1458 | 0.2046 |
0.5997 | 0.67 | 2000 | 0.1374 | 0.1975 |
0.542 | 0.84 | 2500 | 0.1390 | 0.1956 |
0.4815 | 1.01 | 3000 | 0.1266 | 0.1813 |
0.6982 | 1.17 | 3500 | 0.1441 | 0.1965 |
0.4522 | 1.34 | 4000 | 0.1232 | 0.1822 |
0.4655 | 1.51 | 4500 | 0.1209 | 0.1702 |
0.4069 | 1.68 | 5000 | 0.1149 | 0.1688 |
0.4226 | 1.84 | 5500 | 0.1121 | 0.1560 |
0.3993 | 2.01 | 6000 | 0.1091 | 0.1557 |
0.406 | 2.18 | 6500 | 0.1115 | 0.1553 |
0.4098 | 2.35 | 7000 | 0.1144 | 0.1560 |
0.3995 | 2.51 | 7500 | 0.1028 | 0.1476 |
0.4101 | 2.68 | 8000 | 0.1129 | 0.1511 |
0.3636 | 2.85 | 8500 | 0.1025 | 0.1517 |
0.3534 | 3.02 | 9000 | 0.1068 | 0.1480 |
0.3836 | 3.18 | 9500 | 0.1072 | 0.1459 |
0.3531 | 3.35 | 10000 | 0.0928 | 0.1367 |
0.3649 | 3.52 | 10500 | 0.1042 | 0.1426 |
0.3645 | 3.69 | 11000 | 0.0979 | 0.1433 |
0.3685 | 3.85 | 11500 | 0.0947 | 0.1346 |
0.3325 | 4.02 | 12000 | 0.0991 | 0.1352 |
0.3497 | 4.19 | 12500 | 0.0919 | 0.1358 |
0.3303 | 4.36 | 13000 | 0.0888 | 0.1272 |
0.3323 | 4.52 | 13500 | 0.0888 | 0.1277 |
0.3452 | 4.69 | 14000 | 0.0894 | 0.1279 |
0.337 | 4.86 | 14500 | 0.0917 | 0.1289 |
0.3114 | 5.03 | 15000 | 0.0942 | 0.1313 |
0.3099 | 5.19 | 15500 | 0.0902 | 0.1239 |
0.3079 | 5.36 | 16000 | 0.0871 | 0.1256 |
0.3293 | 5.53 | 16500 | 0.0861 | 0.1263 |
0.3123 | 5.7 | 17000 | 0.0876 | 0.1203 |
0.3093 | 5.86 | 17500 | 0.0848 | 0.1226 |
0.2903 | 6.03 | 18000 | 0.0914 | 0.1221 |
0.297 | 6.2 | 18500 | 0.0841 | 0.1185 |
0.2797 | 6.37 | 19000 | 0.0858 | 0.1165 |
0.2878 | 6.53 | 19500 | 0.0874 | 0.1161 |
0.2974 | 6.7 | 20000 | 0.0835 | 0.1173 |
0.3051 | 6.87 | 20500 | 0.0835 | 0.1178 |
0.2941 | 7.04 | 21000 | 0.0852 | 0.1155 |
0.258 | 7.21 | 21500 | 0.0832 | 0.1132 |
0.2778 | 7.37 | 22000 | 0.0829 | 0.1110 |
0.2751 | 7.54 | 22500 | 0.0822 | 0.1069 |
0.2887 | 7.71 | 23000 | 0.0819 | 0.1103 |
0.2509 | 7.88 | 23500 | 0.0787 | 0.1055 |
0.2501 | 8.04 | 24000 | 0.0807 | 0.1076 |
0.2399 | 8.21 | 24500 | 0.0784 | 0.1052 |
0.2539 | 8.38 | 25000 | 0.0772 | 0.1075 |
0.248 | 8.55 | 25500 | 0.0772 | 0.1055 |
0.2689 | 8.71 | 26000 | 0.0763 | 0.1027 |
0.2855 | 8.88 | 26500 | 0.0756 | 0.1035 |
0.2421 | 9.05 | 27000 | 0.0771 | 0.0998 |
0.2497 | 9.22 | 27500 | 0.0756 | 0.0971 |
0.2367 | 9.38 | 28000 | 0.0741 | 0.0974 |
0.2473 | 9.55 | 28500 | 0.0739 | 0.0982 |
0.2396 | 9.72 | 29000 | 0.0756 | 0.0991 |
0.2602 | 9.89 | 29500 | 0.0737 | 0.0975 |
Framework versions
- Transformers 4.17.0.dev0
- Pytorch 1.10.2+cu102
- Datasets 1.18.3
- Tokenizers 0.11.0
Evaluation results
Evaluation was performed using the [Common Voice 7.0 Finnish test split](https://huggingface.co/datasets/mozilla-foundation/common_voice_7_0), [Common Voice 9.0 Finnish test split](https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0), and the FLEURS ASR Finnish test split.
The training data for this model includes the training splits of Common Voice 7.0, whereas the newer Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned and Finnish-NLP/wav2vec2-large-uralic-voxpopuli-v2-finnish models include Common Voice 9.0, so tests were run for both Common Voice versions. Note that Common Voice may not fully preserve the test split between dataset versions, so some training examples of Common Voice 9.0 might be in the test split of Common Voice 7.0 and vice versa. Test result comparisons between models trained with different Common Voice versions are therefore not entirely accurate, but they are still meaningful.
Common Voice 7.0 testing
To evaluate this model, run the eval.py script in this repository:
python3 eval.py --model_id Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2 --dataset mozilla-foundation/common_voice_7_0 --config fi --split test
This model (the fifth row of the table) achieves the following WER (Word Error Rate) and CER (Character Error Rate) results compared to other models and their parameter counts:
Model | Model parameters | WER (with LM) | WER (without LM) | CER (with LM) | CER (without LM) |
---|---|---|---|---|---|
Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned | 95 million | 5.85 | 13.52 | 1.35 | 2.44 |
Finnish-NLP/wav2vec2-large-uralic-voxpopuli-v2-finnish | 300 million | 4.13 | 9.66 | 0.90 | 1.66 |
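For readers unfamiliar with the metrics in this table, the sketch below shows how WER and CER can be computed for a single utterance from the Levenshtein edit distance. It is a plain-Python illustration with hypothetical function names, not the code used by eval.py, and it returns fractions rather than the percentages shown in the table.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `wer("kissa istuu matolla", "kissa istui matolla")` gives one error in three words, while the CER for the same pair is one error in 19 characters. Corpus-level scores are normally obtained by summing edits and reference lengths over all utterances before dividing, rather than by averaging per-utterance ratios.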
License
The model is licensed under the Apache-2.0 license.

