🚀 Wav2Vec2-base-fi-voxpopuli-v2 for Finnish ASR
This acoustic model is a fine-tuned version of facebook/wav2vec2-base-fi-voxpopuli-v2 for Finnish Automatic Speech Recognition (ASR), offering high-quality speech-to-text conversion.
✨ Features
- Fine-tuned for Finnish: Based on the pre-trained facebook/wav2vec2-base-fi-voxpopuli-v2, fine-tuned with 276.7 hours of Finnish transcribed speech data.
- Includes a language model: The repository also includes a Finnish KenLM language model for use in the decoding phase with the acoustic model.
📚 Documentation
Model description
Wav2vec2-base-fi-voxpopuli-v2 is a pretrained model by Facebook AI for Finnish speech. It is pretrained on 14.2k hours of unlabeled Finnish speech from the VoxPopuli V2 dataset using the wav2vec 2.0 objective. This model is a fine-tuned version of the pretrained model for Finnish ASR.
Intended uses & limitations
How to use
Check the run-finnish-asr-models.ipynb notebook in this repository for a detailed example of how to use this model.
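As a quick orientation before the notebook, here is a minimal sketch of loading the model with the 🤗 Transformers `pipeline` API. The audio file path is a placeholder, and LM-boosted decoding assumes `pyctcdecode` and `kenlm` are installed; the notebook remains the authoritative reference.

```python
MODEL_ID = "Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned"
SAMPLE_RATE = 16_000  # wav2vec 2.0 models expect 16 kHz mono audio

def build_asr_pipeline(model_id: str = MODEL_ID):
    """Create an ASR pipeline for this model. With pyctcdecode and
    kenlm installed, the repository's KenLM decoder is picked up
    automatically during decoding."""
    from transformers import pipeline  # imported lazily for this sketch
    return pipeline("automatic-speech-recognition", model=model_id)

# Usage (downloads the model on first run; "audio.wav" is a placeholder):
#   asr = build_asr_pipeline()
#   print(asr("audio.wav")["text"])
```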
Limitations and bias
- Audio length: This model was fine-tuned with audio samples of a maximum length of 20 seconds, so it likely works best for relatively short audio clips of similar length. You can still try it on longer recordings and judge the performance yourself. If you encounter out-of-memory errors with very long audio files, you can use the audio chunking method introduced in this blog post.
- Data domain: A large portion of the fine-tuning data came from the Finnish Parliament dataset, so the model may not generalize well to very different domains, such as everyday spoken Finnish with dialects.
- Gender bias: The datasets' audio tends to be dominated by adult male speakers, so the model may not perform as well on children's and women's speech.
- Language model: The Finnish KenLM language model used in the decoding phase was trained with text data from the audio transcriptions and a subset of Finnish Wikipedia. Thus, it may not generalize to very different language varieties, such as everyday spoken language with dialects. It may be beneficial to train your own KenLM language model for your specific domain.
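The chunking idea mentioned for long recordings can be sketched in plain Python. The 20-second window matches the fine-tuning limit above, but the 2-second overlap is an illustrative choice, not a value from this repository:

```python
def chunk_audio(audio, sample_rate=16_000, chunk_s=20.0, stride_s=2.0):
    """Split a waveform (any sliceable sequence of samples) into
    overlapping chunks of at most `chunk_s` seconds, so each piece
    stays within the ~20 s length the model was fine-tuned on.
    Consecutive chunks overlap by `stride_s` seconds so words cut at
    a chunk boundary appear whole in at least one chunk."""
    chunk_len = int(chunk_s * sample_rate)
    overlap = int(stride_s * sample_rate)
    step = chunk_len - overlap
    if len(audio) <= chunk_len:
        return [audio]
    return [audio[start:start + chunk_len]
            for start in range(0, len(audio) - overlap, step)]
```

Each chunk can then be transcribed separately and the texts stitched together, resolving the overlapping regions.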
Training data
This model was fine-tuned with 276.7 hours of Finnish transcribed speech data from the following datasets:
| Dataset | Hours | % of total hours |
|---|---|---|
| Common Voice 9.0 Finnish train + evaluation + other splits | 10.80 h | 3.90 % |
| Finnish parliament session 2 | 0.24 h | 0.09 % |
| VoxPopuli Finnish | 21.97 h | 7.94 % |
| CSS10 Finnish | 10.32 h | 3.73 % |
| Aalto Finnish Parliament ASR Corpus | 228.00 h | 82.40 % |
| Finnish Broadcast Corpus | 5.37 h | 1.94 % |
The datasets were filtered to include audio samples with a maximum length of 20 seconds.
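The length filter described above can be sketched as a simple predicate over clip durations. The `(num_samples, sample_rate)` pairs here are a hypothetical input format, not the repository's actual data layout:

```python
MAX_SECONDS = 20.0  # maximum clip length used for fine-tuning

def duration_s(num_samples: int, sample_rate: int) -> float:
    """Clip duration in seconds."""
    return num_samples / sample_rate

def filter_by_length(clips, max_s: float = MAX_SECONDS):
    """Keep only (num_samples, sample_rate) clips of at most `max_s` seconds."""
    return [c for c in clips if duration_s(*c) <= max_s]
```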
Training procedure
This model was trained on a Tesla V100 GPU, sponsored by Hugging Face & OVHcloud. The training script was provided by Hugging Face and is available here. We only modified its data loading for our custom datasets.
For the KenLM language model training, we followed the blog post tutorial provided by Hugging Face. The training data for the 5-gram KenLM were text transcriptions of the audio training data and 100k random samples of the cleaned Finnish Wikipedia (August 2021) dataset.
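A hedged sketch of the text-preparation step for such a 5-gram KenLM follows. The normalization rules and file names are illustrative, not the exact preprocessing used for this repository's language model:

```python
import re

def normalize_for_lm(line: str) -> str:
    """Lowercase and strip punctuation/digits so the LM training text
    matches a CTC acoustic model's output vocabulary (letters and spaces)."""
    line = line.lower()
    line = re.sub(r"[^a-zåäö ]", " ", line)  # keep Finnish letters and spaces
    return re.sub(r"\s+", " ", line).strip()

# The cleaned corpus would then be fed to KenLM's lmplz tool, e.g.:
#   lmplz -o 5 < corpus_clean.txt > 5gram.arpa
```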
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-04
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: 8-bit Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 10
- mixed_precision_training: Native AMP
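The linear schedule with 500 warmup steps can be sketched as a plain function. This is a simplified illustration of what the `linear` scheduler computes, assuming 15,000 total training steps as in the results table below:

```python
def linear_lr(step: int, base_lr: float = 1e-4,
              warmup_steps: int = 500, total_steps: int = 15_000) -> float:
    """Linear warmup from 0 to base_lr over `warmup_steps`, then
    linear decay back to 0 at `total_steps`."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```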
The pretrained facebook/wav2vec2-base-fi-voxpopuli-v2 model was initialized with the following hyperparameters:
- attention_dropout: 0.094
- hidden_dropout: 0.047
- feat_proj_dropout: 0.04
- mask_time_prob: 0.082
- layerdrop: 0.041
- activation_dropout: 0.055
- ctc_loss_reduction: "mean"
Training results
| Training Loss | Epoch | Step | Validation Loss | WER |
|---|---|---|---|---|
| 1.575 | 0.33 | 500 | 0.7454 | 0.7048 |
| 0.5838 | 0.66 | 1000 | 0.2377 | 0.2608 |
| 0.5692 | 1.0 | 1500 | 0.2014 | 0.2244 |
| 0.5112 | 1.33 | 2000 | 0.1885 | 0.2013 |
| 0.4857 | 1.66 | 2500 | 0.1881 | 0.2120 |
| 0.4821 | 1.99 | 3000 | 0.1603 | 0.1894 |
| 0.4531 | 2.32 | 3500 | 0.1594 | 0.1865 |
| 0.4411 | 2.65 | 4000 | 0.1641 | 0.1874 |
| 0.4437 | 2.99 | 4500 | 0.1545 | 0.1874 |
| 0.4191 | 3.32 | 5000 | 0.1565 | 0.1770 |
| 0.4158 | 3.65 | 5500 | 0.1696 | 0.1867 |
| 0.4032 | 3.98 | 6000 | 0.1561 | 0.1746 |
| 0.4003 | 4.31 | 6500 | 0.1432 | 0.1749 |
| 0.4059 | 4.64 | 7000 | 0.1390 | 0.1690 |
| 0.4019 | 4.98 | 7500 | 0.1291 | 0.1646 |
| 0.3811 | 5.31 | 8000 | 0.1485 | 0.1755 |
| 0.3955 | 5.64 | 8500 | 0.1351 | 0.1659 |
| 0.3562 | 5.97 | 9000 | 0.1328 | 0.1614 |
| 0.3646 | 6.3 | 9500 | 0.1329 | 0.1584 |
| 0.351 | 6.64 | 10000 | 0.1342 | 0.1554 |
| 0.3408 | 6.97 | 10500 | 0.1422 | 0.1509 |
| 0.3562 | 7.3 | 11000 | 0.1309 | 0.1528 |
| 0.3335 | 7.63 | 11500 | 0.1305 | 0.1506 |
| 0.3491 | 7.96 | 12000 | 0.1365 | 0.1560 |
| 0.3538 | 8.29 | 12500 | 0.1293 | 0.1512 |
| 0.3338 | 8.63 | 13000 | 0.1328 | 0.1511 |
| 0.3509 | 8.96 | 13500 | 0.1304 | 0.1520 |
| 0.3431 | 9.29 | 14000 | 0.1360 | 0.1517 |
| 0.3309 | 9.62 | 14500 | 0.1328 | 0.1514 |
| 0.3252 | 9.95 | 15000 | 0.1316 | 0.1498 |
Framework versions
- Transformers 4.19.1
- Pytorch 1.11.0+cu102
- Datasets 2.2.1
- Tokenizers 0.11.0
Evaluation results
Evaluation was done with the Common Voice 7.0 Finnish test split, Common Voice 9.0 Finnish test split, and the FLEURS ASR Finnish test split.
Note that the training data of this model includes the training splits of Common Voice 9.0, while most of our previous models include Common Voice 7.0. So, we ran tests for both versions. However, Common Voice doesn't seem to fully preserve the test split between dataset versions, so there may be some overlap between the training and test splits of different versions. Thus, the test result comparisons between models trained with different Common Voice versions are not entirely accurate but still meaningful.
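As background for the tables below, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words (CER is the same computed over characters). A minimal sketch, not the exact implementation in eval.py:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: minimum substitutions + insertions + deletions
    to turn `hypothesis` into `reference`, divided by the number of
    reference words (classic dynamic-programming edit distance)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```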
Common Voice 7.0 testing
To evaluate this model, run the eval.py script in this repository:
python3 eval.py --model_id Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned --dataset mozilla-foundation/common_voice_7_0 --config fi --split test
This model (the first row of the table) achieves the following WER (Word Error Rate) and CER (Character Error Rate) results compared to our other models and their parameter counts:
| Model | Model parameters | WER (with LM) | WER (without LM) | CER (with LM) | CER (without LM) |
|---|---|---|---|---|---|
| Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned | 95 million | 5.85 | 13.52 | 1.35 | 2.44 |
| Finnish-NLP/wav2vec2-large-uralic-voxpopuli-v2-finnish | 300 million | 4.13 | 9.66 | 0.90 | 1.66 |
| Finnish-NLP/wav2vec2-xlsr-300m-finnish-lm | 300 million | 8.16 | 17.92 | 1.97 | 3.36 |
| Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm | 1000 million | 5.65 | 13.11 | 1.20 | 2.23 |
| Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2 | 1000 million | 4.09 | 9.73 | 0.88 | 1.65 |
Common Voice 9.0 testing
To evaluate this model, run the eval.py script in this repository:
python3 eval.py --model_id Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned --dataset mozilla-foundation/common_voice_9_0 --config fi --split test
This model (the first row of the table) achieves the following WER (Word Error Rate) and CER (Character Error Rate) results compared to our other models and their parameter counts:
| Model | Model parameters | WER (with LM) | WER (without LM) | CER (with LM) | CER (without LM) |
|---|---|---|---|---|---|
| Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned | 95 million | 5.93 | 14.08 | 1.40 | 2.59 |
| Finnish-NLP/wav2vec2-large-uralic-voxpopuli-v2-finnish | 300 million | 4.13 | 9.83 | 0.92 | 1.71 |
| Finnish-NLP/wav2vec2-xlsr-300m-finnish-lm | 300 million | 7.42 | 16.45 | 1.79 | 3.07 |
| Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm | 1000 million | 5.35 | 13.00 | 1.14 | 2.20 |
| Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2 | 1000 million | 3.72 | 8.96 | 0.80 | 1.52 |
FLEURS ASR testing
To evaluate this model, run the eval.py script in this repository:
python3 eval.py --model_id Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned --dataset google/fleurs --config fi_fi --split test
This model (the first row of the table) achieves the following WER (Word Error Rate) and CER (Character Error Rate) results compared to our other models and their parameter counts:
| Model | Model parameters | WER (with LM) | WER (without LM) | CER (with LM) | CER (without LM) |
|---|---|---|---|---|---|
| Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned | 95 million | 13.99 | ... | 6.07 | ... |
| ... | ... | ... | ... | ... | ... |
📄 License
This project is licensed under the Apache 2.0 license.

