Wav2vec2-xls-r-1b for Finnish ASR
This acoustic model is a fine-tuned version of facebook/wav2vec2-xls-r-1b for Finnish Automatic Speech Recognition (ASR). It was fine-tuned on 259.57 hours of transcribed Finnish speech. Wav2Vec2 XLS-R was introduced in this paper and first released at this page.
This repository also includes a Finnish KenLM language model used in the decoding phase with the acoustic model.
Note: this model is identical to the aapot/wav2vec2-xlsr-1b-finnish-lm model; it has only been copied to the Finnish-NLP Hugging Face organization.
Note: a better V2 version of this model, Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2, is available. It was fine-tuned for longer and with an additional 16 hours of data.
Quick Start
This model is designed for Finnish ASR. To see a detailed example of how to use it, check the run-finnish-asr-models.ipynb notebook in this repository.
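As a rough starting point, the sketch below transcribes a single file with the Transformers `pipeline` API. It is not taken from the notebook. It assumes that `transformers`, `torch`, `pyctcdecode`, and `kenlm` are installed (the last two are needed for decoding with the bundled language model), that ffmpeg is available for reading the file, and that `sample.wav` is a placeholder name for your own recording.

```python
MODEL_ID = "Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm"


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file and return the recognized text."""
    # Imported lazily so this module loads even without the heavy dependencies.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    # The pipeline returns a dict; the transcription is stored under "text".
    return asr(audio_path)["text"]


# Example call (downloads the model on first use):
#   print(transcribe("sample.wav"))
```

Building the pipeline once and reusing it for many files avoids reloading the 1-billion-parameter checkpoint on every call.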
Features
- Fine-tuned for Finnish: Specifically optimized for Finnish ASR with a large amount of Finnish transcribed speech data.
- Included Language Model: Comes with a Finnish KenLM language model for the decoding phase.
Usage Examples
Basic Usage
To evaluate this model on the Common Voice 7.0 dataset, run the following command:
python3 eval.py --model_id Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm --dataset mozilla-foundation/common_voice_7_0 --config fi --split test
Documentation
Model Description
Wav2Vec2 XLS-R is a large-scale multilingual pretrained model for speech developed by Facebook AI. It is pretrained on 436k hours of unlabeled speech from various sources such as VoxPopuli, MLS, CommonVoice, BABEL, and VoxLingua107. It uses the wav2vec 2.0 objective across 128 languages.
This model is a fine-tuned version of the pretrained model (1 billion parameter variant) for Finnish ASR. You can read more about the pretrained model from this blog and this paper.
Intended Uses & Limitations
Intended Use
You can use this model for Finnish ASR (speech-to-text) tasks.
Limitations and Bias
- Audio Length: This model was fine-tuned on audio samples of at most 20 seconds. It likely performs best on audio of similar length, but you can also try it on longer recordings. If you run into out-of-memory errors with very long audio files, you can use the audio chunking method introduced in [this blog post](https://huggingface.co/blog/asr-chunking).
- Data Domain: A large portion of the fine-tuning data came from the Finnish Parliament dataset, so the model may not generalize well to other domains, such as everyday spoken Finnish with dialects.
- Gender and Age Bias: Most of the audio in the datasets comes from adult male speakers, so the model may perform worse on speech from children and women.
- Language Model Generalization: The Finnish KenLM language model used in decoding was trained on the transcriptions of the audio training data. It may not generalize well to other language styles, such as everyday spoken language with dialects. Training your own KenLM language model on text from your domain may help.
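The chunking idea mentioned under Audio Length can be illustrated without a model: the recording is cut into overlapping windows so that each one stays close to the 20-second training length. The sketch below only computes window boundaries in samples. The 16 kHz sample rate and the 2-second stride are assumptions chosen for illustration, and `chunk_bounds` is a hypothetical helper, not part of any library. In Transformers, the ASR pipeline exposes the same idea through its `chunk_length_s` and `stride_length_s` arguments.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000  # assumed input sample rate


def chunk_bounds(total_samples: int, chunk_len: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping chunks.

    Every chunk carries `stride` samples of extra context on both sides,
    so consecutive chunks start `chunk_len - 2 * stride` samples apart.
    """
    if chunk_len <= 2 * stride:
        raise ValueError("chunk_len must be larger than twice the stride")
    if total_samples <= 0:
        return []
    step = chunk_len - 2 * stride
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_len, total_samples)
        bounds.append((start, end))
        if end >= total_samples:
            break
        start += step
    return bounds


# A 60-second recording cut into 20-second chunks with 2 seconds of stride per side.
windows = chunk_bounds(60 * SAMPLE_RATE, 20 * SAMPLE_RATE, 2 * SAMPLE_RATE)
for start, end in windows:
    print(f"{start / SAMPLE_RATE:5.1f} s -> {end / SAMPLE_RATE:5.1f} s")
```

Each window can then be transcribed separately, with the overlapping context discarded when the pieces are joined.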
Training Data
This model was fine-tuned with 259.57 hours of Finnish transcribed speech data from the following datasets:
Dataset | Hours | % of total hours |
---|---|---|
Common Voice 7.0 Finnish train + evaluation + other splits | 9.70 h | 3.74 % |
Finnish parliament session 2 | 0.24 h | 0.09 % |
VoxPopuli Finnish | 5.94 h | 2.29 % |
CSS10 Finnish | 10.32 h | 3.98 % |
Aalto Finnish Parliament ASR Corpus | 228.00 h | 87.84 % |
Finnish Broadcast Corpus | 5.37 h | 2.07 % |
The datasets were filtered to include audio samples with a maximum length of 20 seconds.
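The percentages in the table follow directly from the hour counts. A small check, using only the figures from the table, makes the dominance of the parliament corpus explicit:

```python
# Hours per dataset, copied from the table above.
hours = {
    "Common Voice 7.0 Finnish train + evaluation + other splits": 9.70,
    "Finnish parliament session 2": 0.24,
    "VoxPopuli Finnish": 5.94,
    "CSS10 Finnish": 10.32,
    "Aalto Finnish Parliament ASR Corpus": 228.00,
    "Finnish Broadcast Corpus": 5.37,
}

total_hours = sum(hours.values())
# Share of each dataset in percent of the total training hours.
shares = {name: 100 * h / total_hours for name, h in hours.items()}

print(f"total: {total_hours:.2f} h")
for name, share in shares.items():
    print(f"{name}: {share:.2f} %")
```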
Training Procedure
This model was trained during the [Robust Speech Challenge Event](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614) organized by Hugging Face. The training was conducted on a Tesla V100 GPU sponsored by OVHcloud.
The training script was provided by Hugging Face and is available [here](https://github.com/huggingface/transformers/blob/main/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_bnb.py). Only the data loading was modified for custom datasets.
For the KenLM language model training, the [blog post tutorial](https://huggingface.co/blog/wav2vec2-with-ngram) provided by Hugging Face was followed. The training data for the 5-gram KenLM was the text transcriptions of the audio training data.
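KenLM trains on plain text with one sentence per line. The sketch below shows one way to prepare such a file from transcripts. The normalization (lowercasing, dropping punctuation) and the helper names are assumptions for illustration and are not the preprocessing used for this model.

```python
import re
from pathlib import Path
from typing import Iterable

# Keep word characters (including Finnish å, ä, ö), whitespace, and apostrophes.
_DROP = re.compile(r"[^\w\s']")


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed normalization)."""
    text = _DROP.sub(" ", text.lower())
    return " ".join(text.split())


def write_corpus(transcripts: Iterable[str], path: str) -> int:
    """Write one normalized transcript per line; return the number of lines written."""
    lines = [n for n in (normalize(t) for t in transcripts) if n]
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)


# With KenLM built locally (the path is hypothetical), a 5-gram model is then estimated with:
#   kenlm/build/bin/lmplz -o 5 < text.txt > 5gram.arpa
```

Whatever normalization is chosen, it should match the character set of the acoustic model's vocabulary, otherwise the language model will score tokens the decoder never produces.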
Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- optimizer: 8-bit Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 5
- mixed_precision_training: Native AMP
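With a linear scheduler and 500 warmup steps, the learning rate rises linearly from 0 to 5e-05 and then decays linearly to 0 by the end of training. The sketch below reproduces that shape in plain Python. The total step count is not part of the list above, so `TOTAL_STEPS = 14_000` is a placeholder assumption (the results table logs checkpoints up to step 13500).

```python
PEAK_LR = 5e-05
WARMUP_STEPS = 500
TOTAL_STEPS = 14_000  # assumption: the exact value is not stated in this README


def linear_lr(step: int, peak_lr: float = PEAK_LR,
              warmup: int = WARMUP_STEPS, total: int = TOTAL_STEPS) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    return peak_lr * max(0.0, (total - step) / max(1, total - warmup))


for step in (0, 250, 500, 7_250, 14_000):
    print(step, f"{linear_lr(step):.2e}")
```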
The pretrained facebook/wav2vec2-xls-r-1b model was initialized with the following hyperparameters:
- attention_dropout: 0.094
- hidden_dropout: 0.047
- feat_proj_dropout: 0.04
- mask_time_prob: 0.082
- layerdrop: 0.041
- activation_dropout: 0.055
- ctc_loss_reduction: "mean"
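These names correspond to fields of the Wav2Vec2 configuration in Transformers. The sketch below collects them in a dictionary and shows one way they could be applied when loading the pretrained checkpoint. `load_pretrained_for_ctc` is a hypothetical helper for illustration, not the challenge training script, and a real CTC fine-tuning run also needs tokenizer-dependent settings such as `vocab_size` and `pad_token_id`, which are left to the caller here.

```python
CONFIG_OVERRIDES = {
    "attention_dropout": 0.094,
    "hidden_dropout": 0.047,
    "feat_proj_dropout": 0.04,
    "mask_time_prob": 0.082,
    "layerdrop": 0.041,
    "activation_dropout": 0.055,
    "ctc_loss_reduction": "mean",
}


def load_pretrained_for_ctc(model_id: str = "facebook/wav2vec2-xls-r-1b", **extra):
    """Load the pretrained checkpoint with the overrides above (illustrative helper)."""
    # Imported lazily so the dictionary can be inspected without Transformers installed.
    from transformers import AutoConfig, Wav2Vec2ForCTC

    config = AutoConfig.from_pretrained(model_id)
    # `extra` can carry tokenizer-dependent fields such as vocab_size and pad_token_id.
    config.update({**CONFIG_OVERRIDES, **extra})
    return Wav2Vec2ForCTC.from_pretrained(model_id, config=config)
```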
Training Results
Training Loss | Epoch | Step | Validation Loss | Wer |
---|---|---|---|---|
0.968 | 0.18 | 500 | 0.4870 | 0.4720 |
0.6557 | 0.36 | 1000 | 0.2450 | 0.2931 |
0.647 | 0.54 | 1500 | 0.1818 | 0.2255 |
0.5297 | 0.72 | 2000 | 0.1698 | 0.2354 |
0.5802 | 0.9 | 2500 | 0.1581 | 0.2355 |
0.6351 | 1.07 | 3000 | 0.1689 | 0.2336 |
0.4626 | 1.25 | 3500 | 0.1719 | 0.3099 |
0.4526 | 1.43 | 4000 | 0.1434 | 0.2069 |
0.4692 | 1.61 | 4500 | 0.1645 | 0.2192 |
0.4584 | 1.79 | 5000 | 0.1483 | 0.1987 |
0.4234 | 1.97 | 5500 | 0.1499 | 0.2178 |
0.4243 | 2.15 | 6000 | 0.1345 | 0.2070 |
0.4108 | 2.33 | 6500 | 0.1383 | 0.1850 |
0.4048 | 2.51 | 7000 | 0.1338 | 0.1811 |
0.4085 | 2.69 | 7500 | 0.1290 | 0.1780 |
0.4026 | 2.87 | 8000 | 0.1239 | 0.1650 |
0.4033 | 3.04 | 8500 | 0.1346 | 0.1657 |
0.3986 | 3.22 | 9000 | 0.1310 | 0.1850 |
0.3867 | 3.4 | 9500 | 0.1273 | 0.1741 |
0.3658 | 3.58 | 10000 | 0.1219 | 0.1672 |
0.382 | 3.76 | 10500 | 0.1306 | 0.1698 |
0.3847 | 3.94 | 11000 | 0.1230 | 0.1577 |
0.3691 | 4.12 | 11500 | 0.1310 | 0.1615 |
0.3593 | 4.3 | 12000 | 0.1296 | 0.1622 |
0.3619 | 4.48 | 12500 | 0.1285 | 0.1601 |
0.3361 | 4.66 | 13000 | 0.1261 | 0.1569 |
0.3603 | 4.84 | 13500 | 0.1235 | 0.1533 |
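The lowest validation loss and the lowest WER in this table do not fall on the same checkpoint, which matters when choosing which one to keep. The snippet below shows the comparison on a few rows copied from the table; the selection of rows is only an excerpt, but it includes both minima of the full table.

```python
# (step, validation_loss, wer) for a few rows copied from the table above.
rows = [
    (500, 0.4870, 0.4720),
    (3500, 0.1719, 0.3099),
    (8000, 0.1239, 0.1650),
    (10000, 0.1219, 0.1672),
    (13500, 0.1235, 0.1533),
]

best_by_loss = min(rows, key=lambda r: r[1])
best_by_wer = min(rows, key=lambda r: r[2])

print(f"lowest validation loss: step {best_by_loss[0]} (loss {best_by_loss[1]:.4f})")
print(f"lowest WER:             step {best_by_wer[0]} (WER {best_by_wer[2]:.4f})")
```

Since WER is the metric reported in the evaluation sections, selecting by WER is the more direct criterion for ASR use.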
Framework Versions
- Transformers 4.17.0.dev0
- Pytorch 1.10.2+cu102
- Datasets 1.18.3
- Tokenizers 0.11.0
Evaluation Results
Evaluation was performed on the Common Voice 7.0 Finnish test split, Common Voice 9.0 Finnish test split, and the FLEURS ASR Finnish test split.
This model's training data includes the training splits of Common Voice 7.0, while the newer models Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned and Finnish-NLP/wav2vec2-large-uralic-voxpopuli-v2-finnish include Common Voice 9.0. Tests were run for both Common Voice versions. Note that Common Voice may not fully preserve the test split between dataset versions, so comparisons between models trained with different Common Voice versions are not entirely accurate but still meaningful.
Common Voice 7.0 Testing
To evaluate this model, run the eval.py script in this repository:
python3 eval.py --model_id Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm --dataset mozilla-foundation/common_voice_7_0 --config fi --split test
This model (the fourth row of the table) achieves the following WER (Word Error Rate) and CER (Character Error Rate) results compared to other models and their parameter counts:
Model | Model parameters | WER (with LM) | WER (without LM) | CER (with LM) | CER (without LM) |
---|---|---|---|---|---|
Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned | 95 million | 5.85 | 13.52 | 1.35 | 2.44 |
Finnish-NLP/wav2vec2-large-uralic-voxpopuli-v2-finnish | 300 million | 4.13 | 9.66 | 0.90 | 1.66 |
Finnish-NLP/wav2vec2-xlsr-300m-finnish-lm | 300 million | 8.16 | 17.92 | 1.97 | 3.36 |
Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm | 1000 million | 5.65 | 13.11 | 1.20 | 2.23 |
Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2 | 1000 million | 4.09 | 9.73 | 0.88 | 1.65 |
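For readers unfamiliar with the metrics, WER and CER are both edit distances normalized by the reference length, counted over words and characters respectively. The sketch below is a minimal pure-Python version. It assumes the table values are percentages, and it is not the implementation used by eval.py, whose text normalization may differ.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent (spaces count as characters here)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100 * edit_distance(reference, hypothesis) / len(reference)


print(wer("hyvää huomenta suomi", "hyvää huomenta suomen"))
print(cer("hyvää huomenta suomi", "hyvää huomenta suomen"))
```

In this example one of three words is wrong, while only two of twenty characters differ, which is why CER figures are typically much lower than WER figures for the same model.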
Common Voice 9.0 Testing
To evaluate this model, run the eval.py script in this repository:
python3 eval.py --model_id Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm --dataset mozilla-foundation/common_voice_9_0 --config fi --split test
This model (the fourth row of the table) achieves the following WER (Word Error Rate) and CER (Character Error Rate) results compared to other models and their parameter counts:
Model | Model parameters | WER (with LM) | WER (without LM) | CER (with LM) | CER (without LM) |
---|---|---|---|---|---|
Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned | 95 million | 5.93 | 14.08 | 1.40 | 2.59 |
Finnish-NLP/wav2vec2-large-uralic-voxpopuli-v2-finnish | 300 million | 4.13 | 9.83 | 0.92 | 1.71 |
Finnish-NLP/wav2vec2-xlsr-300m-finnish-lm | 300 million | 7.42 | 16.45 | 1.79 | 3.07 |
Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm | 1000 million | 5.35 | 13.00 | 1.14 | 2.20 |
Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2 | 1000 million | 3.72 | 8.96 | 0.80 | 1.52 |
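The tables also show how much the KenLM language model contributes. The snippet below computes the relative error reduction for this model from the with-LM and without-LM columns of both Common Voice tables; it uses only the figures reported above.

```python
def relative_reduction(without_lm: float, with_lm: float) -> float:
    """Relative error reduction in percent when the language model is added."""
    return 100 * (without_lm - with_lm) / without_lm


# (without LM, with LM) for Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm
results = {
    "Common Voice 7.0": {"WER": (13.11, 5.65), "CER": (2.23, 1.20)},
    "Common Voice 9.0": {"WER": (13.00, 5.35), "CER": (2.20, 1.14)},
}

for dataset, metrics in results.items():
    for metric, (without_lm, with_lm) in metrics.items():
        gain = relative_reduction(without_lm, with_lm)
        print(f"{dataset} {metric}: {gain:.1f} % relative reduction")
```

On both test sets the language model removes more than half of the word errors, so decoding without it gives a noticeably weaker system.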
FLEURS ASR Testing
The evaluation command for the FLEURS ASR dataset was not fully provided in the original README.
Technical Details
Model Architecture
The model is based on the Wav2Vec2 XLS-R architecture, a multilingual speech model. Fine-tuning adapts this general-purpose architecture to Finnish ASR.
Training Process
Fine-tuning used 259.57 hours of transcribed Finnish speech and ran on a Tesla V100 GPU with the hyperparameters listed under Training Hyperparameters. The KenLM language model was trained separately on the text transcriptions of the audio training data.
License
This model is licensed under the Apache 2.0 license.

