🚀 Wav2Vec2 XLS-R for Finnish ASR
This acoustic model is a fine-tuned version of facebook/wav2vec2-xls-r-1b for Finnish Automatic Speech Recognition (ASR). It has been fine-tuned using 259.57 hours of Finnish transcribed speech data. Wav2Vec2 XLS-R was introduced in this paper and first released at this page.
Note: There is a version with a KenLM language model used in the decoding phase, which produces better transcriptions: Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm.
Note: There is a better V2 version of this model, which has been fine-tuned for a longer time with 16 more hours of data: Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2.
✨ Features
- Fine-tuned from facebook/wav2vec2-xls-r-1b for Finnish ASR.
- Trained with 259.57 hours of Finnish transcribed speech data.
- Alternative versions are available: models decoded with a KenLM language model and a further fine-tuned V2 model.
📚 Documentation
Model description
Wav2Vec2 XLS-R is a large-scale multilingual pretrained speech model developed by Facebook AI. It is pretrained on 436k hours of unlabeled speech, including data from VoxPopuli, MLS, CommonVoice, BABEL, and VoxLingua107. It uses the wav2vec 2.0 objective and covers 128 languages.
You can read more about the pretrained model in this blog post and this paper.
This model is a fine-tuned version of the pretrained model (the 1 billion parameter variant) for Finnish ASR.
Intended uses & limitations
How to use
Check the run-finnish-asr-models.ipynb notebook in this repository for a detailed example of how to use this model.
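For a quick start, here is a minimal inference sketch (separate from the notebook) that loads the model with 🤗 Transformers and transcribes one audio file; the file path is a placeholder:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "aapot/wav2vec2-xlsr-1b-finnish"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Load the audio and resample it to the 16 kHz rate the model expects.
speech, sample_rate = torchaudio.load("audio.wav")  # placeholder path
speech = torchaudio.functional.resample(speech.squeeze(0), sample_rate, 16_000)

inputs = processor(speech.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding (no language model).
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```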
Limitations and bias
This model was fine-tuned with audio samples of at most 20 seconds, so it most likely works best for relatively short audio clips of similar length. However, you can also try it on much longer audio and see how it performs. If you encounter out-of-memory errors with very long audio files, you can use the audio chunking method introduced in [this blog post](https://huggingface.co/blog/asr-chunking).
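As a sketch of that chunking approach, the automatic-speech-recognition pipeline can split long files into overlapping chunks; the chunk and stride lengths below are illustrative values, not tuned recommendations:

```python
from transformers import pipeline

# The pipeline handles chunking and stitching of long audio for CTC models.
asr = pipeline("automatic-speech-recognition", model="aapot/wav2vec2-xlsr-1b-finnish")

# chunk_length_s / stride_length_s control the sliding window (illustrative values).
output = asr("very_long_audio.wav", chunk_length_s=10, stride_length_s=(4, 2))
print(output["text"])
```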
The vast majority of the fine-tuning data comes from the Finnish Parliament dataset, so this model may not generalize well to very different domains, such as everyday spoken Finnish with dialects. In addition, the audio in these datasets tends to be dominated by adult male speakers, so this model may not work as well for the speech of children and women, for example.
Training data
This model was fine-tuned with 259.57 hours of Finnish transcribed speech data from the following datasets:
| Dataset | Hours | % of total hours |
|:---|:---:|:---:|
| [Common Voice 7.0 Finnish train + evaluation + other splits](https://huggingface.co/datasets/mozilla-foundation/common_voice_7_0) | 9.70 h | 3.74 % |
| Finnish parliament session 2 | 0.24 h | 0.09 % |
| VoxPopuli Finnish | 5.94 h | 2.29 % |
| CSS10 Finnish | 10.32 h | 3.98 % |
| [Aalto Finnish Parliament ASR Corpus](http://urn.fi/urn:nbn:fi:lb-2021051903) | 228.00 h | 87.84 % |
| [Finnish Broadcast Corpus](http://urn.fi/urn:nbn:fi:lb-2016042502) | 5.37 h | 2.07 % |
The datasets were filtered to include audio samples with a maximum length of 20 seconds.
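A minimal sketch of such a duration filter with the 🤗 Datasets library, assuming an "audio" column and Common Voice as the example dataset (the exact training code may differ):

```python
from datasets import load_dataset, Audio

dataset = load_dataset("mozilla-foundation/common_voice_7_0", "fi", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

MAX_SECONDS = 20

def is_short_enough(example):
    # Duration = number of samples / sampling rate.
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] <= MAX_SECONDS

dataset = dataset.filter(is_short_enough)
```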
Training procedure
This model was trained during the [Robust Speech Challenge Event](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614) organized by Hugging Face. The training was conducted on a Tesla V100 GPU, sponsored by OVHcloud.
The training script was provided by Hugging Face and is available [here](https://github.com/huggingface/transformers/blob/main/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_bnb.py). We only modified its data loading for our custom datasets.
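For example, the Common Voice portion combines several splits into one training set, as the data table above indicates; a sketch of that with 🤗 Datasets (access to the gated dataset is assumed, and "evaluation" is taken to mean the validation split):

```python
from datasets import load_dataset, concatenate_datasets

# Combine the train, validation, and "other" splits into one training set.
splits = [
    load_dataset("mozilla-foundation/common_voice_7_0", "fi", split=s)
    for s in ("train", "validation", "other")
]
train_dataset = concatenate_datasets(splits)
```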
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- optimizer: 8-bit Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 5
- mixed_precision_training: Native AMP
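A reconstruction of how these settings could map onto 🤗 TrainingArguments; the output directory is a placeholder, and the 8-bit Adam flag assumes a recent transformers version with bitsandbytes installed (the released model was trained with the script linked above, not this exact snippet):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-xlsr-1b-finnish",  # placeholder
    learning_rate=5e-05,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_bnb_8bit",        # 8-bit Adam (bitsandbytes)
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=5,
    fp16=True,                     # native AMP mixed precision
)
```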
The pretrained facebook/wav2vec2-xls-r-1b model was initialized with the following hyperparameters:
- attention_dropout: 0.094
- hidden_dropout: 0.047
- feat_proj_dropout: 0.04
- mask_time_prob: 0.082
- layerdrop: 0.041
- activation_dropout: 0.055
- ctc_loss_reduction: "mean"
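A sketch of how such values can be applied when loading the pretrained checkpoint; keyword arguments on from_pretrained override the matching fields of the model config:

```python
from transformers import Wav2Vec2ForCTC

# Keyword arguments override the corresponding Wav2Vec2Config fields.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-1b",
    attention_dropout=0.094,
    hidden_dropout=0.047,
    feat_proj_dropout=0.04,
    mask_time_prob=0.082,
    layerdrop=0.041,
    activation_dropout=0.055,
    ctc_loss_reduction="mean",
)
```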
Training results
| Training Loss | Epoch | Step | Validation Loss | WER |
|:---:|:---:|:---:|:---:|:---:|
| 0.968 | 0.18 | 500 | 0.4870 | 0.4720 |
| 0.6557 | 0.36 | 1000 | 0.2450 | 0.2931 |
| 0.647 | 0.54 | 1500 | 0.1818 | 0.2255 |
| 0.5297 | 0.72 | 2000 | 0.1698 | 0.2354 |
| 0.5802 | 0.9 | 2500 | 0.1581 | 0.2355 |
| 0.6351 | 1.07 | 3000 | 0.1689 | 0.2336 |
| 0.4626 | 1.25 | 3500 | 0.1719 | 0.3099 |
| 0.4526 | 1.43 | 4000 | 0.1434 | 0.2069 |
| 0.4692 | 1.61 | 4500 | 0.1645 | 0.2192 |
| 0.4584 | 1.79 | 5000 | 0.1483 | 0.1987 |
| 0.4234 | 1.97 | 5500 | 0.1499 | 0.2178 |
| 0.4243 | 2.15 | 6000 | 0.1345 | 0.2070 |
| 0.4108 | 2.33 | 6500 | 0.1383 | 0.1850 |
| 0.4048 | 2.51 | 7000 | 0.1338 | 0.1811 |
| 0.4085 | 2.69 | 7500 | 0.1290 | 0.1780 |
| 0.4026 | 2.87 | 8000 | 0.1239 | 0.1650 |
| 0.4033 | 3.04 | 8500 | 0.1346 | 0.1657 |
| 0.3986 | 3.22 | 9000 | 0.1310 | 0.1850 |
| 0.3867 | 3.4 | 9500 | 0.1273 | 0.1741 |
| 0.3658 | 3.58 | 10000 | 0.1219 | 0.1672 |
| 0.382 | 3.76 | 10500 | 0.1306 | 0.1698 |
| 0.3847 | 3.94 | 11000 | 0.1230 | 0.1577 |
| 0.3691 | 4.12 | 11500 | 0.1310 | 0.1615 |
| 0.3593 | 4.3 | 12000 | 0.1296 | 0.1622 |
| 0.3619 | 4.48 | 12500 | 0.1285 | 0.1601 |
| 0.3361 | 4.66 | 13000 | 0.1261 | 0.1569 |
| 0.3603 | 4.84 | 13500 | 0.1235 | 0.1533 |
Framework versions
- Transformers 4.17.0.dev0
- Pytorch 1.10.2+cu102
- Datasets 1.18.3
- Tokenizers 0.11.0
Evaluation results
Evaluation was conducted using the [Common Voice 7.0 Finnish test split](https://huggingface.co/datasets/mozilla-foundation/common_voice_7_0).
To evaluate this model, run the eval.py script in this repository:

```bash
python3 eval.py --model_id aapot/wav2vec2-xlsr-1b-finnish --dataset mozilla-foundation/common_voice_7_0 --config fi --split test
```
This model (the second row of the table below) achieves the following WER (Word Error Rate) and CER (Character Error Rate) results compared to our other models:
| Model | WER (with LM) | WER (without LM) | CER (with LM) | CER (without LM) |
|:---|:---:|:---:|:---:|:---:|
| aapot/wav2vec2-xlsr-1b-finnish-lm-v2 | 4.09 | 9.73 | 0.88 | 1.65 |
| aapot/wav2vec2-xlsr-1b-finnish-lm | 5.65 | 13.11 | 1.20 | 2.23 |
| aapot/wav2vec2-xlsr-300m-finnish-lm | 8.16 | 17.92 | 1.97 | 3.36 |
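The eval.py script produces these numbers; as a minimal sketch of the underlying metric computation, the 🤗 Evaluate library can score WER and CER directly (the example strings are placeholders):

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

predictions = ["moi maailma"]   # placeholder model outputs
references = ["moi maailma!"]   # placeholder ground-truth transcripts

# Both metrics return a fraction; multiply by 100 for percentages.
print("WER:", 100 * wer_metric.compute(predictions=predictions, references=references))
print("CER:", 100 * cer_metric.compute(predictions=predictions, references=references))
```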
👥 Team Members
Feel free to contact us for more details 🤗
📄 License
This model is licensed under the Apache 2.0 license.