🚀 Wav2Vec2-base-fi-voxpopuli-v2 for Finnish ASR
This acoustic model is a fine-tuned version of facebook/wav2vec2-base-fi-voxpopuli-v2 for Finnish Automatic Speech Recognition (ASR), offering high-quality speech-to-text conversion.
✨ Features
- Fine-tuned for Finnish: Based on the pre-trained facebook/wav2vec2-base-fi-voxpopuli-v2, fine-tuned with 276.7 hours of Finnish transcribed speech data.
- Includes a language model: The repository also includes a Finnish KenLM language model for use in the decoding phase with the acoustic model.
📚 Documentation
Model description
Wav2vec2-base-fi-voxpopuli-v2 is a pretrained model by Facebook AI for Finnish speech. It is pretrained on 14.2k hours of unlabeled Finnish speech from the VoxPopuli V2 dataset using the wav2vec 2.0 objective. This model is a fine-tuned version of the pretrained model for Finnish ASR.
Intended uses & limitations
How to use
Check the run-finnish-asr-models.ipynb notebook in this repository for a detailed example of how to use this model.
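As a quick orientation before the notebook, here is a minimal sketch of loading the model with the 🤗 Transformers `pipeline` API. The audio file path is a placeholder, and LM-boosted decoding assumes `pyctcdecode` and `kenlm` are installed; the notebook remains the authoritative reference.

```python
MODEL_ID = "Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned"
SAMPLE_RATE = 16_000  # wav2vec 2.0 models expect 16 kHz mono audio

def build_asr_pipeline(model_id: str = MODEL_ID):
    """Create an ASR pipeline for this model. With pyctcdecode and
    kenlm installed, the repository's KenLM decoder is picked up
    automatically during decoding."""
    from transformers import pipeline  # imported lazily for this sketch
    return pipeline("automatic-speech-recognition", model=model_id)

# Usage (downloads the model on first run; "audio.wav" is a placeholder):
#   asr = build_asr_pipeline()
#   print(asr("audio.wav")["text"])
```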
Limitations and bias
- Audio length: This model was fine-tuned with audio samples of a maximum length of 20 seconds, so it likely works best for relatively short audio clips of similar length. You can still try it on longer recordings and judge the performance yourself. If you encounter out-of-memory errors with very long audio files, you can use the audio chunking method introduced in this blog post.
- Data domain: A large portion of the fine-tuning data came from the Finnish Parliament dataset, so the model may not generalize well to very different domains, such as everyday spoken Finnish with dialects.
- Gender bias: The datasets' audio tends to be dominated by adult male speakers, so the model may not perform as well on children's and women's speech.
- Language model: The Finnish KenLM language model used in the decoding phase was trained with text data from the audio transcriptions and a subset of Finnish Wikipedia. Thus, it may not generalize to very different language varieties, such as everyday spoken language with dialects. It may be beneficial to train your own KenLM language model for your specific domain.
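The chunking idea mentioned for long recordings can be sketched in plain Python. The 20-second window matches the fine-tuning limit above, but the 2-second overlap is an illustrative choice, not a value from this repository:

```python
def chunk_audio(audio, sample_rate=16_000, chunk_s=20.0, stride_s=2.0):
    """Split a waveform (any sliceable sequence of samples) into
    overlapping chunks of at most `chunk_s` seconds, so each piece
    stays within the ~20 s length the model was fine-tuned on.
    Consecutive chunks overlap by `stride_s` seconds so words cut at
    a chunk boundary appear whole in at least one chunk."""
    chunk_len = int(chunk_s * sample_rate)
    overlap = int(stride_s * sample_rate)
    step = chunk_len - overlap
    if len(audio) <= chunk_len:
        return [audio]
    return [audio[start:start + chunk_len]
            for start in range(0, len(audio) - overlap, step)]
```

Each chunk can then be transcribed separately and the texts stitched together, resolving the overlapping regions.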
Training data
This model was fine-tuned with 276.7 hours of Finnish transcribed speech data from the following datasets:
| Dataset | Hours | % of total hours |
|---|---|---|
| Common Voice 9.0 Finnish train + evaluation + other splits | 10.80 h | 3.90 % |
| Finnish parliament session 2 | 0.24 h | 0.09 % |
| VoxPopuli Finnish | 21.97 h | 7.94 % |
| CSS10 Finnish | 10.32 h | 3.73 % |
| Aalto Finnish Parliament ASR Corpus | 228.00 h | 82.40 % |
| Finnish Broadcast Corpus | 5.37 h | 1.94 % |
The datasets were filtered to include audio samples with a maximum length of 20 seconds.
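The length filter described above can be sketched as a simple predicate over clip durations. The `(num_samples, sample_rate)` pairs here are a hypothetical input format, not the repository's actual data layout:

```python
MAX_SECONDS = 20.0  # maximum clip length used for fine-tuning

def duration_s(num_samples: int, sample_rate: int) -> float:
    """Clip duration in seconds."""
    return num_samples / sample_rate

def filter_by_length(clips, max_s: float = MAX_SECONDS):
    """Keep only (num_samples, sample_rate) clips of at most `max_s` seconds."""
    return [c for c in clips if duration_s(*c) <= max_s]
```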
Training procedure
This model was trained on a Tesla V100 GPU, sponsored by Hugging Face & OVHcloud. The training script was provided by Hugging Face and is available here. We only modified its data loading for our custom datasets.
For the KenLM language model training, we followed the blog post tutorial provided by Hugging Face. The training data for the 5-gram KenLM were text transcriptions of the audio training data and 100k random samples of the cleaned Finnish Wikipedia (August 2021) dataset.
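A hedged sketch of the text-preparation step for such a 5-gram KenLM follows. The normalization rules and file names are illustrative, not the exact preprocessing used for this repository's language model:

```python
import re

def normalize_for_lm(line: str) -> str:
    """Lowercase and strip punctuation/digits so the LM training text
    matches a CTC acoustic model's output vocabulary (letters and spaces)."""
    line = line.lower()
    line = re.sub(r"[^a-zåäö ]", " ", line)  # keep Finnish letters and spaces
    return re.sub(r"\s+", " ", line).strip()

# The cleaned corpus would then be fed to KenLM's lmplz tool, e.g.:
#   lmplz -o 5 < corpus_clean.txt > 5gram.arpa
```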
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-04
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: 8-bit Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 10
- mixed_precision_training: Native AMP
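The linear schedule with 500 warmup steps can be sketched as a plain function. This is a simplified illustration of what the `linear` scheduler computes, assuming 15,000 total training steps as in the results table below:

```python
def linear_lr(step: int, base_lr: float = 1e-4,
              warmup_steps: int = 500, total_steps: int = 15_000) -> float:
    """Linear warmup from 0 to base_lr over `warmup_steps`, then
    linear decay back to 0 at `total_steps`."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```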
The pretrained facebook/wav2vec2-base-fi-voxpopuli-v2 model was initialized with the following hyperparameters:
- attention_dropout: 0.094
- hidden_dropout: 0.047
- feat_proj_dropout: 0.04
- mask_time_prob: 0.082
- layerdrop: 0.041
- activation_dropout: 0.055
- ctc_loss_reduction: "mean"
Training results
| Training Loss | Epoch | Step | Validation Loss | WER |
|---|---|---|---|---|
| 1.575 | 0.33 | 500 | 0.7454 | 0.7048 |
| 0.5838 | 0.66 | 1000 | 0.2377 | 0.2608 |
| 0.5692 | 1.0 | 1500 | 0.2014 | 0.2244 |
| 0.5112 | 1.33 | 2000 | 0.1885 | 0.2013 |
| 0.4857 | 1.66 | 2500 | 0.1881 | 0.2120 |
| 0.4821 | 1.99 | 3000 | 0.1603 | 0.1894 |
| 0.4531 | 2.32 | 3500 | 0.1594 | 0.1865 |
| 0.4411 | 2.65 | 4000 | 0.1641 | 0.1874 |
| 0.4437 | 2.99 | 4500 | 0.1545 | 0.1874 |
| 0.4191 | 3.32 | 5000 | 0.1565 | 0.1770 |
| 0.4158 | 3.65 | 5500 | 0.1696 | 0.1867 |
| 0.4032 | 3.98 | 6000 | 0.1561 | 0.1746 |
| 0.4003 | 4.31 | 6500 | 0.1432 | 0.1749 |
| 0.4059 | 4.64 | 7000 | 0.1390 | 0.1690 |
| 0.4019 | 4.98 | 7500 | 0.1291 | 0.1646 |
| 0.3811 | 5.31 | 8000 | 0.1485 | 0.1755 |
| 0.3955 | 5.64 | 8500 | 0.1351 | 0.1659 |
| 0.3562 | 5.97 | 9000 | 0.1328 | 0.1614 |
| 0.3646 | 6.3 | 9500 | 0.1329 | 0.1584 |
| 0.351 | 6.64 | 10000 | 0.1342 | 0.1554 |
| 0.3408 | 6.97 | 10500 | 0.1422 | 0.1509 |
| 0.3562 | 7.3 | 11000 | 0.1309 | 0.1528 |
| 0.3335 | 7.63 | 11500 | 0.1305 | 0.1506 |
| 0.3491 | 7.96 | 12000 | 0.1365 | 0.1560 |
| 0.3538 | 8.29 | 12500 | 0.1293 | 0.1512 |
| 0.3338 | 8.63 | 13000 | 0.1328 | 0.1511 |
| 0.3509 | 8.96 | 13500 | 0.1304 | 0.1520 |
| 0.3431 | 9.29 | 14000 | 0.1360 | 0.1517 |
| 0.3309 | 9.62 | 14500 | 0.1328 | 0.1514 |
| 0.3252 | 9.95 | 15000 | 0.1316 | 0.1498 |
Framework versions
- Transformers 4.19.1
- Pytorch 1.11.0+cu102
- Datasets 2.2.1
- Tokenizers 0.11.0
Evaluation results
Evaluation was done with the Common Voice 7.0 Finnish test split, Common Voice 9.0 Finnish test split, and the FLEURS ASR Finnish test split.
Note that the training data of this model includes the training splits of Common Voice 9.0, while most of our previous models include Common Voice 7.0. So, we ran tests for both versions. However, Common Voice doesn't seem to fully preserve the test split between dataset versions, so there may be some overlap between the training and test splits of different versions. Thus, the test result comparisons between models trained with different Common Voice versions are not entirely accurate but still meaningful.
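As background for the tables below, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words (CER is the same computed over characters). A minimal sketch, not the exact implementation in eval.py:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: minimum substitutions + insertions + deletions
    to turn `hypothesis` into `reference`, divided by the number of
    reference words (classic dynamic-programming edit distance)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```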
Common Voice 7.0 testing
To evaluate this model, run the eval.py script in this repository:
python3 eval.py --model_id Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned --dataset mozilla-foundation/common_voice_7_0 --config fi --split test
This model (the first row of the table) achieves the following WER (Word Error Rate) and CER (Character Error Rate) results compared to our other models and their parameter counts:
| Model | Model parameters | WER (with LM) | WER (without LM) | CER (with LM) | CER (without LM) |
|---|---|---|---|---|---|
| Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned | 95 million | 5.85 | 13.52 | 1.35 | 2.44 |
| Finnish-NLP/wav2vec2-large-uralic-voxpopuli-v2-finnish | 300 million | 4.13 | 9.66 | 0.90 | 1.66 |
| Finnish-NLP/wav2vec2-xlsr-300m-finnish-lm | 300 million | 8.16 | 17.92 | 1.97 | 3.36 |
| Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm | 1000 million | 5.65 | 13.11 | 1.20 | 2.23 |
| Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2 | 1000 million | 4.09 | 9.73 | 0.88 | 1.65 |
Common Voice 9.0 testing
To evaluate this model, run the eval.py script in this repository:
python3 eval.py --model_id Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned --dataset mozilla-foundation/common_voice_9_0 --config fi --split test
This model (the first row of the table) achieves the following WER (Word Error Rate) and CER (Character Error Rate) results compared to our other models and their parameter counts:
| Model | Model parameters | WER (with LM) | WER (without LM) | CER (with LM) | CER (without LM) |
|---|---|---|---|---|---|
| Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned | 95 million | 5.93 | 14.08 | 1.40 | 2.59 |
| Finnish-NLP/wav2vec2-large-uralic-voxpopuli-v2-finnish | 300 million | 4.13 | 9.83 | 0.92 | 1.71 |
| Finnish-NLP/wav2vec2-xlsr-300m-finnish-lm | 300 million | 7.42 | 16.45 | 1.79 | 3.07 |
| Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm | 1000 million | 5.35 | 13.00 | 1.14 | 2.20 |
| Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2 | 1000 million | 3.72 | 8.96 | 0.80 | 1.52 |
FLEURS ASR testing
To evaluate this model, run the eval.py script in this repository:
python3 eval.py --model_id Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned --dataset google/fleurs --config fi_fi --split test
This model (the first row of the table) achieves the following WER (Word Error Rate) and CER (Character Error Rate) results compared to our other models and their parameter counts:
| Model | Model parameters | WER (with LM) | WER (without LM) | CER (with LM) | CER (without LM) |
|---|---|---|---|---|---|
| Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned | 95 million | 13.99 | ... | 6.07 | ... |
| ... | ... | ... | ... | ... | ... |
📄 License
This project is licensed under the Apache 2.0 license.

