wav2vec2-xlsr-1b-finnish-v2: an open-source model for Finnish automatic speech recognition

Wav2vec2 Xlsr 1b Finnish V2

Developed by aapot

A Finnish automatic speech recognition model fine-tuned from facebook/wav2vec2-xls-r-1b on 275.6 hours of transcribed Finnish speech

Speech Recognition

Transformers

License: Apache-2.0
Tags: Finnish speech recognition, WER 9.73, large-parameter XLS-R

Release Time: 3/2/2022

Model Overview

Speech-to-text model optimized for Finnish, suitable for short audio transcription tasks

Model Features

Large-scale pre-training foundation

Fine-tuned from a 1-billion-parameter model pre-trained on 436,000 hours of multilingual speech

Efficient fine-tuning

Memory-efficient fine-tuning using the 8-bit Adam optimizer

Multi-source training data

Combines 6 Finnish datasets including parliamentary recordings, broadcasts, and Common Voice

Low character error rate

Achieves a 1.65% character error rate on the Common Voice 7.0 Finnish test set (without a language model)

Model Capabilities

Finnish speech recognition

Short audio transcription

Speech-to-text conversion

Use Cases

Speech transcription

Meeting minutes automation

Convert Finnish meeting recordings into text transcripts

Word error rate of 9.73% on the Common Voice 7.0 Finnish test set (without a language model)

Media content subtitle generation

Generate subtitles for Finnish videos/broadcast programs

Character error rate of 1.65% on the Common Voice 7.0 Finnish test set (without a language model)

Voice assistant

Finnish voice command recognition

Supports voice interaction for Finnish smart devices

🚀 Wav2Vec2 XLS-R for Finnish ASR

This acoustic model is designed for Finnish Automatic Speech Recognition (ASR). It is a fine-tuned version of facebook/wav2vec2-xls-r-1b, trained on 275.6 hours of transcribed Finnish speech. Wav2Vec2 XLS-R was introduced in this paper and first released at this page.

Note: There is a version that uses a KenLM language model in the decoding phase, which produces better transcriptions: Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2

✨ Features

Multilingual Pretraining: Wav2Vec2 XLS-R is a large-scale multilingual pretrained model for speech. It was pretrained on 436k hours of unlabeled speech in 128 languages, drawn from VoxPopuli, MLS, CommonVoice, BABEL, and VoxLingua107.
Fine-Tuned for Finnish: Specifically fine-tuned for Finnish ASR using a diverse set of Finnish speech datasets.

📚 Documentation

Model description

Wav2Vec2 XLS-R is Facebook AI's large-scale multilingual pretrained model for speech. It uses the wav2vec 2.0 objective and is pretrained on a large amount of unlabeled speech data from multiple sources. This particular model is a variant of the 1-billion-parameter version, fine-tuned for Finnish ASR.

You can read more about the pretrained model from this blog and this paper.

Intended uses & limitations

How to use

Check the run-finnish-asr-models.ipynb notebook in this repository for a detailed example of how to use this model.
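For readers who only need a quick start, the following is a minimal inference sketch rather than the notebook's code. It assumes the `transformers` and `torch` packages are installed, that the model repository ships a processor, and that the input is mono audio already sampled at 16 kHz (the rate wav2vec 2.0 models expect). The function name `transcribe` and its parameters are hypothetical.

```python
MODEL_ID = "aapot/wav2vec2-xlsr-1b-finnish-v2"
TARGET_SAMPLE_RATE = 16_000  # assumption: mono audio resampled to 16 kHz beforehand


def transcribe(waveform, sample_rate=TARGET_SAMPLE_RATE, model_id=MODEL_ID):
    """Greedy CTC transcription of a single 1-D float waveform (hypothetical helper)."""
    if sample_rate != TARGET_SAMPLE_RATE:
        raise ValueError(
            f"expected {TARGET_SAMPLE_RATE} Hz audio, got {sample_rate} Hz; resample first"
        )

    # Heavy imports are deferred so the module can be imported without them.
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)
    model.eval()

    # Normalize the waveform and wrap it in a batch of size one.
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy decoding: pick the most likely token per frame, then collapse with CTC rules.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

A call such as `transcribe(waveform)` downloads the 1-billion-parameter checkpoint on first use, so expect a large download and significant memory use.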

Limitations and bias

Audio Length: The model was fine-tuned on audio samples of at most 20 seconds, so it likely works best on short clips of similar length. For very long audio files, if you encounter out-of-memory errors, you can use the audio chunking method introduced in [this blog post](https://huggingface.co/blog/asr-chunking).
Domain Generalization: A large portion of the fine-tuning data came from the Finnish Parliament dataset, so the model may not generalize well to other domains, such as everyday spoken Finnish with dialects.
Gender and Age Bias: The audio in the datasets comes mostly from adult male speakers, so the model may perform worse on speech from children and women.
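The chunking approach mentioned under Audio Length can be sketched with the Transformers ASR pipeline, which accepts `chunk_length_s` and `stride_length_s` arguments. This is an illustration under assumptions, not a tuned recipe: the 20-second chunk length is chosen here only because it matches the maximum training clip length, the stride values are illustrative, and `validate_chunking` and `transcribe_long` are hypothetical helper names. Decoding a file path through the pipeline typically also requires ffmpeg.

```python
MODEL_ID = "aapot/wav2vec2-xlsr-1b-finnish-v2"


def validate_chunking(chunk_length_s, stride_length_s):
    """Return the seconds of each chunk that are kept after dropping both strides."""
    left, right = stride_length_s
    if chunk_length_s <= 0:
        raise ValueError("chunk_length_s must be positive")
    if left < 0 or right < 0:
        raise ValueError("strides must be non-negative")
    if left + right >= chunk_length_s:
        raise ValueError("strides must be shorter than the chunk itself")
    return chunk_length_s - left - right


def transcribe_long(audio_path, chunk_length_s=20.0, stride_length_s=(4.0, 2.0)):
    """Transcribe a long file in overlapping chunks (illustrative defaults)."""
    validate_chunking(chunk_length_s, stride_length_s)

    # Deferred import: transformers is only needed when a file is transcribed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    result = asr(
        audio_path,
        chunk_length_s=chunk_length_s,
        stride_length_s=stride_length_s,
    )
    return result["text"]
```

The overlap gives each chunk some left and right context, and the overlapping frames are discarded when the chunk outputs are merged.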

Training data

This model was fine-tuned on 275.6 hours of transcribed Finnish speech from the following datasets:

Dataset	Hours (% of total)
[Common Voice 7.0 Finnish train + evaluation + other splits](https://huggingface.co/datasets/mozilla-foundation/common_voice_7_0)	9.70 h (3.52%)
Finnish parliament session 2	0.24 h (0.09%)
VoxPopuli Finnish	21.97 h (7.97%)
CSS10 Finnish	10.32 h (3.74%)
[Aalto Finnish Parliament ASR Corpus](http://urn.fi/urn:nbn:fi:lb-2021051903)	228.00 h (82.73%)
[Finnish Broadcast Corpus](http://urn.fi/urn:nbn:fi:lb-2016042502)	5.37 h (1.95%)
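As a quick sanity check on the table, the percentages can be recomputed from the hour counts. The sketch below only uses the figures listed above; the variable names are illustrative.

```python
# Hours per dataset, copied from the table above.
HOURS = {
    "Common Voice 7.0 Finnish": 9.70,
    "Finnish parliament session 2": 0.24,
    "VoxPopuli Finnish": 21.97,
    "CSS10 Finnish": 10.32,
    "Aalto Finnish Parliament ASR Corpus": 228.00,
    "Finnish Broadcast Corpus": 5.37,
}

total_hours = sum(HOURS.values())

# Share of each dataset in the total, as a percentage rounded to two decimals.
shares = {name: round(100 * hours / total_hours, 2) for name, hours in HOURS.items()}

for name, share in shares.items():
    print(f"{name}: {HOURS[name]:.2f} h ({share:.2f}%)")
print(f"Total: {total_hours:.2f} h")
```

The recomputed shares make the domain skew concrete: the Aalto Finnish Parliament ASR Corpus alone accounts for more than four fifths of the fine-tuning data.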

The datasets were filtered to include audio samples with a maximum length of 20 seconds.
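The length filter can be illustrated with a small sketch. The actual data-loading code is not shown in this document, so the record layout (`num_samples`, `sampling_rate`) and the helper names below are assumptions made for the example.

```python
MAX_SECONDS = 20.0  # maximum clip length stated above


def duration_seconds(num_samples, sampling_rate):
    """Clip duration derived from the sample count and sampling rate."""
    return num_samples / sampling_rate


def keep_sample(num_samples, sampling_rate, max_seconds=MAX_SECONDS):
    """True if the clip is short enough to be kept for fine-tuning."""
    return duration_seconds(num_samples, sampling_rate) <= max_seconds


# Hypothetical clips: 10 s, 25 s, and exactly 20 s at 16 kHz.
clips = [
    {"id": "a", "num_samples": 160_000, "sampling_rate": 16_000},
    {"id": "b", "num_samples": 400_000, "sampling_rate": 16_000},
    {"id": "c", "num_samples": 320_000, "sampling_rate": 16_000},
]

kept_ids = [
    clip["id"]
    for clip in clips
    if keep_sample(clip["num_samples"], clip["sampling_rate"])
]
print(kept_ids)
```

With a Hugging Face `datasets` object, the same predicate could be passed to its `filter` method.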

Training procedure

This model was trained during the [Robust Speech Challenge Event](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614) organized by Hugging Face. Training was done on a Tesla V100 GPU, sponsored by OVHcloud.

The training script was provided by Hugging Face and is available [here](https://github.com/huggingface/transformers/blob/main/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_bnb.py). Only the data loading was modified for custom datasets.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 32
eval_batch_size: 8
seed: 42
optimizer: 8-bit Adam with betas=(0.9, 0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 10
mixed_precision_training: Native AMP
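To make the optimizer and scheduler settings concrete, here is a hedged sketch, not the training script itself. `linear_lr` reproduces the usual shape of a linear schedule with warmup (ramp up to the base rate, then decay linearly to zero). The total step count is not stated in this document, so the example's `30_000` is a placeholder assumption. `build_optimizer` assumes the `bitsandbytes` package and its `Adam8bit` optimizer; both function names are hypothetical.

```python
BASE_LR = 5e-5      # learning_rate from the list above
WARMUP_STEPS = 500  # lr_scheduler_warmup_steps from the list above


def linear_lr(step, total_steps, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at a given optimizer step for linear warmup then linear decay."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Decay: shrink linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


def build_optimizer(params):
    """8-bit Adam with the betas and epsilon listed above (requires bitsandbytes)."""
    import bitsandbytes as bnb  # deferred: only needed when training on a GPU

    return bnb.optim.Adam8bit(params, lr=BASE_LR, betas=(0.9, 0.999), eps=1e-8)


if __name__ == "__main__":
    total = 30_000  # placeholder assumption, not a value from the model card
    for step in (0, 250, 500, 15_000, 30_000):
        print(step, linear_lr(step, total))
```

The 8-bit optimizer stores its state in reduced precision, which lowers GPU memory use compared with standard Adam.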

The pretrained facebook/wav2vec2-xls-r-1b model was initialized with the following hyperparameters:

attention_dropout: 0.094
hidden_dropout: 0.047
feat_proj_dropout: 0.04
mask_time_prob: 0.082
layerdrop: 0.041
activation_dropout: 0.055
ctc_loss_reduction: "mean"
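One way these values could be applied is as configuration overrides when loading the pretrained checkpoint with `Wav2Vec2ForCTC.from_pretrained`. This is a sketch under assumptions: the `vocab_size` and `pad_token_id` arguments depend on the tokenizer built for the Finnish transcripts, which is not described here, and `load_model_for_finetuning` is a hypothetical helper name.

```python
BASE_MODEL = "facebook/wav2vec2-xls-r-1b"

# Configuration overrides copied from the list above.
XLS_R_OVERRIDES = {
    "attention_dropout": 0.094,
    "hidden_dropout": 0.047,
    "feat_proj_dropout": 0.04,
    "mask_time_prob": 0.082,
    "layerdrop": 0.041,
    "activation_dropout": 0.055,
    "ctc_loss_reduction": "mean",
}


def load_model_for_finetuning(vocab_size, pad_token_id, base_model=BASE_MODEL):
    """Load the pretrained encoder with a fresh CTC head sized for the tokenizer."""
    from transformers import Wav2Vec2ForCTC  # deferred heavy import

    return Wav2Vec2ForCTC.from_pretrained(
        base_model,
        vocab_size=vocab_size,      # assumption: taken from the Finnish tokenizer
        pad_token_id=pad_token_id,  # assumption: the tokenizer's padding token id
        **XLS_R_OVERRIDES,
    )
```

Keyword arguments that match configuration attributes are applied to the model config, so the dropout and masking values take effect during fine-tuning.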

Training results

Training Loss	Epoch	Step	Validation Loss	WER
0.7778	0.17	500	0.2851	0.3572
0.5506	0.34	1000	0.1595	0.2130
0.6569	0.5	1500	0.1458	0.2046
0.5997	0.67	2000	0.1374	0.1975
0.542	0.84	2500	0.1390	0.1956
0.4815	1.01	3000	0.1266	0.1813
0.6982	1.17	3500	0.1441	0.1965
0.4522	1.34	4000	0.1232	0.1822
0.4655	1.51	4500	0.1209	0.1702
0.4069	1.68	5000	0.1149	0.1688
0.4226	1.84	5500	0.1121	0.1560
0.3993	2.01	6000	0.1091	0.1557
0.406	2.18	6500	0.1115	0.1553
0.4098	2.35	7000	0.1144	0.1560
0.3995	2.51	7500	0.1028	0.1476
0.4101	2.68	8000	0.1129	0.1511
0.3636	2.85	8500	0.1025	0.1517
0.3534	3.02	9000	0.1068	0.1480
0.3836	3.18	9500	0.1072	0.1459
0.3531	3.35	10000	0.0928	0.1367
0.3649	3.52	10500	0.1042	0.1426
0.3645	3.69	11000	0.0979	0.1433
0.3685	3.85	11500	0.0947	0.1346
0.3325	4.02	12000	0.0991	0.1352
0.3497	4.19	12500	0.0919	0.1358
0.3303	4.36	13000	0.0888	0.1272
0.3323	4.52	13500	0.0888	0.1277
0.3452	4.69	14000	0.0894	0.1279
0.337	4.86	14500	0.0917	0.1289
0.3114	5.03	15000	0.0942	0.1313
0.3099	5.19	15500	0.0902	0.1239
0.3079	5.36	16000	0.0871	0.1256
0.3293	5.53	16500	0.0861	0.1263
0.3123	5.7	17000	0.0876	0.1203
0.3093	5.86	17500	0.0848	0.1226
0.2903	6.03	18000	0.0914	0.1221
0.297	6.2	18500	0.0841	0.1185
0.2797	6.37	19000	0.0858	0.1165
0.2878	6.53	19500	0.0874	0.1161
0.2974	6.7	20000	0.0835	0.1173
0.3051	6.87	20500	0.0835	0.1178
0.2941	7.04	21000	0.0852	0.1155
0.258	7.21	21500	0.0832	0.1132
0.2778	7.37	22000	0.0829	0.1110
0.2751	7.54	22500	0.0822	0.1069
0.2887	7.71	23000	0.0819	0.1103
0.2509	7.88	23500	0.0787	0.1055
0.2501	8.04	24000	0.0807	0.1076
0.2399	8.21	24500	0.0784	0.1052
0.2539	8.38	25000	0.0772	0.1075
0.248	8.55	25500	0.0772	0.1055
0.2689	8.71	26000	0.0763	0.1027
0.2855	8.88	26500	0.0756	0.1035
0.2421	9.05	27000	0.0771	0.0998
0.2497	9.22	27500	0.0756	0.0971
0.2367	9.38	28000	0.0741	0.0974
0.2473	9.55	28500	0.0739	0.0982
0.2396	9.72	29000	0.0756	0.0991
0.2602	9.89	29500	0.0737	0.0975
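When reading a log like this, it can help to pick out the best checkpoint programmatically. The sketch below uses only the last six rows of the table above and is purely illustrative; the document does not say which checkpoint was published.

```python
# (step, validation_loss, wer) for the last six evaluation rows of the table.
ROWS = [
    (27000, 0.0771, 0.0998),
    (27500, 0.0756, 0.0971),
    (28000, 0.0741, 0.0974),
    (28500, 0.0739, 0.0982),
    (29000, 0.0756, 0.0991),
    (29500, 0.0737, 0.0975),
]

# Checkpoint with the lowest validation WER.
best_by_wer = min(ROWS, key=lambda row: row[2])

# Checkpoint with the lowest validation loss.
best_by_loss = min(ROWS, key=lambda row: row[1])

print("lowest WER:", best_by_wer)
print("lowest validation loss:", best_by_loss)
```

The two criteria disagree here: the lowest WER occurs at step 27500, while the lowest validation loss occurs at step 29500.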

Framework versions

Transformers 4.17.0.dev0
PyTorch 1.10.2+cu102
Datasets 1.18.3
Tokenizers 0.11.0

Evaluation results

Evaluation was done with the [Common Voice 7.0 Finnish test split](https://huggingface.co/datasets/mozilla-foundation/common_voice_7_0).

To evaluate this model, run the eval.py script in this repository:

python3 eval.py --model_id aapot/wav2vec2-xlsr-1b-finnish-v2 --dataset mozilla-foundation/common_voice_7_0 --config fi --split test

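This model corresponds to the first row of the table. Decoded without a language model, it achieves the WER (Word Error Rate) and CER (Character Error Rate) shown in the "without LM" columns; the "with LM" columns apply to the version with the KenLM language model. Results for other models are listed for comparison (all values in %):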

Model	WER (with LM)	WER (without LM)	CER (with LM)	CER (without LM)
aapot/wav2vec2-xlsr-1b-finnish-lm-v2	4.09	9.73	0.88	1.65
aapot/wav2vec2-xlsr-1b-finnish-lm	5.65	13.11	1.20	2.23
aapot/wav2vec2-xlsr-300m-finnish-lm	8.16	17.92	1.97	3.36

Team Members

Aapo Tanskanen, Hugging Face profile, LinkedIn profile
Rasmus Toivanen, Hugging Face profile, LinkedIn profile

Feel free to contact us for more details 🤗

🔧 Technical Details

The model is based on the Wav2Vec2 XLS-R architecture, a multilingual pretrained model for speech. Fine-tuning adjusts the model's parameters to fit the Finnish ASR task, using the hyperparameters and datasets listed above. The 8-bit Adam optimizer and Native AMP mixed-precision training help make training on a Tesla V100 GPU efficient.

📄 License

The model is licensed under the Apache 2.0 license.
