Open-source wav2vec2-xlsr-1b-finnish-lm model - Achieve accurate Finnish speech-to-text conversion with free deployment

Wav2vec2 Xlsr 1b Finnish Lm

Developed by Finnish-NLP

A Finnish automatic speech recognition model fine-tuned from facebook/wav2vec2-xls-r-1b on 259.57 hours of transcribed Finnish speech, for Finnish speech-to-text tasks.

Speech Recognition

Transformers

Open Source License: Apache-2.0 #Finnish speech recognition #High accuracy WER 5.65 #Parliament speech optimization

Downloads 32

Release Time : 3/28/2022

Model Overview

This is an automatic speech recognition model for Finnish, fine-tuned from the 1-billion-parameter Wav2Vec2 XLS-R model and best suited to transcribing short audio clips. It includes a Finnish KenLM language model to improve decoding.

Model Features

Large-scale pre-training foundation

Based on the XLS-R architecture pre-trained with 436,000 hours of multilingual speech data, featuring powerful acoustic feature extraction capabilities.

Domain-adapted fine-tuning

Fine-tuned with 259.57 hours of transcribed Finnish speech, about 88% of it from the Aalto Finnish Parliament ASR Corpus, so it is best matched to parliamentary speech.

Language model enhancement

Includes a 5-gram KenLM language model; on the Common Voice 7.0 Finnish test split it lowers WER from 13.11% to 5.65%.

Efficient inference

Fine-tuned on audio clips of at most 20 seconds; longer audio can be processed with chunking.

Model Capabilities

Finnish speech recognition

Short audio transcription

Decoding with language model

Use Cases

Speech transcription

Parliament meeting minutes

Transcribing Finnish parliamentary speeches

About 88% of the fine-tuning data comes from the Aalto Finnish Parliament ASR Corpus

Broadcast content transcription

Processing Finnish radio program audio

No broadcast-specific score is reported; the 5.65% WER (with LM) was measured on the Common Voice 7.0 Finnish test split

Educational applications

Language learning assistance

Helping learners correct Finnish pronunciation

🚀 Wav2vec2-xls-r-1b for Finnish ASR

This acoustic model is a fine-tuned version of facebook/wav2vec2-xls-r-1b for Finnish Automatic Speech Recognition (ASR). It has been fine-tuned using 259.57 hours of Finnish transcribed speech data. Wav2Vec2 XLS-R was introduced in this paper and first released at this page.

This repository also includes a Finnish KenLM language model used in the decoding phase with the acoustic model.

Note: this model is identical to the aapot/wav2vec2-xlsr-1b-finnish-lm model; it has just been copied/moved to this Finnish-NLP Hugging Face organization.

Note: there is a better V2 version of this model, Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2, which has been fine-tuned for a longer time with an additional 16 hours of data.

🚀 Quick Start

This model is designed for Finnish ASR. To see a detailed example of how to use it, check the run-finnish-asr-models.ipynb notebook in this repository.
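
The notebook remains the reference example. As a rough sketch only, and assuming the `transformers` library is installed (plus `pyctcdecode` and `kenlm` if the bundled language model should be used, and ffmpeg for reading audio files), the model can likely be loaded through the generic ASR pipeline. The helper name `transcribe` and the file name `sample.wav` are hypothetical and do not come from this repository.

```python
MODEL_ID = "Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm"


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file with the Hugging Face ASR pipeline (sketch)."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    # Building the pipeline downloads the model weights on first use.
    asr = pipeline("automatic-speech-recognition", model=model_id)
    # The pipeline returns a dict; its "text" entry holds the transcription.
    return asr(audio_path)["text"]


# Hypothetical usage (not executed here):
#   print(transcribe("sample.wav"))
```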

✨ Features

Fine-tuned for Finnish: Specifically optimized for Finnish ASR with 259.57 hours of Finnish transcribed speech data.
Included Language Model: Comes with a Finnish KenLM language model for the decoding phase.

💻 Usage Examples

Basic Usage

To evaluate this model on the Common Voice 7.0 dataset, run the following command:

python3 eval.py --model_id Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm --dataset mozilla-foundation/common_voice_7_0 --config fi --split test

📚 Documentation

Model Description

Wav2Vec2 XLS-R is a large-scale multilingual pretrained model for speech developed by Facebook AI. It is pretrained on 436k hours of unlabeled speech from various sources such as VoxPopuli, MLS, CommonVoice, BABEL, and VoxLingua107. It uses the wav2vec 2.0 objective across 128 languages.

This model is a fine-tuned version of the pretrained model (1 billion parameter variant) for Finnish ASR. You can read more about the pretrained model from this blog and this paper.

Intended Uses & Limitations

Intended Use

You can use this model for Finnish ASR (speech-to-text) tasks.

Limitations and Bias

Audio Length: This model was fine-tuned with audio samples of at most 20 seconds. It likely performs best on short clips of similar length, but you can also try it on longer audio. If you encounter out-of-memory errors with very long audio files, you can use the audio chunking method introduced in [this blog post](https://huggingface.co/blog/asr-chunking).
Data Domain: A large portion of the fine-tuning data came from the Finnish Parliament dataset, so the model may not generalize well to other domains such as everyday spoken Finnish with dialects.
Gender and Age Bias: The audio in the datasets is mostly from adult males, so the model may not work as well on speech from children and women.
Language Model Generalization: The Finnish KenLM language model used in the decoding phase was trained on the text of the audio transcriptions. It may not generalize well to other language styles, such as everyday spoken language with dialects. It might be beneficial to train your own KenLM language model for your specific domain.
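
The chunking idea mentioned under Audio Length can be illustrated with a minimal sketch. It only computes overlapping windows over a long recording; it is not the implementation used by the Hugging Face pipeline, and the function name `chunk_bounds` and the 16 kHz example rate are assumptions made for illustration.

```python
def chunk_bounds(num_samples: int, chunk_len: int, stride: int) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of overlapping chunks covering the audio.

    Each chunk overlaps its neighbours by 2 * stride samples, so the edges of a
    chunk (where the model has little context) can be discarded when merging.
    """
    if chunk_len <= 2 * stride:
        raise ValueError("chunk_len must be larger than twice the stride")
    step = chunk_len - 2 * stride
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break
        start += step
    return bounds


# Example: 50 s of audio at an assumed 16 kHz, 20 s chunks, 2 s stride per side.
SAMPLING_RATE = 16_000
windows = chunk_bounds(50 * SAMPLING_RATE, 20 * SAMPLING_RATE, 2 * SAMPLING_RATE)
```

With the ASR pipeline itself, the `chunk_length_s` and `stride_length_s` arguments serve a similar purpose, so a hand-written splitter like this is mainly useful for understanding the method.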

Training Data

This model was fine-tuned with 259.57 hours of Finnish transcribed speech data from the following datasets:

| Dataset | Hours | % of total hours |
| --- | --- | --- |
| Common Voice 7.0 Finnish train + evaluation + other splits | 9.70 h | 3.74 % |
| Finnish parliament session 2 | 0.24 h | 0.09 % |
| VoxPopuli Finnish | 5.94 h | 2.29 % |
| CSS10 Finnish | 10.32 h | 3.98 % |
| Aalto Finnish Parliament ASR Corpus | 228.00 h | 87.84 % |
| Finnish Broadcast Corpus | 5.37 h | 2.07 % |
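
As a quick arithmetic check, the percentage column can be recomputed from the hours column. The snippet below is only a sketch using the figures from the table; the shortened dataset keys are labels chosen here for readability.

```python
# Hours per dataset, copied from the table above.
hours = {
    "Common Voice 7.0 Finnish": 9.70,
    "Finnish parliament session 2": 0.24,
    "VoxPopuli Finnish": 5.94,
    "CSS10 Finnish": 10.32,
    "Aalto Finnish Parliament ASR Corpus": 228.00,
    "Finnish Broadcast Corpus": 5.37,
}

total_hours = sum(hours.values())
# Share of each dataset in percent, rounded to two decimals as in the table.
shares = {name: round(100 * h / total_hours, 2) for name, h in hours.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2f} %")
```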

The datasets were filtered to include audio samples with a maximum length of 20 seconds.
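
The exact filtering code is not shown in this card. A minimal sketch of such a length filter, assuming each sample is a dict with `array` and `sampling_rate` keys (the layout used by the `datasets` audio feature), might look like this; the function names are hypothetical.

```python
MAX_SECONDS = 20.0


def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Length of a clip in seconds."""
    return num_samples / sampling_rate


def keep_short(samples: list[dict], max_seconds: float = MAX_SECONDS) -> list[dict]:
    """Keep only the samples whose audio is at most max_seconds long."""
    return [
        s
        for s in samples
        if duration_seconds(len(s["array"]), s["sampling_rate"]) <= max_seconds
    ]


# Toy data with an artificially low sampling rate to keep the arrays small.
toy = [
    {"id": "short", "array": [0.0] * 150, "sampling_rate": 10},  # 15 s
    {"id": "long", "array": [0.0] * 250, "sampling_rate": 10},   # 25 s
]
kept_ids = [s["id"] for s in keep_short(toy)]
```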

Training Procedure

This model was trained during the [Robust Speech Challenge Event](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614) organized by Hugging Face. The training was conducted on a Tesla V100 GPU sponsored by OVHcloud.

The training script was provided by Hugging Face and is available [here](https://github.com/huggingface/transformers/blob/main/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_bnb.py). Only the data loading was modified for custom datasets.

For the KenLM language model training, the [blog post tutorial](https://huggingface.co/blog/wav2vec2-with-ngram) provided by Hugging Face was followed. The training data for the 5-gram KenLM was the text transcriptions of the audio training data.
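
To make the term "5-gram" concrete, the sketch below counts word n-grams in a couple of sentences. It is not KenLM and not the training procedure used here: KenLM estimates smoothed probabilities (modified Kneser-Ney) from such counts, whereas this only illustrates what is being counted. The example sentences and function names are made up.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences: list[str], n: int = 5) -> Counter:
    """Count n-grams over sentences, padding with sentence boundary markers."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical transcriptions, used only to exercise the functions.
transcripts = ["arvoisa puhemies hyvät edustajat", "arvoisa puhemies kiitos"]
bigram_counts = count_ngrams(transcripts, n=2)
```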

Training Hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 32
eval_batch_size: 8
seed: 42
optimizer: 8-bit Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 5
mixed_precision_training: Native AMP
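
The combination of `lr_scheduler_type: linear` and 500 warmup steps can be sketched as a small function: the rate rises linearly to the peak during warmup, then decays linearly to zero. The total step count is not stated in this card, so `total_steps=14_000` below is an assumed round figure for illustration only.

```python
def linear_lr(
    step: int,
    peak_lr: float = 5e-5,
    warmup_steps: int = 500,
    total_steps: int = 14_000,  # assumed value, not taken from the model card
) -> float:
    """Learning rate at a given optimizer step for linear warmup + linear decay."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Decay: scale linearly from the peak down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for s in (0, 250, 500, 7_250, 14_000):
    print(s, linear_lr(s))
```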

The pretrained facebook/wav2vec2-xls-r-1b model was initialized with the following hyperparameters:

attention_dropout: 0.094
hidden_dropout: 0.047
feat_proj_dropout: 0.04
mask_time_prob: 0.082
layerdrop: 0.041
activation_dropout: 0.055
ctc_loss_reduction: "mean"
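
One plausible way to apply such values, assuming the `transformers` library, is to load the base model's configuration and override these fields before fine-tuning. This is a sketch rather than the training script's actual code; `build_config` is a hypothetical helper.

```python
# Regularization and loss settings listed above.
FINETUNE_OVERRIDES = {
    "attention_dropout": 0.094,
    "hidden_dropout": 0.047,
    "feat_proj_dropout": 0.04,
    "mask_time_prob": 0.082,
    "layerdrop": 0.041,
    "activation_dropout": 0.055,
    "ctc_loss_reduction": "mean",
}


def build_config(base_model: str = "facebook/wav2vec2-xls-r-1b"):
    """Load the base configuration and apply the fine-tuning overrides (sketch)."""
    # Imported lazily; calling this function fetches config.json from the Hub.
    from transformers import AutoConfig

    config = AutoConfig.from_pretrained(base_model)
    config.update(FINETUNE_OVERRIDES)
    return config
```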

Training Results

| Training Loss | Epoch | Step | Validation Loss | WER |
| --- | --- | --- | --- | --- |
| 0.968 | 0.18 | 500 | 0.4870 | 0.4720 |
| 0.6557 | 0.36 | 1000 | 0.2450 | 0.2931 |
| 0.647 | 0.54 | 1500 | 0.1818 | 0.2255 |
| 0.5297 | 0.72 | 2000 | 0.1698 | 0.2354 |
| 0.5802 | 0.9 | 2500 | 0.1581 | 0.2355 |
| 0.6351 | 1.07 | 3000 | 0.1689 | 0.2336 |
| 0.4626 | 1.25 | 3500 | 0.1719 | 0.3099 |
| 0.4526 | 1.43 | 4000 | 0.1434 | 0.2069 |
| 0.4692 | 1.61 | 4500 | 0.1645 | 0.2192 |
| 0.4584 | 1.79 | 5000 | 0.1483 | 0.1987 |
| 0.4234 | 1.97 | 5500 | 0.1499 | 0.2178 |
| 0.4243 | 2.15 | 6000 | 0.1345 | 0.2070 |
| 0.4108 | 2.33 | 6500 | 0.1383 | 0.1850 |
| 0.4048 | 2.51 | 7000 | 0.1338 | 0.1811 |
| 0.4085 | 2.69 | 7500 | 0.1290 | 0.1780 |
| 0.4026 | 2.87 | 8000 | 0.1239 | 0.1650 |
| 0.4033 | 3.04 | 8500 | 0.1346 | 0.1657 |
| 0.3986 | 3.22 | 9000 | 0.1310 | 0.1850 |
| 0.3867 | 3.4 | 9500 | 0.1273 | 0.1741 |
| 0.3658 | 3.58 | 10000 | 0.1219 | 0.1672 |
| 0.382 | 3.76 | 10500 | 0.1306 | 0.1698 |
| 0.3847 | 3.94 | 11000 | 0.1230 | 0.1577 |
| 0.3691 | 4.12 | 11500 | 0.1310 | 0.1615 |
| 0.3593 | 4.3 | 12000 | 0.1296 | 0.1622 |
| 0.3619 | 4.48 | 12500 | 0.1285 | 0.1601 |
| 0.3361 | 4.66 | 13000 | 0.1261 | 0.1569 |
| 0.3603 | 4.84 | 13500 | 0.1235 | 0.1533 |
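
Validation loss and WER do not bottom out at the same step, which matters when choosing a checkpoint. The sketch below picks the best row by each metric from a handful of rows copied from the table; it covers only this subset and is not part of the original training code.

```python
# (step, validation loss, WER) for a few rows of the table above.
rows = [
    (500, 0.4870, 0.4720),
    (5000, 0.1483, 0.1987),
    (10000, 0.1219, 0.1672),
    (13000, 0.1261, 0.1569),
    (13500, 0.1235, 0.1533),
]

best_by_loss = min(rows, key=lambda r: r[1])
best_by_wer = min(rows, key=lambda r: r[2])

print("lowest validation loss at step", best_by_loss[0])
print("lowest WER at step", best_by_wer[0])
```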

Framework Versions

Transformers 4.17.0.dev0
Pytorch 1.10.2+cu102
Datasets 1.18.3
Tokenizers 0.11.0

Evaluation Results

Evaluation was performed on the Common Voice 7.0 Finnish test split, Common Voice 9.0 Finnish test split, and the FLEURS ASR Finnish test split.

This model's training data includes the training splits of Common Voice 7.0, while newer models Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned and Finnish-NLP/wav2vec2-large-uralic-voxpopuli-v2-finnish include Common Voice 9.0. Tests were run for both Common Voice versions. Note that Common Voice may not fully preserve the test split between dataset versions, so comparisons between models trained with different Common Voice versions are not entirely accurate but still meaningful.

Common Voice 7.0 Testing

To evaluate this model, run the eval.py script in this repository:

python3 eval.py --model_id Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm --dataset mozilla-foundation/common_voice_7_0 --config fi --split test

This model (the fourth row of the table) achieves the following WER (Word Error Rate) and CER (Character Error Rate) results compared to other models and their parameter counts:

| Model | Model parameters | WER (with LM) | WER (without LM) | CER (with LM) | CER (without LM) |
| --- | --- | --- | --- | --- | --- |
| Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned | 95 million | 5.85 | 13.52 | 1.35 | 2.44 |
| Finnish-NLP/wav2vec2-large-uralic-voxpopuli-v2-finnish | 300 million | 4.13 | 9.66 | 0.90 | 1.66 |
| Finnish-NLP/wav2vec2-xlsr-300m-finnish-lm | 300 million | 8.16 | 17.92 | 1.97 | 3.36 |
| Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm | 1000 million | 5.65 | 13.11 | 1.20 | 2.23 |
| Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2 | 1000 million | 4.09 | 9.73 | 0.88 | 1.65 |
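
For readers unfamiliar with the two metrics, WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The following is a minimal textbook sketch; the repository's eval.py may use a different implementation and text normalization, so scores from this code are not guaranteed to match the table.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

The tables report these rates as percentages, so a WER of 5.65 corresponds to 0.0565 from a function like `wer`.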

Common Voice 9.0 Testing

To evaluate this model, run the eval.py script in this repository:

python3 eval.py --model_id Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm --dataset mozilla-foundation/common_voice_9_0 --config fi --split test

This model (the fourth row of the table) achieves the following WER (Word Error Rate) and CER (Character Error Rate) results compared to other models and their parameter counts:

| Model | Model parameters | WER (with LM) | WER (without LM) | CER (with LM) | CER (without LM) |
| --- | --- | --- | --- | --- | --- |
| Finnish-NLP/wav2vec2-base-fi-voxpopuli-v2-finetuned | 95 million | 5.93 | 14.08 | 1.40 | 2.59 |
| Finnish-NLP/wav2vec2-large-uralic-voxpopuli-v2-finnish | 300 million | 4.13 | 9.83 | 0.92 | 1.71 |
| Finnish-NLP/wav2vec2-xlsr-300m-finnish-lm | 300 million | 7.42 | 16.45 | 1.79 | 3.07 |
| Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm | 1000 million | 5.35 | 13.00 | 1.14 | 2.20 |
| Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm-v2 | 1000 million | 3.72 | 8.96 | 0.80 | 1.52 |
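
The effect of the language model can be read off the tables as a relative WER reduction. The short calculation below uses this model's rows from both Common Voice tables; the helper name is hypothetical.

```python
def relative_reduction(without_lm: float, with_lm: float) -> float:
    """Relative error reduction in percent when decoding with the language model."""
    return 100 * (without_lm - with_lm) / without_lm


# WER values for Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm from the tables above.
cv7_gain = relative_reduction(13.11, 5.65)
cv9_gain = relative_reduction(13.00, 5.35)

print(f"Common Voice 7.0: {cv7_gain:.1f} % lower WER with LM")
print(f"Common Voice 9.0: {cv9_gain:.1f} % lower WER with LM")
```

On both test splits the language model removes a little over half of the word errors for this model.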

FLEURS ASR Testing

The evaluation command for the FLEURS ASR dataset was not fully provided in the original README.

🔧 Technical Details

Model Architecture

The model is based on the Wav2Vec2 XLS-R architecture, a large multilingual speech model. Fine-tuning adapts the 1-billion-parameter pretrained variant to Finnish ASR.

Training Process

The fine-tuning used 259.57 hours of Finnish transcribed speech and ran on a Tesla V100 GPU for 5 epochs with 8-bit Adam and native mixed precision (AMP). The KenLM language model was trained separately on the text transcriptions of the audio training data.

📄 License

This model is licensed under the Apache 2.0 license.
