Norwegian Wav2Vec2 Model - 300M - VoxRex - Nynorsk
This model provides automatic speech recognition for Norwegian Nynorsk. It is fine-tuned on top of the VoxRex feature extractor released by the National Library of Sweden, and achieves strong results on the test set.
Quick Start
To reproduce the results of this model, follow these steps:
- First, verify that you can train a simple Swedish model by following the instructions from 🤗.
- Create a new repo, then copy the files `run.sh` and `run_speech_recognition_ctc.py` from our repo. Running these will generate all the necessary files for reproducing our results. You might even build a better ASR by tweaking the hyperparameters.
✨ Features
- High Performance: Achieves a WER of 0.1222 and a CER of 0.0419 on the test set with a 5-gram KenLM.
- Efficient Training: The finetuning process is not very computationally demanding and can be completed in a few days on an average GPU.
- Language Model Support: Adding a simple 5-gram language model can significantly improve the results.
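For intuition about the metrics reported above, the sketch below computes WER and CER as length-normalized Levenshtein edit distances. This is an illustrative reimplementation, not the evaluation code used for this model, and the example sentences are made up:

```python
# Illustrative WER/CER sketch: both metrics are Levenshtein edit distances
# normalized by the length of the reference (in words and characters).

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling DP row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution (or match when r == h)
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("det er eit godt resultat", "det er et godt resultat"))  # 0.2 (1 of 5 words wrong)
```

A reported WER of 0.1222 thus means that, on average, about 12 word edits are needed per 100 reference words.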
📦 Installation
To train your own model, you need to set up the environment and run the training script with the appropriate parameters. Here are the parameters used for the final model:
```shell
--dataset_name="NbAiLab/NPSC"
--model_name_or_path="KBLab/wav2vec2-large-voxrex"
--dataset_config_name="16K_mp3_nynorsk"
--output_dir="./"
--overwrite_output_dir
--num_train_epochs="80"
--per_device_train_batch_size="16"
--per_device_eval_batch_size="16"
--gradient_accumulation_steps="2"
--learning_rate="1e-4"
--warmup_steps="2000"
--length_column_name="input_length"
--evaluation_strategy="steps"
--text_column_name="text"
--save_steps="500"
--eval_steps="500"
--logging_steps="100"
--layerdrop="0.041"
--attention_dropout="0.094"
--activation_dropout="0.055"
--hidden_dropout="0.047"
--save_total_limit="3"
--freeze_feature_encoder
--feat_proj_dropout="0.04"
--mask_time_prob="0.082"
--mask_time_length="10"
--mask_feature_prob="0.25"
--mask_feature_length="64"
--gradient_checkpointing
--min_duration_in_seconds="0.5"
--max_duration_in_seconds="30.0"
--use_auth_token
--seed="42"
--fp16
--group_by_length
--do_train --do_eval
--push_to_hub
--preprocessing_num_workers="32"
```
Documentation
Model Description
This is one of several Wav2Vec2 models our team created during the 🤗-hosted Robust Speech Event. Here is a list of our models and their final scores:
Dataset
The team converted the Norwegian Parliamentary Speech Corpus (NPSC) to NbAiLab/NPSC in 🤗 Dataset format and used it as the main training source.
Team
The following people contributed to building this model: Rolv-Arild Braaten, Per Egil Kummervold, André Kåsen, Javier de la Rosa, Per Erik Solberg, and Freddy Wetjen.
Training Procedure
We recommend following the instructions from 🤗 to train a simple Swedish model first. Then, create a new repo and copy the necessary files from our repo to reproduce the results.
Language Model
Adding a simple 5-gram language model can improve the results. 🤗 has a blog post explaining how to add a 5-gram language model to an ASR model. You can build the model from your own corpus or copy the 5-gram model from this repo.
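As a toy illustration of why an n-gram language model helps, the sketch below scores candidate transcripts with a bigram model estimated from a tiny corpus. The real setup uses a 5-gram KenLM model wired into the decoder (as in the blog post mentioned above); the corpus and sentences here are made up:

```python
import math
from collections import Counter

# Toy bigram language model with add-one smoothing. A production setup
# would use a 5-gram KenLM model and a CTC beam-search decoder instead.
corpus = "det er bra det er fint det er godt".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def logprob(sentence):
    """Smoothed bigram log-probability of a candidate transcript."""
    words = sentence.split()
    score = 0.0
    for w1, w2 in zip(words, words[1:]):
        score += math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size))
    return score

# The LM prefers the candidate whose word order it has seen before,
# which is exactly how it helps rerank acoustically similar hypotheses.
print(logprob("det er bra") > logprob("det bra er"))  # True
```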
Parameters
Here are some comments on the training parameters:
| Parameter | Comment |
|---|---|
| per_device_train_batch_size | Adjust this to the maximum of available memory. 16 or 24 might be good settings depending on your system. |
| gradient_accumulation_steps | Can be adjusted further up to increase the effective batch size and speed up training without memory issues. |
| learning_rate | Can be increased, maybe as high as 1e-4. Speeds up training but might add instability. |
| epochs | Can be decreased significantly. This is a large dataset and you might get decent results after a few epochs. |
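A quick sanity check on how the batch-size knobs interact (the single-GPU count below is a placeholder assumption, not a description of our training hardware):

```python
# Effective batch size for the settings used in this model's run.sh.
per_device_train_batch_size = 16
gradient_accumulation_steps = 2
num_gpus = 1  # placeholder assumption; multiply by your actual device count

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 32

# Doubling gradient_accumulation_steps doubles the effective batch size
# without increasing per-device memory use, at the cost of fewer optimizer
# steps per epoch.
```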
🔧 Technical Details
The model is fine-tuned on top of the VoxRex model, a pretrained feature extractor from the National Library of Sweden. The fine-tuning process uses the parameters listed above to achieve the reported results.
License
This model is released under the Apache 2.0 license.
Model Index
| Property | Details |
|---|---|
| Model Name | nb-wav2vec2-300m-nynorsk |
| Task | Automatic Speech Recognition |
| Dataset | NPSC (NbAiLab/NPSC with config 16K_mp3_nynorsk) |
| Metrics | Test (Nynorsk) WER: 0.1222; Test (Nynorsk) CER: 0.0419 |
Citation
```bibtex
@inproceedings{de-la-rosa-etal-2023-boosting,
    title = "Boosting {N}orwegian Automatic Speech Recognition",
    author = "De La Rosa, Javier and
      Braaten, Rolv-Arild and
      Kummervold, Per and
      Wetjen, Freddy",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.55",
    pages = "555--564",
    abstract = "In this paper, we present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokm{\aa}l and Nynorsk. We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets. Additionally, we measure the performance of these models against previous state-of-the-art ASR models, as well as on out-of-domain datasets. We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10{\%} to 7.60{\%}, with models achieving 5.81{\%} for Bokm{\aa}l and 11.54{\%} for Nynorsk. We also discuss the challenges and potential solutions for further improving ASR models for Norwegian.",
}
```
See https://arxiv.org/abs/2307.01672