Norwegian Wav2Vec2 Model - 1B Nynorsk
This model is a fine-tuned version of the XLS-R feature extractor from Facebook/Meta. It achieves the following results on the test set when decoded with a 5-gram KenLM language model; the values in parentheses are the results without the language model.
- WER: 0.1132 (0.1364)
- CER: 0.0402 (---)
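WER and CER are edit-distance metrics: the number of word-level (or character-level) insertions, deletions, and substitutions needed to turn the hypothesis into the reference, divided by the reference length. As a quick reference, a minimal implementation looks like the sketch below (this is not the evaluation script used for this model):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits / number of reference characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# One substituted word out of three reference words -> WER of 1/3
print(wer("eg er her", "eg var her"))
```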
✨ Features
- Fine-Tuned on XLS-R: Built on top of the powerful XLS-R feature extractor from Facebook/Meta.
- High Performance: Achieves low Word Error Rate (WER) and Character Error Rate (CER) on the test set.
📦 Installation
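The original card lists no installation steps. As a hedged sketch, a typical setup for running this model uses the 🤗 Transformers library with a PyTorch backend (package choices here are assumptions, not from the original card):

```shell
# Core dependencies for inference with this model
pip install transformers torch
# Optional: CTC decoding with an n-gram language model
pip install pyctcdecode
```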
💻 Usage Examples
📚 Documentation
Model Description
This is one of several Wav2Vec models developed by our team during the 🤗 hosted Robust Speech Event. Here is a comprehensive list of our models and their final scores:
| Model | Final WER |
|---|---|
| [NbAiLab/nb-wav2vec2-1b-bokmaal](https://huggingface.co/NbAiLab/nb-wav2vec2-1b-bokmaal) | 6.33 |
| [NbAiLab/nb-wav2vec2-300m-bokmaal](https://huggingface.co/NbAiLab/nb-wav2vec2-300m-bokmaal) | 7.03 |
| NbAiLab/nb-wav2vec2-1b-nynorsk (this model) | 11.32 |
| [NbAiLab/nb-wav2vec2-300m-nynorsk](https://huggingface.co/NbAiLab/nb-wav2vec2-300m-nynorsk) | 12.22 |
Dataset
In parallel with the event, our team converted the [Norwegian Parliamentary Speech Corpus (NPSC)](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-58/) into the NbAiLab/NPSC in 🤗 Dataset format, which served as the primary training source.
Code
We have made all the code developed during the event publicly available. This enables the Norwegian NLP community to build upon it for developing even better Norwegian Automatic Speech Recognition (ASR) models. The fine - tuning of these models is not overly computationally intensive. After following the instructions provided, you should be able to train your own ASR system in less than a day using an average GPU.
Team
The following individuals contributed to the development of this model: Rolv-Arild Braaten, Javier de la Rosa, and Freddy Wetjen.
Training Procedure
To reproduce our results, we strongly recommend following the [instructions from 🤗](https://github.com/huggingface/transformers/tree/master/examples/research_projects/robust-speech-event#talks) to train a simple Swedish model.
Once you have verified your ability to do this, create a new repository. You can start by copying the files `run.sh` and `run_speech_recognition_ctc.py` from our repository. Running these files will generate all the necessary files and allow you to reproduce our results. By adjusting the hyperparameters, you may even be able to build a superior ASR model. Good luck!
Language Model
As the scores indicate, incorporating a simple 5-gram language model can enhance the results. 🤗 has published another [helpful blog](https://huggingface.co/blog/wav2vec2-with-ngram) explaining how to add a 5-gram language model to improve the ASR model. You can build this model from your own corpus, for example, by extracting suitable text from the Norwegian Colossal Corpus. Alternatively, you can skip some steps in the guide and copy the [5-gram model from this repo](https://huggingface.co/NbAiLab/XLSR-300M-bokmaal/tree/main/language_model).
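For orientation, building a 5-gram ARPA model with the KenLM command-line tools follows this general shape (a sketch, assuming KenLM is built locally; `corpus.txt` and the output file names are placeholders, not files from this repository):

```shell
# Estimate a 5-gram model from a plain-text corpus (one sentence per line)
lmplz -o 5 < corpus.txt > 5gram.arpa
# Convert to KenLM's binary format for faster loading at decode time
build_binary 5gram.arpa 5gram.bin
```

See the blog post linked above for attaching the resulting model to a Wav2Vec2 processor.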
Parameters
The final model was trained using the following parameters:
```bash
--dataset_name="NbAiLab/NPSC"
--model_name_or_path="facebook/wav2vec2-xls-r-1b"
--dataset_config_name="16K_mp3_nynorsk"
--output_dir="./"
--overwrite_output_dir
--num_train_epochs="40"
--per_device_train_batch_size="12"
--per_device_eval_batch_size="12"
--gradient_accumulation_steps="2"
--learning_rate="2e-5"
--warmup_steps="2000"
--length_column_name="input_length"
--evaluation_strategy="steps"
--text_column_name="text"
--save_steps="500"
--eval_steps="500"
--logging_steps="100"
--layerdrop="0.041"
--attention_dropout="0.094"
--activation_dropout="0.055"
--hidden_dropout="0.047"
--save_total_limit="3"
--freeze_feature_encoder
--feat_proj_dropout="0.04"
--mask_time_prob="0.082"
--mask_time_length="10"
--mask_feature_prob="0.25"
--mask_feature_length="64"
--gradient_checkpointing
--min_duration_in_seconds="0.5"
--max_duration_in_seconds="30.0"
--ctc_zero_infinity=True
--use_auth_token
--seed="42"
--fp16
--group_by_length
--do_train --do_eval
--push_to_hub
--preprocessing_num_workers="16"
```
Using these settings, training may take 3-4 days on an average GPU. However, you can obtain a decent model more quickly by adjusting these parameters.
| Parameter | Comment |
|---|---|
| per_device_train_batch_size | Adjust this to the maximum of available memory. 16 or 24 might be good settings depending on your system |
| gradient_accumulation_steps | Can be adjusted even further up to increase batch size and speed up training without running into memory issues |
| learning_rate | Can be increased, maybe as high as 1e-4. Speeds up training but might add instability |
| epochs | Can be decreased significantly. This is a huge dataset and you might get a decent result already after a couple of epochs |
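When tuning the batch-size parameters, keep in mind that the effective batch size per optimizer step is the per-device batch size times the gradient accumulation steps times the number of GPUs. A quick sanity check:

```python
def effective_batch_size(per_device: int, accum_steps: int, n_gpus: int = 1) -> int:
    """Total number of examples contributing to one optimizer step."""
    return per_device * accum_steps * n_gpus

# Settings from the training command above, on a single GPU: 12 * 2 = 24
print(effective_batch_size(12, 2))
```

Raising `gradient_accumulation_steps` trades a larger effective batch for more forward/backward passes per step, so it increases the batch size without increasing peak memory.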
Citation
```bibtex
@inproceedings{de-la-rosa-etal-2023-boosting,
    title = "Boosting {N}orwegian Automatic Speech Recognition",
    author = "De La Rosa, Javier and
      Braaten, Rolv-Arild and
      Kummervold, Per and
      Wetjen, Freddy",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.55",
    pages = "555--564",
    abstract = "In this paper, we present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokm{\aa}l and Nynorsk. We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets. Additionally, we measure the performance of these models against previous state-of-the-art ASR models, as well as on out-of-domain datasets. We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10{\%} to 7.60{\%}, with models achieving 5.81{\%} for Bokm{\aa}l and 11.54{\%} for Nynorsk. We also discuss the challenges and potential solutions for further improving ASR models for Norwegian.",
}
```
See https://arxiv.org/abs/2307.01672
🔧 Technical Details
The model is fine-tuned on top of the feature extractor XLS-R from Facebook/Meta. It uses a 5-gram KenLM to improve the results on the test set. The training parameters are carefully selected to balance performance and training time.
📄 License
This model is released under the Apache 2.0 license.
| Property | Details |
|---|---|
| Model Type | Norwegian Wav2Vec2 Model - 1B Nynorsk |
| Training Data | NbAiLab/NPSC |

