Norwegian Wav2Vec2 Model - 1B Nynorsk
This model is a fine-tuned version of the XLS-R feature extractor from Facebook/Meta. It achieves the following results on the test set when decoded with a 5-gram KenLM language model; the values in parentheses are the results without the language model.
- WER: 0.1132 (0.1364)
- CER: 0.0402 (---)
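WER and CER are edit-distance metrics: the number of word-level (or character-level) insertions, deletions, and substitutions needed to turn the hypothesis into the reference, divided by the reference length. As a quick reference, a minimal implementation looks like the sketch below (this is not the evaluation script used for this model):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits / number of reference characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# One substituted word out of three reference words -> WER of 1/3
print(wer("eg er her", "eg var her"))
```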
✨ Features
- Fine-Tuned on XLS-R: Built on top of the powerful XLS-R feature extractor from Facebook/Meta.
- High Performance: Achieves low Word Error Rate (WER) and Character Error Rate (CER) on the test set.
📦 Installation
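The original card lists no installation steps. As a hedged sketch, a typical setup for running this model uses the 🤗 Transformers library with a PyTorch backend (package choices here are assumptions, not from the original card):

```shell
# Core dependencies for inference with this model
pip install transformers torch
# Optional: CTC decoding with an n-gram language model
pip install pyctcdecode
```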
💻 Usage Examples
📚 Documentation
Model Description
This is one of several Wav2Vec models developed by our team during the 🤗 hosted Robust Speech Event. Here is a comprehensive list of our models and their final scores:
| Model | Final WER |
|---|---|
| [NbAiLab/nb-wav2vec2-1b-bokmaal](https://huggingface.co/NbAiLab/nb-wav2vec2-1b-bokmaal) | 6.33 |
| [NbAiLab/nb-wav2vec2-300m-bokmaal](https://huggingface.co/NbAiLab/nb-wav2vec2-300m-bokmaal) | 7.03 |
| NbAiLab/nb-wav2vec2-1b-nynorsk (this model) | 11.32 |
| [NbAiLab/nb-wav2vec2-300m-nynorsk](https://huggingface.co/NbAiLab/nb-wav2vec2-300m-nynorsk) | 12.22 |
Dataset
In parallel with the event, our team converted the [Norwegian Parliamentary Speech Corpus (NPSC)](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-58/) into the NbAiLab/NPSC in 🤗 Dataset format, which served as the primary training source.
Code
We have made all the code developed during the event publicly available. This enables the Norwegian NLP community to build upon it for developing even better Norwegian Automatic Speech Recognition (ASR) models. The fine - tuning of these models is not overly computationally intensive. After following the instructions provided, you should be able to train your own ASR system in less than a day using an average GPU.
Team
The following individuals contributed to the development of this model: Rolv-Arild Braaten, Javier de la Rosa, and Freddy Wetjen.
Training Procedure
To reproduce our results, we strongly recommend following the [instructions from 🤗](https://github.com/huggingface/transformers/tree/master/examples/research_projects/robust-speech-event#talks) to train a simple Swedish model.
Once you have verified your ability to do this, create a new repository. You can start by copying the files `run.sh` and `run_speech_recognition_ctc.py` from our repository. Running these files will generate all the necessary files and allow you to reproduce our results. By adjusting the hyperparameters, you may even be able to build a superior ASR model. Good luck!
Language Model
As the scores indicate, incorporating a simple 5-gram language model can enhance the results. 🤗 has published another [helpful blog](https://huggingface.co/blog/wav2vec2-with-ngram) explaining how to add a 5-gram language model to improve the ASR model. You can build this model from your own corpus, for example, by extracting suitable text from the Norwegian Colossal Corpus. Alternatively, you can skip some steps in the guide and copy the [5-gram model from this repo](https://huggingface.co/NbAiLab/XLSR-300M-bokmaal/tree/main/language_model).
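For orientation, building a 5-gram ARPA model with the KenLM command-line tools follows this general shape (a sketch, assuming KenLM is built locally; `corpus.txt` and the output file names are placeholders, not files from this repository):

```shell
# Estimate a 5-gram model from a plain-text corpus (one sentence per line)
lmplz -o 5 < corpus.txt > 5gram.arpa
# Convert to KenLM's binary format for faster loading at decode time
build_binary 5gram.arpa 5gram.bin
```

See the blog post linked above for attaching the resulting model to a Wav2Vec2 processor.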
Parameters
The final model was trained using the following parameters:
```bash
--dataset_name="NbAiLab/NPSC"
--model_name_or_path="facebook/wav2vec2-xls-r-1b"
--dataset_config_name="16K_mp3_nynorsk"
--output_dir="./"
--overwrite_output_dir
--num_train_epochs="40"
--per_device_train_batch_size="12"
--per_device_eval_batch_size="12"
--gradient_accumulation_steps="2"
--learning_rate="2e-5"
--warmup_steps="2000"
--length_column_name="input_length"
--evaluation_strategy="steps"
--text_column_name="text"
--save_steps="500"
--eval_steps="500"
--logging_steps="100"
--layerdrop="0.041"
--attention_dropout="0.094"
--activation_dropout="0.055"
--hidden_dropout="0.047"
--save_total_limit="3"
--freeze_feature_encoder
--feat_proj_dropout="0.04"
--mask_time_prob="0.082"
--mask_time_length="10"
--mask_feature_prob="0.25"
--mask_feature_length="64"
--gradient_checkpointing
--min_duration_in_seconds="0.5"
--max_duration_in_seconds="30.0"
--ctc_zero_infinity=True
--use_auth_token
--seed="42"
--fp16
--group_by_length
--do_train --do_eval
--push_to_hub
--preprocessing_num_workers="16"
```
Using these settings, training may take 3-4 days on an average GPU. However, you can obtain a decent model more quickly by adjusting these parameters.
| Parameter | Comment |
|---|---|
| per_device_train_batch_size | Adjust this to the maximum of available memory. 16 or 24 might be good settings depending on your system |
| gradient_accumulation_steps | Can be adjusted even further up to increase batch size and speed up training without running into memory issues |
| learning_rate | Can be increased, maybe as high as 1e-4. Speeds up training but might add instability |
| epochs | Can be decreased significantly. This is a huge dataset and you might get a decent result already after a couple of epochs |
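When tuning the batch-size parameters, keep in mind that the effective batch size per optimizer step is the per-device batch size times the gradient accumulation steps times the number of GPUs. A quick sanity check:

```python
def effective_batch_size(per_device: int, accum_steps: int, n_gpus: int = 1) -> int:
    """Total number of examples contributing to one optimizer step."""
    return per_device * accum_steps * n_gpus

# Settings from the training command above, on a single GPU: 12 * 2 = 24
print(effective_batch_size(12, 2))
```

Raising `gradient_accumulation_steps` trades a larger effective batch for more forward/backward passes per step, so it increases the batch size without increasing peak memory.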
Citation
```bibtex
@inproceedings{de-la-rosa-etal-2023-boosting,
    title = "Boosting {N}orwegian Automatic Speech Recognition",
    author = "De La Rosa, Javier and
      Braaten, Rolv-Arild and
      Kummervold, Per and
      Wetjen, Freddy",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.55",
    pages = "555--564",
    abstract = "In this paper, we present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokm{\aa}l and Nynorsk. We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets. Additionally, we measure the performance of these models against previous state-of-the-art ASR models, as well as on out-of-domain datasets. We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10{\%} to 7.60{\%}, with models achieving 5.81{\%} for Bokm{\aa}l and 11.54{\%} for Nynorsk. We also discuss the challenges and potential solutions for further improving ASR models for Norwegian.",
}
```
See https://arxiv.org/abs/2307.01672
🔧 Technical Details
The model is fine-tuned on top of the feature extractor XLS-R from Facebook/Meta. It uses a 5-gram KenLM to improve the results on the test set. The training parameters are carefully selected to balance performance and training time.
📄 License
This model is released under the Apache 2.0 license.
| Property | Details |
|---|---|
| Model Type | Norwegian Wav2Vec2 Model - 1B Nynorsk |
| Training Data | NbAiLab/NPSC |

