Norwegian Wav2Vec2 Model - 300M - VoxRex - Nynorsk
This model provides automatic speech recognition for Norwegian Nynorsk. It is fine-tuned on top of the VoxRex feature extractor released by the National Library of Sweden, and achieves strong results on the test set.
Quick Start
To reproduce the results of this model, follow these steps:
- First, verify that you can train a simple Swedish model by following the instructions from 🤗.
- Create a new repo, then copy the files `run.sh` and `run_speech_recognition_ctc.py` from our repo. Running these will generate all the necessary files for reproducing our results. You might even build a better ASR by tweaking the hyperparameters.
✨ Features
- High Performance: Achieves a WER of 0.1222 and a CER of 0.0419 on the test set with a 5-gram KenLM.
- Efficient Training: The finetuning process is not very computationally demanding and can be completed in a few days on an average GPU.
- Language Model Support: Adding a simple 5-gram language model can significantly improve the results.
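For intuition about the metrics reported above, the sketch below computes WER and CER as length-normalized Levenshtein edit distances. This is an illustrative reimplementation, not the evaluation code used for this model, and the example sentences are made up:

```python
# Illustrative WER/CER sketch: both metrics are Levenshtein edit distances
# normalized by the length of the reference (in words and characters).

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling DP row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution (or match when r == h)
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("det er eit godt resultat", "det er et godt resultat"))  # 0.2 (1 of 5 words wrong)
```

A reported WER of 0.1222 thus means that, on average, about 12 word edits are needed per 100 reference words.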
📦 Installation
To train your own model, you need to set up the environment and run the training script with the appropriate parameters. Here are the parameters used for the final model:
```shell
--dataset_name="NbAiLab/NPSC"
--model_name_or_path="KBLab/wav2vec2-large-voxrex"
--dataset_config_name="16K_mp3_nynorsk"
--output_dir="./"
--overwrite_output_dir
--num_train_epochs="80"
--per_device_train_batch_size="16"
--per_device_eval_batch_size="16"
--gradient_accumulation_steps="2"
--learning_rate="1e-4"
--warmup_steps="2000"
--length_column_name="input_length"
--evaluation_strategy="steps"
--text_column_name="text"
--save_steps="500"
--eval_steps="500"
--logging_steps="100"
--layerdrop="0.041"
--attention_dropout="0.094"
--activation_dropout="0.055"
--hidden_dropout="0.047"
--save_total_limit="3"
--freeze_feature_encoder
--feat_proj_dropout="0.04"
--mask_time_prob="0.082"
--mask_time_length="10"
--mask_feature_prob="0.25"
--mask_feature_length="64"
--gradient_checkpointing
--min_duration_in_seconds="0.5"
--max_duration_in_seconds="30.0"
--use_auth_token
--seed="42"
--fp16
--group_by_length
--do_train --do_eval
--push_to_hub
--preprocessing_num_workers="32"
```
Documentation
Model Description
This is one of several Wav2Vec2 models our team created during the 🤗-hosted Robust Speech Event. Here is a list of our models and their final scores:
Dataset
The team converted the Norwegian Parliamentary Speech Corpus (NPSC) to NbAiLab/NPSC in 🤗 Dataset format and used it as the main training source.
Team
The following people contributed to building this model: Rolv-Arild Braaten, Per Egil Kummervold, André Kåsen, Javier de la Rosa, Per Erik Solberg, and Freddy Wetjen.
Training Procedure
We recommend following the instructions from 🤗 to train a simple Swedish model first. Then, create a new repo and copy the necessary files from our repo to reproduce the results.
Language Model
Adding a simple 5-gram language model can improve the results. 🤗 has a blog post explaining how to add a 5-gram language model to an ASR model. You can build the model from your own corpus or copy the 5-gram model from this repo.
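As a toy illustration of why an n-gram language model helps, the sketch below scores candidate transcripts with a bigram model estimated from a tiny corpus. The real setup uses a 5-gram KenLM model wired into the decoder (as in the blog post mentioned above); the corpus and sentences here are made up:

```python
import math
from collections import Counter

# Toy bigram language model with add-one smoothing. A production setup
# would use a 5-gram KenLM model and a CTC beam-search decoder instead.
corpus = "det er bra det er fint det er godt".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def logprob(sentence):
    """Smoothed bigram log-probability of a candidate transcript."""
    words = sentence.split()
    score = 0.0
    for w1, w2 in zip(words, words[1:]):
        score += math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size))
    return score

# The LM prefers the candidate whose word order it has seen before,
# which is exactly how it helps rerank acoustically similar hypotheses.
print(logprob("det er bra") > logprob("det bra er"))  # True
```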
Parameters
Here are some comments on the training parameters:
| Parameter | Comment |
|---|---|
| per_device_train_batch_size | Adjust this to the maximum of available memory. 16 or 24 might be good settings depending on your system. |
| gradient_accumulation_steps | Can be adjusted further up to increase the effective batch size and speed up training without memory issues. |
| learning_rate | Can be increased, maybe as high as 1e-4. Speeds up training but might add instability. |
| epochs | Can be decreased significantly. This is a large dataset and you might get decent results after a few epochs. |
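A quick sanity check on how the batch-size knobs interact (the single-GPU count below is a placeholder assumption, not a description of our training hardware):

```python
# Effective batch size for the settings used in this model's run.sh.
per_device_train_batch_size = 16
gradient_accumulation_steps = 2
num_gpus = 1  # placeholder assumption; multiply by your actual device count

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 32

# Doubling gradient_accumulation_steps doubles the effective batch size
# without increasing per-device memory use, at the cost of fewer optimizer
# steps per epoch.
```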
🔧 Technical Details
The model is fine-tuned on top of the VoxRex model, a pretrained feature extractor from the National Library of Sweden. The fine-tuning process uses the parameters listed above to achieve the reported results.
License
This model is released under the Apache 2.0 license.
Model Index
| Property | Details |
|---|---|
| Model Name | nb-wav2vec2-300m-nynorsk |
| Task | Automatic Speech Recognition |
| Dataset | NPSC (NbAiLab/NPSC with config 16K_mp3_nynorsk) |
| Metrics | Test (Nynorsk) WER: 0.1222; Test (Nynorsk) CER: 0.0419 |
Citation
```bibtex
@inproceedings{de-la-rosa-etal-2023-boosting,
    title = "Boosting {N}orwegian Automatic Speech Recognition",
    author = "De La Rosa, Javier and
      Braaten, Rolv-Arild and
      Kummervold, Per and
      Wetjen, Freddy",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.55",
    pages = "555--564",
    abstract = "In this paper, we present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokm{\aa}l and Nynorsk. We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets. Additionally, we measure the performance of these models against previous state-of-the-art ASR models, as well as on out-of-domain datasets. We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10{\%} to 7.60{\%}, with models achieving 5.81{\%} for Bokm{\aa}l and 11.54{\%} for Nynorsk. We also discuss the challenges and potential solutions for further improving ASR models for Norwegian.",
}
```
See https://arxiv.org/abs/2307.01672