🚀 XLS-R-based CTC model with 5-gram language model from Open Subtitles
This is an ASR model based on XLS-R, fine-tuned on multiple datasets. It adds a 5-gram language model based on the Open Subtitles Dutch corpus, achieving excellent results in Dutch speech recognition.
🚀 Quick Start
Prerequisites
Evaluating this model requires `apt install libhunspell-dev` and a pip install of `hunspell`, in addition to pip installs of `pypi-kenlm` and `pyctcdecode` (see `install_requirements.sh`). Also, the chunking lengths and strides were optimized for the model as `12s` and `2s` respectively (see `eval.sh`).
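For illustration, these chunking settings can be passed directly to the 🤗 Transformers ASR pipeline. This is a minimal sketch, not the official decoding script: the repository id and audio file name below are placeholders.

```python
# Minimal sketch: chunked inference with the Transformers ASR pipeline.
# The model id and audio file below are placeholders, not official artifacts.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="your-username/xls-r-nl-5gram-os",  # placeholder for this repo's id
    chunk_length_s=12,  # chunk length optimized for this model (see eval.sh)
    stride_length_s=2,  # stride on each side of a chunk
)

print(asr("sample.wav")["text"])  # assumed 16kHz audio file
```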
Usage
For best results, when using the model locally for inference, please use the code in the `eval.py` decoding script.
✨ Features
- Fine-tuned on Multiple Datasets: The model is mainly fine-tuned on the CGN dataset and the `mozilla-foundation/common_voice_8_0` (Dutch) dataset.
- 5-gram Language Model: A large 5-gram language model based on the Open Subtitles Dutch corpus is added to improve recognition accuracy.
- Typo Handling: `hunspell` is used to handle typos, proposing alternative spellings and reranking them.
📦 Installation
To install the necessary dependencies, follow the instructions in `install_requirements.sh`:

```bash
apt install libhunspell-dev
pip install hunspell pypi-kenlm pyctcdecode
```
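As a quick sanity check after installation, the following imports should succeed; the module names below correspond to the packages installed above.

```python
# Sanity check: these imports should succeed after the commands above.
import hunspell      # spell-checker bindings (needs libhunspell-dev)
import kenlm         # n-gram language model scorer (from pypi-kenlm)
import pyctcdecode   # beam-search CTC decoder

print("All decoding dependencies are importable.")
```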
📚 Documentation
Model description
The model takes 16kHz sound input and uses a Wav2Vec2ForCTC decoder with 48 letters to output the letter-transcription probabilities per frame.

To improve accuracy, a beam-search decoder based on `pyctcdecode` is then used; it reranks the most promising alignments based on a 5-gram language model trained on the Open Subtitles Dutch corpus.
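For illustration, here is a minimal sketch of how such a beam-search decoder can be assembled with `pyctcdecode`; the repository id, the audio file, and the path to the 5-gram ARPA file are assumptions, and the actual decoding logic used by this model lives in `eval.py`.

```python
# Sketch: CTC logits reranked by a KenLM 5-gram model via pyctcdecode.
# Model id, audio file, and ARPA path are placeholders/assumptions.
import soundfile as sf
import torch
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "your-username/xls-r-nl-5gram-os"  # placeholder repo id
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# pyctcdecode needs the CTC labels in vocabulary-index order.
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
decoder = build_ctcdecoder(labels, kenlm_model_path="5gram_os_nl.arpa")

speech, _ = sf.read("sample.wav")  # assumed 16kHz mono recording
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0].numpy()

print(decoder.decode(logits))  # beam search reranked by the 5-gram LM
```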
To further deal with typos, `hunspell` is used to propose alternative spellings for words not in the unigrams of the language model. These alternatives are then reranked based on the same language model, plus a penalty proportional to the Levenshtein edit distance between the alternative and the recognized word. This, for example, enables correcting `collegas` into `collega's` or `gogol` into `google`.
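The sketch below illustrates the general idea of this reranking step (it is not the exact `eval.py` implementation): `hunspell` suggests candidates for a word, and each candidate is scored by the language model minus an edit-distance penalty. The dictionary paths, the ARPA path, and the penalty weight are assumptions.

```python
# Illustrative sketch of hunspell-based typo reranking (not the eval.py code).
# Dictionary paths, ARPA path, and the weight ALPHA are assumptions.
import hunspell
import kenlm

lm = kenlm.Model("5gram_os_nl.arpa")
spell = hunspell.HunSpell("/usr/share/hunspell/nl_NL.dic",
                          "/usr/share/hunspell/nl_NL.aff")
ALPHA = 2.0  # assumed weight of the edit-distance penalty

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fix_word(word: str, context: str) -> str:
    """Rerank hunspell suggestions by LM score minus an edit-distance penalty."""
    candidates = [word] + list(spell.suggest(word))
    return max(candidates,
               key=lambda c: lm.score(f"{context} {c}")
                             - ALPHA * levenshtein(word, c))

print(fix_word("collegas", "ik zag mijn"))  # ideally "collega's"
```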
Intended uses & limitations
This model can be used to transcribe spoken Dutch or Flemish to text (without punctuation).
Training and evaluation data
The model was:
0. initialized with the 2B-parameter model from Facebook.
1. trained 5 epochs (6000 iterations of batch size 32) on the `cv8/nl` dataset.
2. trained 1 epoch (36000 iterations of batch size 32) on the `cgn` dataset.
3. trained 5 epochs (6000 iterations of batch size 32) on the `cv8/nl` dataset.
Framework versions
| Property | Details |
|----------|---------|
| Transformers | 4.16.0 |
| Pytorch | 1.10.2+cu102 |
| Datasets | 1.18.3 |
| Tokenizers | 0.11.0 |
🔧 Technical Details
Results on Evaluation Sets
| Dataset | Task | Metric | Value |
|---------|------|--------|-------|
| Common Voice 8 (nl) | Automatic Speech Recognition | WER | 0.03931 |
| Common Voice 8 (nl) | Automatic Speech Recognition | CER | 0.01224 |
| Robust Speech Event - Dev Data (nl) | Automatic Speech Recognition | WER | 16.35 |
| Robust Speech Event - Dev Data (nl) | Automatic Speech Recognition | CER | 9.64 |
| Robust Speech Event - Test Data (nl) | Automatic Speech Recognition | WER | 15.81 |
Important Notes
⚠️ Important Note
The `hunspell` typo fixer is not enabled on the website, which returns raw CTC+LM results. Hunspell reranking is only available in the `eval.py` decoding script. For best results, please use the code in that file when using the model locally for inference.
⚠️ Quick Remark
The "Robust Speech Event" set does not contain cleaned transcription text, so its WER/CER are vastly over - estimated. For instance 2014
in the dev set is left as a number but will be recognized as tweeduizend veertien
, which counts as 3 mistakes (2014
missing, and both tweeduizend
and veertien
wrongly inserted). Other normalization problems in the dev set include the presence of single quotes around some words, that then end up as non - match despite being the correct word (but without quotes), and the removal of some speech words in the final transcript (ja
, etc...). As a result, our real error rate on the dev set is significantly lower than reported.

You can compare the predictions with the targets on the validation dev set yourself, for example using this diffing tool.
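To make the effect concrete, here is a small sketch computing WER on the `2014` example with the `jiwer` package; the surrounding words are invented for illustration.

```python
# Sketch: an unnormalized reference inflates WER (sentence is invented).
from jiwer import wer

reference = "in 2014 begon het project"
hypothesis = "in tweeduizend veertien begon het project"

# "2014" vs "tweeduizend veertien" is penalized even though the
# transcription is semantically correct.
print(wer(reference, hypothesis))  # > 0 despite a faithful transcription
```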
Acknowledgments
We would like to thank OVH for providing us with a V100S GPU.
Model Development Team
This model was developed during the Robust Speech Recognition challenge event by François REMY (Twitter) and Geoffroy VANDERREYDT.
WE DO SPEECH RECOGNITION: Hello reader! If you are considering using this (or another) model in production, but would benefit from a model fine-tuned specifically for your use case (using text and/or labelled speech), feel free to contact our team.