XLS-R-based CTC model with 5-gram language model from Common Voice
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-2b-22-to-16](https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16) on Dutch datasets, with a 5-gram language model added to improve speech recognition accuracy.
Quick Start
This model can be used to transcribe spoken Dutch, including Flemish, to text (without punctuation).
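A minimal way to try the model is through the transformers automatic-speech-recognition pipeline. The sketch below is illustrative only: the repository id and the audio file path are placeholders, and the bundled 5-gram language model is only used when pyctcdecode and kenlm are installed.

```python
# Hedged sketch: transcribe a Dutch recording with the ASR pipeline.
# "your-username/xls-r-nl-v1-cv8-lm" is a placeholder repository id and
# "dutch_sample.wav" is a placeholder path to a local recording.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="your-username/xls-r-nl-v1-cv8-lm",  # placeholder repository id
)

result = asr("dutch_sample.wav")  # placeholder audio path
print(result["text"])
```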
Features
- The model takes 16 kHz audio input and uses a Wav2Vec2ForCTC decoder with a 48-character vocabulary to produce the transcription.
- To improve accuracy, a beam-search decoder is used, and the beams are scored with a 5-gram language model trained on the Common Voice 8 corpus.
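The two claims above (16 kHz input and a 48-character CTC vocabulary) can be checked directly from the processor. This is a hedged sketch with a placeholder repository id; loading the LM-backed processor requires pyctcdecode and kenlm.

```python
# Hedged sketch: inspect the expected sampling rate and the CTC vocabulary size.
# "your-username/xls-r-nl-v1-cv8-lm" is a placeholder repository id.
from transformers import Wav2Vec2ProcessorWithLM

processor = Wav2Vec2ProcessorWithLM.from_pretrained("your-username/xls-r-nl-v1-cv8-lm")

# Expected input sampling rate (16 kHz for this model).
print(processor.feature_extractor.sampling_rate)

# Size of the CTC output vocabulary (48 characters according to this card).
print(len(processor.tokenizer.get_vocab()))
```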
Installation
No installation steps are provided in the original model card.
Usage Examples
No code examples are provided in the original model card.
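As an illustration, the following hedged sketch shows one typical way to run such a checkpoint at a low level: load a recording, resample it to 16 kHz, run Wav2Vec2ForCTC, and decode the logits with the LM-backed processor. The repository id and audio path are placeholders, and pyctcdecode plus kenlm are assumed to be installed.

```python
# Hedged sketch, not the card's own example: low-level inference with the
# LM-backed processor. The repository id and audio path are placeholders.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

model_id = "your-username/xls-r-nl-v1-cv8-lm"  # placeholder repository id
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Load a recording and resample it to the 16 kHz rate the model expects.
waveform, sample_rate = torchaudio.load("dutch_sample.wav")  # placeholder path
waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(
    waveform.squeeze().numpy(),
    sampling_rate=16_000,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(inputs.input_values).logits

# batch_decode runs a beam search scored by the bundled 5-gram language model.
transcription = processor.batch_decode(logits.numpy()).text[0]
print(transcription)
```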
Documentation
Model Details
The model is a version of [facebook/wav2vec2-xls-r-2b-22-to-16](https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16) fine-tuned mainly on the nl configuration of the mozilla-foundation/common_voice_8_0 dataset. A small 5-gram language model trained on the Common Voice training corpus is added on top. The model's results on the Common Voice 8.0 evaluation set are listed in the Model Index below.
Training and Evaluation Data
- The model was initialized with [the 2B parameter model from Facebook](https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16).
- The model was then trained for 2000 iterations (batch size 32) on the dutch configuration of the multilingual_librispeech dataset (see the loading sketch after this list).
- The model was then trained for 2000 iterations (batch size 32) on [the nl configuration of the common_voice_8_0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0).
- The model was then trained for 6000 iterations (batch size 32) on [the cgn dataset](https://taalmaterialen.ivdnt.org/download/tstc-corpus-gesproken-nederlands/).
- The model was then trained for 6000 iterations (batch size 32) on [the nl configuration of the common_voice_8_0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0).
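To make the data pipeline concrete, the sketch below shows how the two publicly available corpora named above could be loaded with the datasets library. This is an assumption-laden illustration rather than the card's training script: the CGN corpus is distributed separately by the Dutch Language Institute and is not loaded here, and Common Voice 8.0 requires accepting its terms on the Hugging Face Hub.

```python
# Hedged sketch: loading the public training corpora named above with the
# datasets library. The CGN corpus is licensed separately and is not on the Hub.
from datasets import load_dataset

# Dutch configuration of Multilingual LibriSpeech (first training stage).
mls_nl = load_dataset("multilingual_librispeech", "dutch", split="train")

# Dutch (nl) configuration of Common Voice 8.0 (second and fourth stages).
# Requires accepting the dataset terms on the Hub and an authentication token.
cv_nl = load_dataset(
    "mozilla-foundation/common_voice_8_0", "nl", split="train", use_auth_token=True
)

print(mls_nl)
print(cv_nl)
```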
Framework Versions
- Transformers 4.17.0.dev0
- Pytorch 1.10.2+cu102
- Datasets 1.18.2.dev0
- Tokenizers 0.11.0
Model Index
| Property | Details |
|----------|---------|
| Model Name | xls-r-nl-v1-cv8-lm |
| Task | Automatic Speech Recognition |
| Datasets | mozilla-foundation/common_voice_8_0, multilingual_librispeech |
| Metrics on Common Voice 8 (nl) | Test WER: 6.69; Test CER: 1.97 |
| Metrics on Robust Speech Event - Dev Data (nl) | Test WER: 20.79; Test CER: 10.72 |
| Metrics on Robust Speech Event - Test Data (nl) | Test WER: 19.71; Test CER: N/A |
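The WER and CER columns above can in principle be reproduced with the evaluate library (not listed in the framework versions, so this is an assumption). The sketch below only shows the metric calls on toy strings; it is not the evaluation script that produced the numbers in the table.

```python
# Hedged sketch: computing WER and CER with the evaluate library on toy data.
# Not the script used to produce the numbers in the Model Index above.
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["dit is een voorbeeldzin"]    # ground-truth transcripts
predictions = ["dit is een voorbeeld zin"]  # model outputs

print("WER:", wer_metric.compute(references=references, predictions=predictions))
print("CER:", cer_metric.compute(references=references, predictions=predictions))
```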
Technical Details
The model processes 16 kHz audio input with a Wav2Vec2ForCTC head over a 48-character vocabulary. Decoding uses beam search, and the beams are scored with a 5-gram language model trained on the Common Voice 8 corpus.
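The beam search and language-model scoring are handled by pyctcdecode through Wav2Vec2ProcessorWithLM, and decoding parameters such as the beam width and the language-model weight can be passed to batch_decode. The sketch below uses a placeholder repository id, random dummy logits in place of a real forward pass, and illustrative parameter values, not the settings used for this card.

```python
# Hedged sketch: tuning the LM-scored beam search at decode time.
# Placeholder repository id; dummy logits stand in for a real forward pass,
# and the parameter values are illustrative, not this card's settings.
import numpy as np
from transformers import Wav2Vec2ProcessorWithLM

processor = Wav2Vec2ProcessorWithLM.from_pretrained("your-username/xls-r-nl-v1-cv8-lm")

# Random logits shaped (batch, frames, vocabulary size); real logits would
# come from Wav2Vec2ForCTC as in the Usage Examples sketch.
vocab_size = len(processor.tokenizer.get_vocab())
logits = np.random.randn(1, 200, vocab_size).astype(np.float32)

decoded = processor.batch_decode(
    logits,
    beam_width=100,  # number of beams kept during the search
    alpha=0.5,       # weight of the 5-gram language-model score
    beta=1.0,        # word-insertion bonus
)
print(decoded.text[0])
```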
License
No license information is provided in the original model card.