XLS-R-based CTC model with 5-gram language model from Common Voice
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-2b-22-to-16](https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16) on Dutch datasets, with a 5-gram language model added to improve speech recognition accuracy.
Quick Start
This model can be used to transcribe spoken Dutch, including Flemish, to text (without punctuation).
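A minimal way to try the model is through the transformers automatic-speech-recognition pipeline. The sketch below is illustrative only: the repository id and the audio file path are placeholders, and the bundled 5-gram language model is only used when pyctcdecode and kenlm are installed.

```python
# Hedged sketch: transcribe a Dutch recording with the ASR pipeline.
# "your-username/xls-r-nl-v1-cv8-lm" is a placeholder repository id and
# "dutch_sample.wav" is a placeholder path to a local recording.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="your-username/xls-r-nl-v1-cv8-lm",  # placeholder repository id
)

result = asr("dutch_sample.wav")  # placeholder audio path
print(result["text"])
```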
Features
- The model takes 16 kHz audio input and uses a Wav2Vec2ForCTC decoder with a 48-character vocabulary to produce the transcription.
- To improve accuracy, a beam-search decoder is used, and the beams are scored with a 5-gram language model trained on the Common Voice 8 corpus.
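The two claims above (16 kHz input and a 48-character CTC vocabulary) can be checked directly from the processor. This is a hedged sketch with a placeholder repository id; loading the LM-backed processor requires pyctcdecode and kenlm.

```python
# Hedged sketch: inspect the expected sampling rate and the CTC vocabulary size.
# "your-username/xls-r-nl-v1-cv8-lm" is a placeholder repository id.
from transformers import Wav2Vec2ProcessorWithLM

processor = Wav2Vec2ProcessorWithLM.from_pretrained("your-username/xls-r-nl-v1-cv8-lm")

# Expected input sampling rate (16 kHz for this model).
print(processor.feature_extractor.sampling_rate)

# Size of the CTC output vocabulary (48 characters according to this card).
print(len(processor.tokenizer.get_vocab()))
```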
Installation
No installation steps are provided in the original model card.
Usage Examples
No code examples are provided in the original model card.
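As an illustration, the following hedged sketch shows one typical way to run such a checkpoint at a low level: load a recording, resample it to 16 kHz, run Wav2Vec2ForCTC, and decode the logits with the LM-backed processor. The repository id and audio path are placeholders, and pyctcdecode plus kenlm are assumed to be installed.

```python
# Hedged sketch, not the card's own example: low-level inference with the
# LM-backed processor. The repository id and audio path are placeholders.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

model_id = "your-username/xls-r-nl-v1-cv8-lm"  # placeholder repository id
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Load a recording and resample it to the 16 kHz rate the model expects.
waveform, sample_rate = torchaudio.load("dutch_sample.wav")  # placeholder path
waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(
    waveform.squeeze().numpy(),
    sampling_rate=16_000,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(inputs.input_values).logits

# batch_decode runs a beam search scored by the bundled 5-gram language model.
transcription = processor.batch_decode(logits.numpy()).text[0]
print(transcription)
```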
Documentation
Model Details
The model is a version of [facebook/wav2vec2-xls-r-2b-22-to-16](https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16) fine-tuned mainly on the nl configuration of the mozilla-foundation/common_voice_8_0 dataset. A small 5-gram language model trained on the Common Voice training corpus is added on top. The model's results on the Common Voice 8.0 evaluation set are listed in the Model Index below.
Training and Evaluation Data
- The model was initialized with [the 2B parameter model from Facebook](https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16).
- The model was then trained for 2000 iterations (batch size 32) on the dutch configuration of the multilingual_librispeech dataset (see the loading sketch after this list).
- The model was then trained for 2000 iterations (batch size 32) on [the nl configuration of the common_voice_8_0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0).
- The model was then trained for 6000 iterations (batch size 32) on [the cgn dataset](https://taalmaterialen.ivdnt.org/download/tstc-corpus-gesproken-nederlands/).
- The model was then trained for 6000 iterations (batch size 32) on [the nl configuration of the common_voice_8_0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0).
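To make the data pipeline concrete, the sketch below shows how the two publicly available corpora named above could be loaded with the datasets library. This is an assumption-laden illustration rather than the card's training script: the CGN corpus is distributed separately by the Dutch Language Institute and is not loaded here, and Common Voice 8.0 requires accepting its terms on the Hugging Face Hub.

```python
# Hedged sketch: loading the public training corpora named above with the
# datasets library. The CGN corpus is licensed separately and is not on the Hub.
from datasets import load_dataset

# Dutch configuration of Multilingual LibriSpeech (first training stage).
mls_nl = load_dataset("multilingual_librispeech", "dutch", split="train")

# Dutch (nl) configuration of Common Voice 8.0 (second and fourth stages).
# Requires accepting the dataset terms on the Hub and an authentication token.
cv_nl = load_dataset(
    "mozilla-foundation/common_voice_8_0", "nl", split="train", use_auth_token=True
)

print(mls_nl)
print(cv_nl)
```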
Framework Versions
- Transformers 4.17.0.dev0
- Pytorch 1.10.2+cu102
- Datasets 1.18.2.dev0
- Tokenizers 0.11.0
Model Index
| Property | Details |
|----------|---------|
| Model Name | xls-r-nl-v1-cv8-lm |
| Task | Automatic Speech Recognition |
| Datasets | mozilla-foundation/common_voice_8_0, multilingual_librispeech |
| Metrics on Common Voice 8 (nl) | Test WER: 6.69; Test CER: 1.97 |
| Metrics on Robust Speech Event - Dev Data (nl) | Test WER: 20.79; Test CER: 10.72 |
| Metrics on Robust Speech Event - Test Data (nl) | Test WER: 19.71; Test CER: N/A |
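The WER and CER columns above can in principle be reproduced with the evaluate library (not listed in the framework versions, so this is an assumption). The sketch below only shows the metric calls on toy strings; it is not the evaluation script that produced the numbers in the table.

```python
# Hedged sketch: computing WER and CER with the evaluate library on toy data.
# Not the script used to produce the numbers in the Model Index above.
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["dit is een voorbeeldzin"]    # ground-truth transcripts
predictions = ["dit is een voorbeeld zin"]  # model outputs

print("WER:", wer_metric.compute(references=references, predictions=predictions))
print("CER:", cer_metric.compute(references=references, predictions=predictions))
```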
Technical Details
The model processes 16 kHz audio input with a Wav2Vec2ForCTC head over a 48-character vocabulary. Decoding uses beam search, and the beams are scored with a 5-gram language model trained on the Common Voice 8 corpus.
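The beam search and language-model scoring are handled by pyctcdecode through Wav2Vec2ProcessorWithLM, and decoding parameters such as the beam width and the language-model weight can be passed to batch_decode. The sketch below uses a placeholder repository id, random dummy logits in place of a real forward pass, and illustrative parameter values, not the settings used for this card.

```python
# Hedged sketch: tuning the LM-scored beam search at decode time.
# Placeholder repository id; dummy logits stand in for a real forward pass,
# and the parameter values are illustrative, not this card's settings.
import numpy as np
from transformers import Wav2Vec2ProcessorWithLM

processor = Wav2Vec2ProcessorWithLM.from_pretrained("your-username/xls-r-nl-v1-cv8-lm")

# Random logits shaped (batch, frames, vocabulary size); real logits would
# come from Wav2Vec2ForCTC as in the Usage Examples sketch.
vocab_size = len(processor.tokenizer.get_vocab())
logits = np.random.randn(1, 200, vocab_size).astype(np.float32)

decoded = processor.batch_decode(
    logits,
    beam_width=100,  # number of beams kept during the search
    alpha=0.5,       # weight of the 5-gram language-model score
    beta=1.0,        # word-insertion bonus
)
print(decoded.text[0])
```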
License
No license information is provided in the original model card.