🚀 German Speech Recognition Pipeline
This project offers a fully trained German speech recognition pipeline. It combines an acoustic model based on the wav2vec 2.0 XLS-R 1B TEVR architecture with a 5-gram KenLM language model and achieves a word error rate of 3.64% on the CommonVoice German dataset.
🚀 Quick Start
To evaluate this pipeline on your own data, follow the steps in the HF Eval Script.ipynb Jupyter Notebook or use the provided Python script.
✨ Features
- Advanced Architecture: Utilizes the new wav2vec 2.0 XLS-R 1B TEVR architecture for the acoustic model.
- Language Model: Incorporates a 5-gram KenLM language model to enhance recognition accuracy.
- High Performance: Achieves a word error rate of 3.64% and a character error rate of 1.54% on the CommonVoice German dataset.
📚 Documentation
Overview
This folder contains a fully trained German speech recognition pipeline consisting of an acoustic model using the new wav2vec 2.0 XLS-R 1B TEVR architecture and a 5-gram KenLM language model. For an explanation of the TEVR enhancements and their motivation, please see our paper: TEVR: Improving Speech Recognition by Token Entropy Variance Reduction.
As of June 2022, this pipeline scores a very competitive word error rate of 3.64% and a character error rate of 1.54% on CommonVoice German.
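For readers unfamiliar with the two metrics, the following is a minimal sketch of how a word error rate and a character error rate can be computed for a single sentence pair using the Levenshtein edit distance. The function names `edit_distance`, `word_error_rate`, and `char_error_rate` are illustrative and not part of this repository. The evaluation script in this README relies on the `wer` and `cer` metrics from the `datasets` library instead, which aggregate over the whole test set, so this sketch is not expected to reproduce the reported numbers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, prediction):
    # Assumes a non-empty reference; errors are counted per word.
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


def char_error_rate(reference, prediction):
    # Assumes a non-empty reference; errors are counted per character.
    return edit_distance(list(reference), list(prediction)) / len(reference)


# One missing word out of four reference words.
print(word_error_rate("das ist ein test", "das ist test"))
# One substituted character out of four reference characters.
print(char_error_rate("haus", "maus"))
```

Both calls divide the number of edits by the length of the reference, which is why an error rate can exceed 100% when the prediction is much longer than the reference.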
Citation
If you use this ASR pipeline for research, please cite:
@misc{https://doi.org/10.48550/arxiv.2206.12693,
doi = {10.48550/ARXIV.2206.12693},
url = {https://arxiv.org/abs/2206.12693},
author = {Krabbenhöft, Hajo Nils and Barth, Erhardt},
keywords = {Computation and Language (cs.CL), Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, F.2.1; I.2.6; I.2.7},
title = {TEVR: Improving Speech Recognition by Token Entropy Variance Reduction},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
TEVR Tokenizer Creation / Testing
See https://huggingface.co/fxtentacle/tevr-token-entropy-predictor-de for:
- our trained ByT5 model used to calculate the entropies in the paper
- a Jupyter Notebook to generate a TEVR Tokenizer from a text corpus
- a Jupyter Notebook to generate the illustration image in the paper
Evaluation
To evaluate this pipeline yourself and/or on your own data, see the HF Eval Script.ipynb Jupyter Notebook or use the following Python script:
💻 Usage Examples
Basic Usage
!pip install --quiet --root-user-action=ignore --upgrade pip
!pip install --quiet --root-user-action=ignore "datasets>=1.18.3" "transformers==4.11.3" librosa jiwer huggingface_hub
!pip install --quiet --root-user-action=ignore https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
!pip install --quiet --root-user-action=ignore --upgrade transformers
!pip install --quiet --root-user-action=ignore torch_audiomentations audiomentations
from datasets import load_dataset, Audio, load_metric
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM
import torchaudio.transforms as T
import torch
import unicodedata
import numpy as np
import re
testing_dataset = load_dataset("common_voice", "de", split="test")
allchars = list(set([c for t in testing_dataset['sentence'] for c in list(t)]))
map_to_space = [c for c in allchars if unicodedata.category(c)[0] in 'PSZ' and c not in 'ʻ-']
replacements = ''.maketrans(''.join(map_to_space), ''.join(' ' for i in range(len(map_to_space))), '\'ʻ')
def text_fix(text):
    # Normalize a reference transcript: expand ß, split hyphenated words,
    # lowercase, strip punctuation, and map a fixed set of foreign characters to "?".
    text = text.replace('ß','ss')
    text = text.replace('-',' ').replace('  ',' ').replace('  ',' ')
    text = text.lower()
    text = text.translate(replacements).strip()
    text = re.sub("[âşěýňעảנźțãòàǔł̇æồאắîשðșęūāñë生בøúıśžçćńřğ]+","?",text)
    text = ' '.join([w for w in text.split(' ') if w != ''])
    return text
model = AutoModelForCTC.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr")
model.to('cuda')
class HajoProcessor(Wav2Vec2ProcessorWithLM):
    @staticmethod
    def get_missing_alphabet_tokens(decoder, tokenizer):
        return []
processor = HajoProcessor.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr")
def predict_single_audio(batch, image=False):
    audio = batch['audio']['array']
    # Resample to 16 kHz if the clip uses a different sampling rate
    if batch['audio']['sampling_rate'] != 16000:
        audio = T.Resample(orig_freq=batch['audio']['sampling_rate'], new_freq=16000)(torch.from_numpy(audio)).numpy()
    # Normalize the waveform to zero mean and unit variance
    audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)
    input_values = processor(audio, return_tensors="pt", sampling_rate=16_000).input_values
    with torch.no_grad():
        logits = model(input_values.to('cuda')).logits.cpu().numpy()[0]
    # Beam search decoding with the KenLM language model
    decoded = processor.decode(logits, beam_width=500)
    return { 'groundtruth': text_fix(batch['sentence']), 'prediction': decoded.text }
all_predictions = testing_dataset.map(predict_single_audio, remove_columns=testing_dataset.column_names)
print('WER', load_metric("wer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%')
print('CER', load_metric("cer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%')
Results
WER 3.6433399042523233 %
CER 1.5398893560981173 %
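The reported numbers depend on how reference transcripts are normalized before scoring. The sketch below is a simplified, self-contained variant of the `text_fix` helper from the script above, useful for checking what normalization does to a single sentence. The name `normalize_transcript` is hypothetical. Unlike `text_fix`, it classifies characters on the fly instead of building a translation table from the dataset, and it omits the step that maps a fixed set of foreign characters to "?", so its output can differ from `text_fix` on such inputs.

```python
import unicodedata


def normalize_transcript(text: str) -> str:
    """Simplified stand-in for text_fix (assumption: no foreign-character mapping)."""
    text = text.replace("ß", "ss").replace("-", " ").lower()
    cleaned = []
    for ch in text:
        if ch in "'ʻ":
            continue  # apostrophes are deleted rather than replaced by a space
        if unicodedata.category(ch)[0] in "PSZ":
            cleaned.append(" ")  # punctuation, symbols, separators become spaces
        else:
            cleaned.append(ch)
    # Collapse runs of whitespace and trim both ends
    return " ".join("".join(cleaned).split())


print(normalize_transcript("Hallo, Welt! Die Straße-Ecke."))
print(normalize_transcript("Wie geht's?"))
```

Applying the same normalization to your own reference transcripts keeps the comparison with the model output consistent, since differences in casing or punctuation would otherwise be counted as recognition errors.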
📄 License
This project is licensed under the Apache 2.0 license.
📦 Information
| Property | Details |
|---|---|
| Language | German |
| Datasets | common_voice |
| Inference | false |
| Metrics | wer, cer |
| Tags | audio, automatic-speech-recognition, speech, hf-asr-leaderboard |
| Model Name | wav2vec 2.0 XLS-R 1B + TEVR tokens + 5-gram LM by Hajo Nils Krabbenhöft |
| Task | Speech Recognition (automatic-speech-recognition) |
| Dataset Name | Common Voice de |
| Test WER | 3.6433399042523233 |
| Test CER | 1.5398893560981173 |
| License | apache-2.0 |