wav2vec2-xls-r-1b-tevr: an open-source German speech recognition model with a low word error rate

Developed by fxtentacle

This is a German speech recognition model based on the wav2vec 2.0 XLS-R 1B architecture, incorporating TEVR (Token Entropy Variance Reduction) technology and combined with a 5-gram language model. It achieves a word error rate of 3.64% on the Common Voice German test set.

Speech Recognition

Transformers

Language: German | License: Apache-2.0 | Tags: German Speech Recognition, TEVR Enhancement Technology, Ultra-low Word Error Rate

Release Time: 6/2/2022

Model Overview

This model is a German automatic speech recognition (ASR) system whose acoustic model uses TEVR tokens to improve recognition accuracy.

Model Features

TEVR Technology Enhancement

Applies Token Entropy Variance Reduction (TEVR) to the tokens the acoustic model predicts; the TEVR paper cited in the documentation below explains the method and its motivation.

High-Performance Language Model Integration

Decodes the acoustic model's output with a 5-gram KenLM language model to lower the recognition error rate.

German Language Optimization

Trained and evaluated on German speech. Note that the evaluation script normalizes text before scoring, for example by lowercasing it and replacing ß with ss.

Model Capabilities

German Speech-to-Text

High-Accuracy Speech Recognition

Real-Time Speech Processing

Use Cases

Speech Transcription

German Meeting Minutes

Automatically convert German meeting recordings into text transcripts.

Word error rate of 3.64%, measured on the Common Voice German test set

Voice Assistant

Provides high-accuracy speech recognition capabilities for German voice assistants.

Accessibility Technology

Real-Time Caption Generation

Generates real-time captions for German video content.

🚀 German Speech Recognition Pipeline

This project offers a fully trained German speech recognition pipeline. It combines an acoustic model based on the wav2vec 2.0 XLS-R 1B TEVR architecture with a 5-gram KenLM language model and achieves highly competitive performance on the CommonVoice German dataset.

🚀 Quick Start

To evaluate this pipeline on your own data, you can follow the steps in the HF Eval Script.ipynb Jupyter Notebook or use the provided Python script.

✨ Features

Advanced Architecture: Uses the new wav2vec 2.0 XLS-R 1B TEVR architecture for the acoustic model.
Language Model: Incorporates a 5-gram KenLM language model to enhance recognition accuracy.
High Performance: Achieves a word error rate of 3.64% and a character error rate of 1.54% on the CommonVoice German dataset.
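
To picture what the language model feature contributes, the sketch below combines an acoustic score with a language-model score for two candidate transcripts. It is a toy illustration only: the candidate texts, every log-probability, the function names `fused_score` and `pick_best`, and the weights `alpha` and `beta` are assumptions made up for this example, not values or code taken from the pipeline.

```python
def fused_score(acoustic_logp, lm_logp, num_words, alpha, beta):
    """Combine acoustic and language-model evidence for one candidate.

    alpha weights the language model; beta adds a per-word bonus.
    Both names and the formula are illustrative assumptions.
    """
    return acoustic_logp + alpha * lm_logp + beta * num_words


def pick_best(candidates, alpha=0.5, beta=1.5):
    """Return the candidate with the highest fused score."""
    return max(
        candidates,
        key=lambda c: fused_score(
            c["acoustic_logp"], c["lm_logp"], len(c["text"].split()), alpha, beta
        ),
    )


# Hypothetical candidates: the misspelled one sounds slightly better to the
# acoustic model, but the language model finds it far less plausible.
candidates = [
    {"text": "ich habe hunger", "acoustic_logp": -4.2, "lm_logp": -6.0},
    {"text": "ich habe hunga", "acoustic_logp": -4.0, "lm_logp": -14.0},
]

print(pick_best(candidates)["text"])             # language model enabled
print(pick_best(candidates, alpha=0.0)["text"])  # language model ignored
```

With these made-up numbers, the language model tips the decision toward the correctly spelled transcript; with `alpha=0.0` the acoustically preferred misspelling wins.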

📚 Documentation

Overview

This folder contains a fully trained German speech recognition pipeline consisting of an acoustic model using the new wav2vec 2.0 XLS-R 1B TEVR architecture and a 5-gram KenLM language model. For an explanation of the TEVR enhancements and their motivation, please see our paper: TEVR: Improving Speech Recognition by Token Entropy Variance Reduction.

This pipeline scores a very competitive (as of June 2022) word error rate of 3.64% on CommonVoice German. The character error rate was 1.54%.
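
For readers unfamiliar with the two metrics, here is a minimal sketch of the standard definitions: the word (or character) error rate is the edit distance between reference and prediction, summed over all utterances and divided by the total number of reference words (or characters). The function names and the sample sentences are assumptions for illustration; this is not the implementation behind the metrics used in the evaluation script, which may differ in details.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, predictions):
    """Word error rate over a corpus: total word edits / total reference words."""
    errors = sum(edit_distance(r.split(), p.split())
                 for r, p in zip(references, predictions))
    return errors / sum(len(r.split()) for r in references)


def cer(references, predictions):
    """Character error rate over a corpus: total char edits / total reference chars."""
    errors = sum(edit_distance(r, p) for r, p in zip(references, predictions))
    return errors / sum(len(r) for r in references)


# Made-up example: one wrong word out of six, one wrong character out of 28.
refs = ["das ist ein test", "guten morgen"]
preds = ["das ist ein text", "guten morgen"]
print(f"WER {wer(refs, preds) * 100:.2f} %")  # WER 16.67 %
print(f"CER {cer(refs, preds) * 100:.2f} %")  # CER 3.57 %
```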

Citation

If you use this ASR pipeline for research, please cite:

@misc{https://doi.org/10.48550/arxiv.2206.12693,
  doi = {10.48550/ARXIV.2206.12693},
  url = {https://arxiv.org/abs/2206.12693},
  author = {Krabbenhöft, Hajo Nils and Barth, Erhardt},  
  keywords = {Computation and Language (cs.CL), Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, F.2.1; I.2.6; I.2.7},  
  title = {TEVR: Improving Speech Recognition by Token Entropy Variance Reduction},  
  publisher = {arXiv},  
  year = {2022}, 
  copyright = {Creative Commons Attribution 4.0 International}
}

TEVR Tokenizer Creation / Testing

See https://huggingface.co/fxtentacle/tevr-token-entropy-predictor-de for:

our trained ByT5 model used to calculate the entropies in the paper
a Jupyter Notebook to generate a TEVR Tokenizer from a text corpus
a Jupyter Notebook to generate the illustration image in the paper
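
As a rough illustration of the quantity in the method's name, the sketch below compares the variance of per-token entropy for two ways of splitting one word. Everything in it is hypothetical: the per-character entropy values are made up (the paper obtains them from the trained ByT5 model), a token's entropy is assumed to be the sum of its characters' entropies, and the helper `token_entropies` is not part of the linked notebooks. The actual tokenizer construction is in the notebook listed above.

```python
import statistics


def token_entropies(char_entropies, tokens):
    """Sum per-character entropies over each token (an illustrative assumption)."""
    if sum(len(t) for t in tokens) != len(char_entropies):
        raise ValueError("tokens must cover every character exactly once")
    result, pos = [], 0
    for token in tokens:
        result.append(sum(char_entropies[pos:pos + len(token)]))
        pos += len(token)
    return result


# Made-up entropies (in bits) for the characters of "schule":
# "c" and "h" are easy to predict once "s" has been seen.
char_entropies = [3.0, 0.4, 0.2, 2.8, 2.6, 1.0]

single = token_entropies(char_entropies, ["s", "c", "h", "u", "l", "e"])
grouped = token_entropies(char_entropies, ["sch", "u", "le"])

print(f"single-character tokens: variance {statistics.pvariance(single):.2f}")
print(f"grouped tokens:          variance {statistics.pvariance(grouped):.2f}")
```

In this toy case, merging the predictable characters into multi-character tokens spreads the information more evenly across tokens, so the variance drops.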

Evaluation

To evaluate this pipeline yourself and/or on your own data, see the HF Eval Script.ipynb Jupyter Notebook or use the following Python script:

💻 Usage Examples

Basic Usage

!pip install --quiet --root-user-action=ignore --upgrade pip
!pip install --quiet --root-user-action=ignore "datasets>=1.18.3" "transformers==4.11.3" librosa jiwer huggingface_hub  
!pip install --quiet --root-user-action=ignore https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
!pip install --quiet --root-user-action=ignore --upgrade transformers
!pip install --quiet --root-user-action=ignore torch_audiomentations audiomentations

from datasets import load_dataset, Audio, load_metric
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM
import torchaudio.transforms as T
import torch
import unicodedata
import numpy as np
import re

# load testing dataset 
testing_dataset = load_dataset("common_voice", "de", split="test")

# replace invisible characters with space
allchars = list(set([c for t in testing_dataset['sentence'] for c in list(t)]))
map_to_space = [c for c in allchars if unicodedata.category(c)[0] in 'PSZ' and c not in 'ʻ-']
replacements = ''.maketrans(''.join(map_to_space), ''.join(' ' for i in range(len(map_to_space))), '\'ʻ')

def text_fix(text):
    # change ß to ss
    text = text.replace('ß','ss')
    # convert dash to space and remove double-space
    text = text.replace('-',' ').replace('  ',' ').replace('  ',' ')
    # make lowercase
    text = text.lower()
    # remap all invisible characters to space
    text = text.translate(replacements).strip()
    # for easier comparison to Zimmermeister, replace unrepresentable characters with ?
    text = re.sub("[âşěýňעảנźțãòàǔł̇æồאắîשðșęūāñë生בøúıśžçćńřğ]+","?",text)
    # remove multiple spaces (again)
    text = ' '.join([w for w in text.split(' ') if w != ''])
    return text

# load model
model = AutoModelForCTC.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr")
model.to('cuda')
# load processor
class HajoProcessor(Wav2Vec2ProcessorWithLM):
    @staticmethod
    def get_missing_alphabet_tokens(decoder, tokenizer):
        return []
processor = HajoProcessor.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr")

# this function will be called for each WAV file
def predict_single_audio(batch, image=False):    
    audio = batch['audio']['array']
    # resample, if needed
    if batch['audio']['sampling_rate'] != 16000:
        audio = T.Resample(orig_freq=batch['audio']['sampling_rate'], new_freq=16000)(torch.from_numpy(audio)).numpy()
    # normalize
    audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)
    # ask HF processor to prepare audio for GPU eval
    input_values = processor(audio, return_tensors="pt", sampling_rate=16_000).input_values
    # call model on GPU
    with torch.no_grad():
        logits = model(input_values.to('cuda')).logits.cpu().numpy()[0]
    # ask HF processor to decode logits
    decoded = processor.decode(logits, beam_width=500)
    # return as dictionary
    return { 'groundtruth': text_fix(batch['sentence']), 'prediction': decoded.text }

# process all audio files
all_predictions = testing_dataset.map(predict_single_audio, remove_columns=testing_dataset.column_names)

# print results
print('WER', load_metric("wer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%')
print('CER', load_metric("cer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%')
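
Because `text_fix` normalizes every reference sentence before scoring, the reported error rates apply to normalized text. The standalone sketch below shows the effect on sample sentences. It is a simplified stand-in, not the script's function: the name `normalize_reference` is hypothetical, it classifies characters directly instead of building a table from the dataset, and it omits the step that replaces unrepresentable characters with `?`.

```python
import unicodedata


def normalize_reference(text):
    """Simplified version of the script's normalization (an approximation)."""
    # ß becomes ss, dashes become spaces, everything is lowercased
    text = text.replace("ß", "ss").replace("-", " ").lower()
    cleaned = []
    for ch in text:
        if ch in "'ʻ":
            continue  # apostrophes are deleted rather than replaced
        # punctuation (P), symbols (S) and separators (Z) become spaces
        cleaned.append(" " if unicodedata.category(ch)[0] in "PSZ" else ch)
    # collapse runs of whitespace and trim the ends
    return " ".join("".join(cleaned).split())


print(normalize_reference("Die Straße ist nass - wirklich!"))
# die strasse ist nass wirklich
print(normalize_reference("Wie geht's?"))
# wie gehts
```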

Results

WER 3.6433399042523233 %
CER 1.5398893560981173 %

📄 License

This project is licensed under the Apache 2.0 license.

📦 Information

Property	Details
Language	German
Datasets	common_voice
Inference	false
Metrics	wer, cer
Tags	audio, automatic-speech-recognition, speech, hf-asr-leaderboard
Model Name	wav2vec 2.0 XLS-R 1B + TEVR tokens + 5-gram LM by Hajo Nils Krabbenhöft
Task	Speech Recognition (automatic-speech-recognition)
Dataset Name	Common Voice de
Test WER	3.6433399042523233
Test CER	1.5398893560981173
License	apache-2.0
