Open-source model wav2vec2-large-xlsr-53-german-with-lm - A powerful assistant for efficient and accurate German speech recognition

Wav2Vec2 Large XLSR 53 German with LM

Developed by aware-ai

This is a German automatic speech recognition model based on the XLSR Wav2Vec2 architecture with language model support, excelling on the Common Voice German dataset.

Speech Recognition

Transformers

German

Open Source License: Apache-2.0

#German Speech Recognition #Low Word Error Rate #XLSR Fine-tuning

Downloads 19

Release Time: 3/2/2022

Model Overview

This model is designed for German speech recognition. It pairs an acoustic model with a language model to convert German speech into text.

Model Features

Low Word Error Rate

Achieves 5.75% WER and 1.90% CER on the Common Voice German test set.
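For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and CER are conventionally defined: the edit distance between reference and hypothesis, divided by the reference length in words or characters. It is an illustration in plain Python, not the implementation behind the reported numbers. The example sentences are made up, and counting spaces as characters in CER is an assumption of this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length
    (spaces count as characters in this sketch)."""
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up example: one wrong word out of four, one wrong character out of sixteen
reference = "DAS IST EIN TEST"
hypothesis = "DAS IST EIN FEST"
print(wer(reference, hypothesis))  # 0.25
print(cer(reference, hypothesis))  # 0.0625
```

A single substituted letter costs a full word in WER but only one character in CER, which is why CER is usually the smaller of the two figures.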

Language Model Integration

Incorporates a KenLM n-gram language model to improve recognition accuracy.
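To give an intuition for why a language model can help, the toy sketch below rescores two candidate transcripts by adding a weighted language-model score to the acoustic score. This is not the model's actual decoder. The candidates, all log-probabilities, the bigram table, and the weights `alpha` and `beta` are hypothetical values chosen for illustration only.

```python
# Hypothetical candidates: (transcript, acoustic log-probability). Values are invented.
candidates = [
    ("ICH HABE EIN HAUS", -4.1),
    ("ICH HABE EIN MAUS", -3.9),  # acoustically slightly better, but ungrammatical
]

# Toy stand-in for an n-gram language model: invented bigram log-probabilities
BIGRAM_LOGP = {("EIN", "HAUS"): -1.0, ("EIN", "MAUS"): -6.0}
UNSEEN_LOGP = -3.0  # fallback score for bigrams missing from the table


def lm_score(text):
    """Sum of bigram log-probabilities over the transcript."""
    words = text.split()
    return sum(BIGRAM_LOGP.get(pair, UNSEEN_LOGP) for pair in zip(words, words[1:]))


def fused_score(text, acoustic_logp, alpha=0.5, beta=1.0):
    """Acoustic score plus weighted LM score plus a per-word bonus.

    alpha (LM weight) and beta (word bonus) are hypothetical settings.
    """
    return acoustic_logp + alpha * lm_score(text) + beta * len(text.split())


best_acoustic = max(candidates, key=lambda c: c[1])[0]
best_fused = max(candidates, key=lambda c: fused_score(*c))[0]
print(best_acoustic)  # ICH HABE EIN MAUS
print(best_fused)     # ICH HABE EIN HAUS
```

The acoustic scores alone favor the ungrammatical candidate. Once the language-model term is added, the grammatical one wins.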

Based on XLSR Architecture

Built on XLSR-53 Wav2Vec2, a model pretrained with large-scale self-supervised learning on speech from 53 languages.

Model Capabilities

German Speech Recognition

Speech-to-Text

High-Accuracy Audio Transcription

Use Cases

Speech Transcription

German Speech Transcription

Convert German speech content into text format

Reported 5.75% WER on the Common Voice German test set

Voice Assistants

German Voice Command Recognition

Used as a speech recognition component for German voice assistants or control systems

🚀 XLSR Wav2Vec2 German with LM

This model is designed for automatic speech recognition in German, leveraging the XLSR Wav2Vec2 architecture with a language model.

📦 Model Information

Property	Details
Model Type	XLSR Wav2Vec2 German with LM
Training Data	Common Voice
Metrics	WER, CER
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week, hf-asr-leaderboard
License	Apache-2.0

📊 Test Result

Model	WER	CER
flozi00/wav2vec2-large-xlsr-53-german-with-lm	5.7467896819046755%	1.8980142607670552%

📚 Documentation

Evaluation

The model can be evaluated as follows on the German test data of Common Voice.

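import re

import torch
import torchaudio.functional as F
# load_metric has been removed from recent releases of datasets; there, the
# equivalent is evaluate.load("wer") / evaluate.load("cer") from the evaluate package
from datasets import load_dataset, load_metric
# Decoding with the language model additionally requires pyctcdecode and kenlm to be installed
from transformers import AutoModelForCTC, AutoProcessor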

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

counter = 0
wer_counter = 0
cer_counter = 0

def main():
    model = AutoModelForCTC.from_pretrained("flozi00/wav2vec2-large-xlsr-53-german-with-lm")
    processor = AutoProcessor.from_pretrained("flozi00/wav2vec2-large-xlsr-53-german-with-lm")

    wer = load_metric("wer")
    cer = load_metric("cer")

    ds = load_dataset("common_voice", "de", split="test")
    #ds = ds.select(range(100))

    def calculate_metrics(batch):
        global counter, wer_counter, cer_counter
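        # The audio is assumed to be sampled at 48 kHz; the model expects 16 kHz input
        resampled_audio = F.resample(torch.tensor(batch["audio"]["array"]), 48_000, 16_000).numpy()

        input_values = processor(resampled_audio, return_tensors="pt", sampling_rate=16_000).input_values

        with torch.no_grad():
            logits = model(input_values).logits.numpy()[0]

        # Decode the logits of this single utterance with the language-model-backed decoder
        decoded = processor.decode(logits)
        pred = decoded.text

        # Strip punctuation from the reference and upper-case it before scoring
        ref = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()

        wer_result = wer.compute(predictions=[pred], references=[ref])
        cer_result = cer.compute(predictions=[pred], references=[ref])

        # Accumulate per-utterance scores; the running mean is printed below
        counter += 1
        wer_counter += wer_result
        cer_counter += cer_result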

        print(f"WER: {(wer_counter/counter)*100} | CER: {(cer_counter/counter)*100}")

        return batch


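    # map() is used here only to iterate over the test set; its result is discarded
    ds.map(calculate_metrics, remove_columns=ds.column_names)


if __name__ == "__main__":
    main()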
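Note that the script above prints a running mean of per-utterance scores. That figure can differ from a corpus-level score, where all word errors are summed and divided by the total number of reference words. The sketch below shows the difference on two made-up utterances. It uses a plain-Python edit distance and is only an illustration of the two aggregation choices, not a statement about how any leaderboard computes its numbers.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


# Made-up (reference, hypothesis) pairs: a short wrong one and a long correct one
pairs = [
    ("JA", "NEIN"),
    ("DAS IST EIN LANGER SATZ OHNE FEHLER", "DAS IST EIN LANGER SATZ OHNE FEHLER"),
]

# (errors, reference word count) per utterance
stats = [(word_errors(ref, hyp), len(ref.split())) for ref, hyp in pairs]

# Mean of per-utterance WER, the aggregation used by the script above
mean_wer = sum(errors / n_words for errors, n_words in stats) / len(stats)

# Corpus-level WER: total errors over total reference words
corpus_wer = sum(errors for errors, _ in stats) / sum(n_words for _, n_words in stats)

print(mean_wer)    # 0.5
print(corpus_wer)  # 0.125
```

Short utterances weigh as much as long ones in the per-utterance mean, so the two numbers are only comparable when the aggregation method is stated.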

🙌 Credits

The acoustic model is a copy of jonatasgrosman's model. I used it to train a matching KenLM language model.
