wav2vec2-large-xlsr-53-german Open-source German Speech Recognition Model - Precise Speech Recognition with Low Error Rates

Wav2vec2 Large Xlsr 53 German

Developed by Noricum

German speech recognition model fine-tuned from the pretrained wav2vec-large-xlsr-53 model, achieving an 11.26% word error rate (WER) on the CommonVoice German test set

Speech Recognition #German speech recognition #Low Word Error Rate (WER) #CommonVoice fine-tuning

Downloads 33

Release Time : 3/2/2022

Model Overview

This model is designed for German speech transcription: it converts German speech into text and suits applications that require automatic speech recognition.

Model Features

High Accuracy

Achieves a low word error rate (WER) of 11.26% on the CommonVoice German test set

Based on Large-Scale Pretrained Model

Fine-tuned from the pretrained wav2vec-large-xlsr-53 model, inheriting its speech feature extraction capabilities

Optimized for German

Fine-tuned on German CommonVoice recordings to capture German speech characteristics

Model Capabilities

German speech recognition

Speech-to-Text

Automatic speech transcription

Use Cases

Speech Transcription Services

Automated Meeting Minutes

Automatically convert German meeting recordings into text transcripts

Reported word error rate of 11.26% on the CommonVoice German test set; no figure is reported for meeting recordings

Voice Assistants

Provide speech recognition capabilities for German voice assistants

Accessibility Services

Real-time Caption Generation

Generate real-time captions for German video content

🚀 Wav2vec2 German Model

This model is fine-tuned on wav2vec-large-xlsr-53 using the German CommonVoice dataset, enabling accurate German speech transcription.

🚀 Quick Start

This model is fine-tuned on the wav2vec-large-xlsr-53 model with the German CommonVoice dataset. It achieves a word error rate (WER) of 11.26% on the full test dataset. Training was based mainly on the code provided by Max Idahl, with minor adjustments to data preprocessing and training parameters.

✨ Features

High-accuracy Transcription: Achieves a WER of 11.26% on the full test dataset.
Customizable Usage: Allows users to transcribe their own audio files with specific input requirements.

📦 Installation

To run the basic example, install the libraries below. The evaluation script under Advanced Usage additionally imports datasets, torchaudio, and jiwer.

!pip3 install transformers torch soundfile

💻 Usage Examples

Basic Usage

You can use the following code to transcribe your own audio files. The input must be a *.wav file, sampled at 16 kHz and single-channel. To convert an audio file with ffmpeg, run: "ffmpeg -i input.wav -ar 16000 -ac 1 output.wav". Transcription is memory-intensive (around 10 GB per 10 seconds of audio). If the script ends with "Killed", the Python interpreter ran out of memory; in that case, try a shorter audio file.

# !pip3 install transformers torch soundfile
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the pretrained processor (feature extractor + tokenizer) and model
processor = Wav2Vec2Processor.from_pretrained("Noricum/wav2vec2-large-xlsr-53-german")
model = Wav2Vec2ForCTC.from_pretrained("Noricum/wav2vec2-large-xlsr-53-german")

# Load the audio; the file must be 16 kHz and single-channel
audio_input, sample_rate = sf.read("/path/to/your/audio.wav")
assert sample_rate == 16_000, "resample the file to 16 kHz first"

# Transcribe; no_grad() skips gradient bookkeeping and lowers memory use
input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
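
Because memory use grows with audio length, one way to handle longer recordings is to split the samples into fixed-length pieces and transcribe each piece separately. The following is a minimal sketch of that idea only; the helper name split_into_chunks and the 10-second default are assumptions for illustration, not part of the model's own code.

```python
from typing import Iterator, Sequence

SAMPLE_RATE = 16_000  # the model expects 16 kHz single-channel input


def split_into_chunks(samples: Sequence[float], chunk_seconds: float = 10.0,
                      sample_rate: int = SAMPLE_RATE) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-length slices of a mono sample sequence.

    Hypothetical helper: works on lists and NumPy arrays alike, since it
    only relies on len() and slicing.
    """
    chunk_len = int(chunk_seconds * sample_rate)
    if chunk_len <= 0:
        raise ValueError("chunk_seconds must cover at least one sample")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


# 25 seconds of silence stands in for real audio returned by sf.read(...)
audio = [0.0] * (25 * SAMPLE_RATE)
lengths = [len(chunk) / SAMPLE_RATE for chunk in split_into_chunks(audio)]
print(lengths)  # [10.0, 10.0, 5.0]
```

Each chunk could then be passed through the processor and model in turn and the resulting strings joined. Cutting at fixed boundaries can split a word in two, so transcripts near chunk edges may be less reliable than those from a single pass.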

Advanced Usage

To evaluate the model on the full CommonVoice test dataset, run the following script:

import re
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "de", split="test") # use "test[:1%]" for 1% sample

processor = Wav2Vec2Processor.from_pretrained("Noricum/wav2vec2-large-xlsr-53-german")
model = Wav2Vec2ForCTC.from_pretrained("Noricum/wav2vec2-large-xlsr-53-german")
model.to("cuda")

chars_to_ignore_regex = r'[,?.!\-;:"“]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: strip punctuation from the transcripts, lowercase them,
# and read the audio files as arrays resampled from 48 kHz to 16 kHz.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and decode the predicted token IDs to text.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=4) # batch_size=8 -> requires ~14.5GB GPU memory

# Chunked version, see https://discuss.huggingface.co/t/spanish-asr-fine-tuning-wav2vec2/4586/5:
import jiwer

def chunked_wer(targets, predictions, chunk_size=None):
    if chunk_size is None: return jiwer.wer(targets, predictions)
    start = 0
    end = chunk_size
    H, S, D, I = 0, 0, 0, 0
    while start < len(targets):
        chunk_metrics = jiwer.compute_measures(targets[start:end], predictions[start:end])
        H = H + chunk_metrics["hits"]
        S = S + chunk_metrics["substitutions"]
        D = D + chunk_metrics["deletions"]
        I = I + chunk_metrics["insertions"]
        start += chunk_size
        end += chunk_size
    return float(S + D + I) / float(H + S + D)

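# Note: chunked_wer is declared as (targets, predictions), but this call passes the
# predictions first. Passing result["sentence"] first gives the conventional
# reference-based WER, which may differ slightly from the figure shown below.
print("Total (chunk_size=1000), WER: {:2f}".format(100 * chunked_wer(result["pred_strings"], result["sentence"], chunk_size=1000)))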

Output: Total (chunk_size=1000), WER: 11.256522
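
To make the last line of chunked_wer concrete, the sketch below computes the same ratio, (S + D + I) / (H + S + D), for a single sentence pair using a plain word-level edit distance. It is an illustration only: the function names are hypothetical, the example sentences are made up, and jiwer's own alignment may break ties differently on other inputs.

```python
def word_error_counts(reference: str, hypothesis: str) -> dict:
    """Count hits, substitutions, deletions and insertions for one sentence pair."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (cost, hits, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, len(ref) + 1):  # only deletions when the hypothesis is empty
        c, h, s, d, n = table[i - 1][0]
        table[i][0] = (c + 1, h, s, d + 1, n)
    for j in range(1, len(hyp) + 1):  # only insertions when the reference is empty
        c, h, s, d, n = table[0][j - 1]
        table[0][j] = (c + 1, h, s, d, n + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            c, h, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (c, h + 1, s, d, n)      # hit
            else:
                diagonal = (c + 1, h, s + 1, d, n)  # substitution
            c, h, s, d, n = table[i - 1][j]
            deletion = (c + 1, h, s, d + 1, n)
            c, h, s, d, n = table[i][j - 1]
            insertion = (c + 1, h, s, d, n + 1)
            table[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])
    _, h, s, d, n = table[-1][-1]
    return {"hits": h, "substitutions": s, "deletions": d, "insertions": n}


def wer_from_counts(counts: dict) -> float:
    """WER = (S + D + I) / (H + S + D); the denominator is the reference word count."""
    errors = counts["substitutions"] + counts["deletions"] + counts["insertions"]
    reference_words = counts["hits"] + counts["substitutions"] + counts["deletions"]
    return errors / reference_words


counts = word_error_counts("das ist ein kleiner test", "das ist kein test")
print(counts)                   # 3 hits, 1 substitution, 1 deletion, 0 insertions
print(wer_from_counts(counts))  # 0.4
```

Summing the four counts over all chunks before dividing once, as chunked_wer does, weights every reference word equally. Averaging per-chunk WER values instead would let short chunks count as much as long ones.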

⚠️ Important Note

Your input file must be a *.wav file, sampled at 16 kHz and single-channel. Transcription is very memory-intensive (around 10 GB per 10 seconds of audio). If the script ends with "Killed", the Python interpreter ran out of memory; in that case, try a shorter audio file.

💡 Usage Tip

To convert an audio file using ffmpeg, use: "ffmpeg -i input.wav -ar 16000 -ac 1 output.wav".
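
Before transcribing, it can help to confirm that a converted file really is 16 kHz and single-channel. The sketch below reads only the WAV header with Python's standard-library wave module. The function name check_wav_format is an assumption for illustration, and the wave module handles plain PCM files, so other WAV encodings may raise an error.

```python
import wave


def check_wav_format(path: str, expected_rate: int = 16_000,
                     expected_channels: int = 1) -> list:
    """Return a list of problems found in the WAV header (empty if none).

    Hypothetical helper: inspects only the header, not the audio content.
    """
    problems = []
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"file has {channels} channel(s), expected {expected_channels}")
    return problems


# Example call (the path is a placeholder):
# for problem in check_wav_format("/path/to/your/audio.wav"):
#     print("Convert with ffmpeg first:", problem)
```

An empty list means the header matches the stated requirements; any entry suggests running the ffmpeg command above before transcription.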
