wav2vec2-xls-r-1b-german: An Open-Source German Speech Recognition Model - Accurately Identify German Speech Content

Wav2vec2 Xls R 1b German

Developed by jonatasgrosman

This is a German automatic speech recognition model based on the XLS-R 1B architecture, fine-tuned on several German speech datasets, including Common Voice 8.0.

Speech Recognition

Transformers

German

Open Source License: Apache-2.0

#German Speech Recognition #High-precision WER #Multi-dataset Training


Release Time: 3/2/2022

Model Overview

This model is optimized for German speech recognition: it converts German speech to text and expects audio input sampled at 16kHz.

Model Features

High-performance German Recognition

Achieves 10.95% WER and 2.72% CER on the Common Voice 8.0 test set

Language Model Enhancement

With language model integration, WER can be reduced to 8.13% and CER to 2.18%

Multi-dataset Training

Trained using multiple datasets including Common Voice 8.0, Multilingual TEDx, Multilingual LibriSpeech, and Voxpopuli
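To put the language-model gain in perspective, the absolute figures quoted above can be turned into relative reductions. The short sketch below is plain arithmetic on the numbers from this page; the helper name `relative_reduction` is an illustrative choice, not part of any library.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction from baseline to improved, in percent."""
    return (baseline - improved) / baseline * 100.0


# Common Voice 8.0 test-set figures quoted above (all in percent).
wer_gain = relative_reduction(10.95, 8.13)
cer_gain = relative_reduction(2.72, 2.18)

print(f"Relative WER reduction with LM: {wer_gain:.1f}%")
print(f"Relative CER reduction with LM: {cer_gain:.1f}%")
```

On these figures, the language model removes roughly a quarter of the word errors and about a fifth of the character errors.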

Model Capabilities

German Speech Recognition

Automatic Speech-to-Text

Supports 16kHz Sampling Rate Audio Processing

Use Cases

Speech Transcription

German Speech Transcription

Convert German speech content into text format

Achieves 10.95% WER on the Common Voice 8.0 test set, or 8.13% with a language model

Voice Assistants

German Voice Command Recognition

Used for voice command recognition in German voice assistants or smart home devices

🚀 Fine-tuned XLS-R 1B model for speech recognition in German

This is a fine-tuned version of facebook/wav2vec2-xls-r-1b for German speech recognition, trained on multiple datasets. It expects speech input sampled at 16kHz.

🚀 Quick Start

This model is a fine-tuned version of facebook/wav2vec2-xls-r-1b on German. It was trained on the train and validation splits of Common Voice 8.0, Multilingual TEDx, Multilingual LibriSpeech, and Voxpopuli. When using this model, make sure that your speech input is sampled at 16kHz.
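Because the model expects 16kHz input, it can help to check a file's sampling rate before transcribing it. The following is a minimal sketch that uses only Python's standard `wave` module, so it handles uncompressed PCM WAV files only; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of HuggingSound or Transformers.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model card asks for, in Hz


def wav_sample_rate(path: str) -> int:
    """Read the sampling rate (Hz) from the header of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, expected_rate: int = EXPECTED_RATE) -> bool:
    """Return True when the file's sampling rate differs from the expected one."""
    return wav_sample_rate(path) != expected_rate
```

A file that fails this check can be resampled while loading, for example with `librosa.load(path, sr=16_000)`, which is what the inference script further down does.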

This model was fine-tuned with the HuggingSound tool, thanks to GPU credits generously provided by OVHcloud.

✨ Features

Language Support: Specialized for German speech recognition.
Data Sources: Trained on multiple high-quality datasets, including Common Voice 8.0, Multilingual TEDx, Multilingual LibriSpeech, and Voxpopuli.
Fine-tuning Tool: Fine-tuned using the HuggingSound tool.

📦 Installation

The original model card gives no installation steps. The usage examples below rely on the huggingsound, torch, librosa, datasets, and transformers Python packages.

💻 Usage Examples

Basic Usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-xls-r-1b-german")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Advanced Usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "de"
MODEL_ID = "jonatasgrosman/wav2vec2-xls-r-1b-german"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)
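The last two lines of the script pick the most likely token per audio frame and then let the processor turn those IDs into text. For a CTC model this decoding step collapses consecutive repeats and drops the blank token. The sketch below illustrates that idea in pure Python with a toy vocabulary: the IDs, the `VOCAB` mapping, and the function name `ctc_greedy_decode` are assumptions made for illustration, while the real mapping comes from the model's tokenizer.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one ships with the model's tokenizer.
BLANK_ID = 0  # assumed here to be the padding token, used as the CTC blank
VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "A", 4: "L", 5: "O"}


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join into text."""
    # Step 1: merge runs of identical consecutive IDs into a single ID.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove blank tokens, which only separate repeated characters.
    tokens = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    # Step 3: turn the word delimiter into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# One ID per audio frame, as produced by an argmax over the logits.
frame_ids = [2, 2, 0, 3, 4, 0, 4, 4, 5, 0, 1, 0]
print(ctc_greedy_decode(frame_ids))  # HALLO
```

The blank between the two `L` runs is what keeps the double letter: without it, the repeated frames would collapse into a single `L`.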

📚 Documentation

Evaluation Commands

To evaluate on mozilla-foundation/common_voice_8_0 with the test split:

python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-german --dataset mozilla-foundation/common_voice_8_0 --config de --split test

To evaluate on speech-recognition-community-v2/dev_data:

python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-german --dataset speech-recognition-community-v2/dev_data --config de --split validation --chunk_length_s 5.0 --stride_length_s 1.0
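The second command passes `--chunk_length_s 5.0 --stride_length_s 1.0`, which suggests that long recordings are split into overlapping 5-second windows with 1 second of context on each side. The sketch below shows one simplified way such windows can be laid out. It assumes equal left and right strides and is not the code of `eval.py` itself; the function name `chunk_boundaries` is hypothetical.

```python
def chunk_boundaries(total_samples, sampling_rate=16_000,
                     chunk_length_s=5.0, stride_length_s=1.0):
    """Return (start, end, left_stride, right_stride) tuples in samples.

    The stride regions are overlap context: predictions made there would be
    discarded when the per-chunk outputs are stitched back together.
    """
    chunk_len = int(round(chunk_length_s * sampling_rate))
    stride = int(round(stride_length_s * sampling_rate))
    step = chunk_len - 2 * stride  # how far each new chunk advances
    if step <= 0:
        raise ValueError("chunk length must exceed twice the stride length")

    chunks = []
    for start in range(0, total_samples, step):
        end = start + chunk_len
        is_last = end >= total_samples
        left = 0 if start == 0 else stride   # no left context on the first chunk
        right = 0 if is_last else stride     # no right context on the last chunk
        chunks.append((start, min(end, total_samples), left, right))
        if is_last:
            break
    return chunks


# A 12-second clip at 16kHz yields four overlapping chunks.
for chunk in chunk_boundaries(12 * 16_000):
    print(chunk)
```

With these settings each chunk advances by 3 seconds, so neighbouring chunks share 2 seconds of audio.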

Model Information

Property	Details
Model Type	Fine-tuned XLS-R 1B model for German speech recognition
Training Data	Common Voice 8.0, Multilingual TEDx, Multilingual LibriSpeech, Voxpopuli

Results

Task	Dataset	Metric	Value (%)
Automatic Speech Recognition	Common Voice 8	Test WER	10.95
Automatic Speech Recognition	Common Voice 8	Test CER	2.72
Automatic Speech Recognition	Common Voice 8	Test WER (+LM)	8.13
Automatic Speech Recognition	Common Voice 8	Test CER (+LM)	2.18
Automatic Speech Recognition	Robust Speech Event - Dev Data	Dev WER	22.68
Automatic Speech Recognition	Robust Speech Event - Dev Data	Dev CER	9.17
Automatic Speech Recognition	Robust Speech Event - Dev Data	Dev WER (+LM)	17.07
Automatic Speech Recognition	Robust Speech Event - Dev Data	Dev CER (+LM)	8.45
Automatic Speech Recognition	Robust Speech Event - Test Data	Test WER	19.67
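The WER and CER values in the table above are edit-distance metrics: the number of word (or character) substitutions, insertions, and deletions needed to turn the prediction into the reference, divided by the reference length. The sketch below is a minimal pure-Python illustration of that definition, not the evaluation code behind the reported numbers. It aggregates errors over the whole corpus and counts spaces as characters for CER, both of which are assumptions; the sentences are made up for the example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level error rate in percent; unit is 'word' (WER) or 'char' (CER)."""
    split = str.split if unit == "word" else list
    errors = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units, hyp_units = split(ref), split(hyp)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    if total == 0:
        raise ValueError("references contain no units to score")
    return 100.0 * errors / total


references = ["ICH GEHE HEUTE NACH HAUSE"]
hypotheses = ["ICH GEHE HEUTE NACH HAUS"]
print(f"WER: {error_rate(references, hypotheses, 'word'):.2f}%")  # WER: 20.00%
print(f"CER: {error_rate(references, hypotheses, 'char'):.2f}%")  # CER: 4.00%
```

A single missing letter costs one full word error but only one character error, which is why CER is always much lower than WER in the table.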

📄 License

This model is licensed under the Apache-2.0 license.

📚 Citation

If you want to cite this model, you can use the following:

@misc{grosman2021xlsr-1b-german,
  title={Fine-tuned {XLS-R} 1{B} model for speech recognition in {G}erman},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-german}},
  year={2022}
}
